On Wed, 15 Nov 2023 00:09:17 GMT, Jonathan Gibbons <j...@openjdk.org> wrote:

>> test/langtools/tools/javac/doctree/FirstSentenceTest.java line 423:
>> 
>>> 421: DocComment[DOC_COMMENT, pos:0
>>> 422:   firstSentence: 1
>>> 423:     RawText[MARKDOWN, pos:0, abc.|def.]
>> 
>> It took me a while to understand why this test expects the first sentence to 
>> include the second line:
>> 
>>     ///abc.
>>     ///def.
>>     void simpleMarkdown() { }
>> 
>> It's not because it's Markdown or the new `///` comment syntax. It's because 
>> the breakiterator thinks that `abc.\ndef.` is one sentence.
>> 
>> Now, in this same file, on [line 
>> 123](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/tools/javac/doctree/FirstSentenceTest.java#L119-L142),
>>  there's this test:
>> 
>>     /**
>>      * abc def ghi.
>>      * jkl mno pqr
>>      */
>>     void dot_newline() { }
>> 
>> If you compare its expectation to that of simpleMarkdown(), you'll see that 
>> the first sentence consists of the first line only:
>> 
>>     BREAK_ITERATOR
>>     DocComment[DOC_COMMENT, pos:1
>>       firstSentence: 1
>>         Text[TEXT, pos:1, abc_def_ghi.]
>>       body: 1
>>         Text[TEXT, pos:15, jkl_mno_pqr]
>>       block tags: empty
>>     ]
>> 
>> So, why does it seem to work differently for `///` and `/** */` comments? It 
>> turns out that a whitespace character immediately after newline is important 
>> for breakiterator to do its thing:
>> ```    
>> ​    abc def ghi.\n jkl mno pqr
>> ​                  ^
>> 
>> The problem with the simpleMarkdown test is that we cannot just indent the 
>> comment. That indentation will be deemed incidental whitespace and, thus, 
>> will removed. For example, suppose we do this:
>> 
>>     /// abc.
>>     /// def.
>>     void simpleMarkdown() { }
>> 
>> The breakiterator will still see `abc.\ndef.`. Of course, we could do this:
>> 
>>     ///abc.
>>     /// def.
>>     void simpleMarkdown() { }
>> 
>> But it would be a different comment.
>> 
>> There's another test in this PR, 
>> [TestMarkdown.testFirstSentence](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/jdk/javadoc/doclet/testMarkdown/TestMarkdown.java#L68-L83).
>>  There, the first sentence consists only of the first line. This is likely 
>> because the second line starts with a capital letter.
>
> Good and interesting catch.
> 
> As a general matter, there are two scenarios to consider: using a 
> system-provided `BreakIterator` and using the fallback "simple iterator" in 
> DocTreeMaker.getSentenceBreak.  
> 
> The latter is just a period followed by a single whitespace, or a paragraph 
> break if there is no period.
> 
> The system-provided `BreakIterator` is locale specific. I cannot easily find 
> a public spec of the sentence breaker for the US locale, but experiments 
> indicate it wants one of 
> 
> * period, single whitespace, uppercase letter
> * period, multiple whitespace
> 
> Arguably, the test is lazy/broken for using all lower case, which is an 
> atypical form for sentences and use of a sentence breaker. I'll modify at 
> least the new parts of the test (using Markdown comments.). We can separately 
> decide whether to likewise modify the original parts (using traditional 
> comments) or whether to do that as a separate `noreg-self` bug fix for this 
> test.

After modifying the test to capitalize the pseudo-sentences, the test again 
passes, and the output with the break iterator is now the same as the default 
simple iterator. Yay.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16388#discussion_r1393474011

Reply via email to