On Wed, 15 Nov 2023 00:09:17 GMT, Jonathan Gibbons <[email protected]> wrote:
>> test/langtools/tools/javac/doctree/FirstSentenceTest.java line 423:
>>
>>> 421: DocComment[DOC_COMMENT, pos:0
>>> 422: firstSentence: 1
>>> 423: RawText[MARKDOWN, pos:0, abc.|def.]
>>
>> It took me a while to understand why this test expects the first sentence to
>> include the second line:
>>
>> ///abc.
>> ///def.
>> void simpleMarkdown() { }
>>
>> It's not because it's Markdown or the new `///` comment syntax. It's because
>> the breakiterator thinks that `abc.\ndef.` is one sentence.
>>
>> Now, in this same file, on [line
>> 123](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/tools/javac/doctree/FirstSentenceTest.java#L119-L142),
>> there's this test:
>>
>> /**
>> * abc def ghi.
>> * jkl mno pqr
>> */
>> void dot_newline() { }
>>
>> If you compare its expectation to that of simpleMarkdown(), you'll see that
>> the first sentence consists of the first line only:
>>
>> BREAK_ITERATOR
>> DocComment[DOC_COMMENT, pos:1
>> firstSentence: 1
>> Text[TEXT, pos:1, abc_def_ghi.]
>> body: 1
>> Text[TEXT, pos:15, jkl_mno_pqr]
>> block tags: empty
>> ]
>>
>> So, why does it seem to work differently for `///` and `/** */` comments? It
>> turns out that a whitespace character immediately after newline is important
>> for breakiterator to do its thing:
>> ```
>> abc def ghi.\n jkl mno pqr
>> ^
>>
>> The problem with the simpleMarkdown test is that we cannot just indent the
>> comment. That indentation will be deemed incidental whitespace and, thus,
>> will removed. For example, suppose we do this:
>>
>> /// abc.
>> /// def.
>> void simpleMarkdown() { }
>>
>> The breakiterator will still see `abc.\ndef.`. Of course, we could do this:
>>
>> ///abc.
>> /// def.
>> void simpleMarkdown() { }
>>
>> But it would be a different comment.
>>
>> There's another test in this PR,
>> [TestMarkdown.testFirstSentence](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/jdk/javadoc/doclet/testMarkdown/TestMarkdown.java#L68-L83).
>> There, the first sentence consists only of the first line. This is likely
>> because the second line starts with a capital letter.
>
> Good and interesting catch.
>
> As a general matter, there are two scenarios to consider: using a
> system-provided `BreakIterator` and using the fallback "simple iterator" in
> DocTreeMaker.getSentenceBreak.
>
> The latter is just a period followed by a single whitespace, or a paragraph
> break if there is no period.
>
> The system-provided `BreakIterator` is locale specific. I cannot easily find
> a public spec of the sentence breaker for the US locale, but experiments
> indicate it wants one of
>
> * period, single whitespace, uppercase letter
> * period, multiple whitespace
>
> Arguably, the test is lazy/broken for using all lower case, which is an
> atypical form for sentences and use of a sentence breaker. I'll modify at
> least the new parts of the test (using Markdown comments.). We can separately
> decide whether to likewise modify the original parts (using traditional
> comments) or whether to do that as a separate `noreg-self` bug fix for this
> test.
After modifying the test to capitalize the pseudo-sentences, the test again
passes, and the output with the break iterator is now the same as the default
simple iterator. Yay.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16388#discussion_r1393474011