On Wed, 15 Nov 2023 00:09:17 GMT, Jonathan Gibbons <j...@openjdk.org> wrote:
>> test/langtools/tools/javac/doctree/FirstSentenceTest.java line 423: >> >>> 421: DocComment[DOC_COMMENT, pos:0 >>> 422: firstSentence: 1 >>> 423: RawText[MARKDOWN, pos:0, abc.|def.] >> >> It took me a while to understand why this test expects the first sentence to >> include the second line: >> >> ///abc. >> ///def. >> void simpleMarkdown() { } >> >> It's not because it's Markdown or the new `///` comment syntax. It's because >> the breakiterator thinks that `abc.\ndef.` is one sentence. >> >> Now, in this same file, on [line >> 123](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/tools/javac/doctree/FirstSentenceTest.java#L119-L142), >> there's this test: >> >> /** >> * abc def ghi. >> * jkl mno pqr >> */ >> void dot_newline() { } >> >> If you compare its expectation to that of simpleMarkdown(), you'll see that >> the first sentence consists of the first line only: >> >> BREAK_ITERATOR >> DocComment[DOC_COMMENT, pos:1 >> firstSentence: 1 >> Text[TEXT, pos:1, abc_def_ghi.] >> body: 1 >> Text[TEXT, pos:15, jkl_mno_pqr] >> block tags: empty >> ] >> >> So, why does it seem to work differently for `///` and `/** */` comments? It >> turns out that a whitespace character immediately after newline is important >> for breakiterator to do its thing: >> ``` >> abc def ghi.\n jkl mno pqr >> ^ >> >> The problem with the simpleMarkdown test is that we cannot just indent the >> comment. That indentation will be deemed incidental whitespace and, thus, >> will removed. For example, suppose we do this: >> >> /// abc. >> /// def. >> void simpleMarkdown() { } >> >> The breakiterator will still see `abc.\ndef.`. Of course, we could do this: >> >> ///abc. >> /// def. >> void simpleMarkdown() { } >> >> But it would be a different comment. >> >> There's another test in this PR, >> [TestMarkdown.testFirstSentence](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/jdk/javadoc/doclet/testMarkdown/TestMarkdown.java#L68-L83). >> There, the first sentence consists only of the first line. This is likely >> because the second line starts with a capital letter. > > Good and interesting catch. > > As a general matter, there are two scenarios to consider: using a > system-provided `BreakIterator` and using the fallback "simple iterator" in > DocTreeMaker.getSentenceBreak. > > The latter is just a period followed by a single whitespace, or a paragraph > break if there is no period. > > The system-provided `BreakIterator` is locale specific. I cannot easily find > a public spec of the sentence breaker for the US locale, but experiments > indicate it wants one of > > * period, single whitespace, uppercase letter > * period, multiple whitespace > > Arguably, the test is lazy/broken for using all lower case, which is an > atypical form for sentences and use of a sentence breaker. I'll modify at > least the new parts of the test (using Markdown comments.). We can separately > decide whether to likewise modify the original parts (using traditional > comments) or whether to do that as a separate `noreg-self` bug fix for this > test. After modifying the test to capitalize the pseudo-sentences, the test again passes, and the output with the break iterator is now the same as the default simple iterator. Yay. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16388#discussion_r1393474011