In LineBreakTest.txt, some test cases indicate that there should *not* be a
break after U+0308. However, the LB rule cited does not appear to apply, and
it would appear that there *should* be a break. For example:

× 000A ÷ 0308 × 23E9 ÷ #  × [0.3] <LINE FEED (LF)> (LF_NotEastAsian) ÷ [5.03] 
COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [28.0] BLACK RIGHT-POINTING DOUBLE 
TRIANGLE (AL) ÷ [0.3]

LB28 states "Do not break between alphabetics (“at”)" with the following break 
rule:

(AL | HL) × (AL | HL)

However, in the aforementioned test case, neither U+000A nor U+0308 has break 
class AL or HL (they have break class LF and CM). Yet rule 28.0 is cited as the 
reason for not breaking between U+0308 and U+23E9. It would appear that there 
_should_ be a break here.

Likewise, for the test:

× 200B ÷ 0308 × 0024 ÷ #  × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷ [8.0] 
COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN 
(PR_NotEastAsian) ÷ [0.3]

LB24 states "Do not break between numeric prefix/postfix and letters, or 
between letters and prefix/postfix" with the following break rules:

(PR | PO) × (AL | HL)
(AL | HL) × (PR | PO)

However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they have 
break class ZW and CM). Yet rule 24.03 is cited as the reason for not breaking 
between U+0308 and U+0024. It would appear that there _should_ be a break here.
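
To make the reasoning above concrete, here is a minimal Python sketch that
checks whether LB28 or LB24 matches a pair of raw break classes. The names
(RAW_CLASS, lb28_matches, lb24_matches) are hypothetical and only for
illustration, and the class assignments are the ones shown in the test-file
comments quoted above. It compares raw classes only and does not apply any
earlier rule that might reassign a class before LB24/LB28 are reached (for
example, the combining-mark handling in LB9 and LB10).

```python
# Raw Line_Break classes for the code points in the two test cases,
# taken from the test-file comments quoted in this message.
RAW_CLASS = {
    0x000A: "LF",
    0x200B: "ZW",
    0x0308: "CM",
    0x23E9: "AL",
    0x0024: "PR",
}

ALPHA = {"AL", "HL"}
PREPOST = {"PR", "PO"}


def lb28_matches(before, after):
    """(AL | HL) × (AL | HL)"""
    return before in ALPHA and after in ALPHA


def lb24_matches(before, after):
    """(PR | PO) × (AL | HL)  or  (AL | HL) × (PR | PO)"""
    return (before in PREPOST and after in ALPHA) or (
        before in ALPHA and after in PREPOST
    )


cm = RAW_CLASS[0x0308]
print(lb28_matches(cm, RAW_CLASS[0x23E9]))  # False: CM is neither AL nor HL
print(lb24_matches(cm, RAW_CLASS[0x0024]))  # False: CM is neither AL nor HL
```

On raw classes alone, neither rule matches, which is the discrepancy
described above.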

In total, I have collected ~80 test cases from LineBreakTest.txt that exhibit 
this same pattern.
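
For anyone who wants to reproduce the collection, here is a rough Python
sketch of how a line of LineBreakTest.txt can be split into code points and
break markers. The function names are hypothetical, and it assumes the line
layout seen in the examples in this message: alternating markers and
hexadecimal code points, followed by a "#" comment.

```python
def parse_test_line(line):
    """Split a test line into (code_points, marks).

    marks[i] is the marker before code_points[i]; marks[-1] is the marker
    at end of text. "×" means no break, "÷" means break.
    """
    data = line.split("#", 1)[0].split()
    marks = data[0::2]
    code_points = [int(tok, 16) for tok in data[1::2]]
    return code_points, marks


def no_break_after(line, cp):
    """True if cp occurs before the last position and is followed by "×"."""
    cps, marks = parse_test_line(line)
    return any(c == cp and marks[i + 1] == "×" for i, c in enumerate(cps[:-1]))


line = "× 000A ÷ 0308 × 23E9 ÷ #  LF ÷ CM × AL"
print(parse_test_line(line))         # ([10, 776, 9193], ['×', '÷', '×', '÷'])
print(no_break_after(line, 0x0308))  # True
```

Filtering the file with no_break_after(line, 0x0308) only narrows the
candidates; deciding whether the cited rule applies still needs the class
of the preceding character.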

I'm wondering if these test cases were meant to include a hyphen character, 
because then they would fall under rule LB20a, which states "Do not break 
after a word-initial hyphen". This rule has the definition:

( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | [\u2010] ) × AL

So, for example, test case:

× 000A ÷ 0308 × 23E9 ÷ #  LF ÷ CM × AL  (incorrect?)

would become:

× 000A ÷ 0308 ÷ 002D × 23E9 ÷ #  LF ÷ CM ÷ HY × AL  (correct)
