UAX #29 and WB4

Daniel Bünzli via Unicode Wed, 04 Mar 2020 09:29:24 -0800

Hello, 

My implementation of word break chokes only on the following test case from the 
file [1]:


÷ 0020 × 0308 ÷ 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS 
(Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3] 

I find: 

÷ 0020 × 0308 × 0020 ÷

Basically, my implementation uses WB4 to rewrite the first two characters to a 
single WSegSpace and then applies WB3d, which results in the non-break between 
0308 and 0020.

Re-reading the text, I suspect I should not restart from the first rule when a 
WB4 rewrite occurs, but only apply the rules that come after WB4. Is that 
correct?
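
To make the two readings concrete, here is a minimal Python sketch. It is a 
toy, not a conforming implementation: it models only WB1, WB2, WB3d, WB4 and 
WB999, the property table is hand-written for these two code points, and the 
names `breaks`, `render` and `restart_after_wb4` are made up for illustration. 
WB4's exceptions after CR, LF and Newline are not modeled.

```python
# Toy subset of UAX #29 word breaking: WB1, WB2, WB3d, WB4 and WB999 only.
# Hand-written property table for this example, not the real UCD data.
WORD_BREAK = {0x0020: "WSegSpace", 0x0308: "Extend"}


def prop(cp):
    return WORD_BREAK.get(cp, "Other")


def breaks(cps, restart_after_wb4=False):
    """Return one boolean per boundary position (True means break)."""
    if not cps:
        return []  # WB1/WB2 do not apply to empty text
    props = [prop(cp) for cp in cps]
    out = [True]  # WB1: break at start of text
    for i in range(1, len(cps)):
        left, right = props[i - 1], props[i]
        # WB3d comes before WB4, so it sees the raw adjacent characters.
        if left == "WSegSpace" and right == "WSegSpace":
            out.append(False)
            continue
        # WB4: never break before an Extend character.
        if right == "Extend":
            out.append(False)
            continue
        if restart_after_wb4:
            # The reading that fails the test: collapse "X Extend*" to X
            # and then run WB3d again on the collapsed sequence.
            j = i - 1
            while j > 0 and props[j] == "Extend":
                j -= 1
            if props[j] == "WSegSpace" and right == "WSegSpace":
                out.append(False)
                continue
        out.append(True)  # WB999: otherwise break
    out.append(True)  # WB2: break at end of text
    return out


def render(cps, flags):
    """Format a result in the notation used by WordBreakTest.txt."""
    marks = ["÷" if b else "×" for b in flags]
    parts = [marks[0]]
    for cp, mark in zip(cps, marks[1:]):
        parts += [f"{cp:04X}", mark]
    return " ".join(parts)


case = [0x0020, 0x0308, 0x0020]
print(render(case, breaks(case)))
print(render(case, breaks(case, restart_after_wb4=True)))
```

With `restart_after_wb4=False` the sketch reproduces the expected line from the 
test file; with `restart_after_wb4=True` it reproduces the result I get.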

Best, 

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt
