On Thu, 22 Dec 2016 14:05:18 -0800 Manish Goregaokar <man...@mozilla.com> wrote:
> I guess the confusion is, with → rules, do we apply them globally, or
> only apply them when considering subsequent rules?

I would say the latter.  The logic is that you apply the whole set of
rules on either side of each character.

> I suspect the answer here is that you only apply them in order. The
> list of rules is not a list of precedences, but rather a list with the
> order in which the rules are applied. So a → rule means "Treat the
> left side as if it were the right side in the context of all
> subsequent rules"

I would indeed say that you apply them in order.  The relevant example
in the test suite (file auxiliary/WordBreakTest.txt in the UCD) is:

÷ 000D ÷ 0308 ÷ 000A ÷

Now, I am not sure if it is possible to automatically turn the rules
into an automatic break iterator based on regular expressions.  The last
time I looked, ICU was doing this by manual conversion.  I would
therefore deduce that such a conversion is impossible, difficult, or
produces highly inefficient code.

ICU has the added complication that it also needs to invoke real
Southeast Asian break iterators.  When I looked, their interface was not
returning appropriate word-break properties for the characters, but was
itself a break iterator.

Richard.
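
P.S. To make the in-order reading concrete, here is a minimal Python
sketch.  It is only an illustration under heavy assumptions: the names
wb_class and word_breaks are made up for this message, the property
lookup knows only CR, LF, U+0308 (as Extend) and letters (as ALetter),
and every rule other than WB1-WB3b, WB4, WB5 and WB999 is ignored.  It
is not how ICU does it.

```python
# Toy subset of the UAX #29 word-break rules, applied strictly in order.
CR, LF, EXTEND, ALETTER, OTHER = "CR", "LF", "Extend", "ALetter", "Other"


def wb_class(ch):
    """Hypothetical, drastically simplified Word_Break property lookup."""
    if ch == "\r":
        return CR
    if ch == "\n":
        return LF
    if ch == "\u0308":  # COMBINING DIAERESIS, the only Extend known here
        return EXTEND
    if ch.isalpha():
        return ALETTER
    return OTHER


def is_break(classes, i):
    """Decide whether there is a boundary between positions i-1 and i."""
    left, right = classes[i - 1], classes[i]
    if left == CR and right == LF:   # WB3: CR x LF
        return False
    if left in (CR, LF):             # WB3a: break after CR or LF
        return True
    if right in (CR, LF):            # WB3b: break before CR or LF
        return True
    # WB4 (X Extend* -> X) only takes effect from this point on.
    if right == EXTEND:              # Extend attaches to the preceding X
        return False
    j = i - 1
    # Skip back over Extend to find X; an Extend at the start of text or
    # right after CR/LF is itself the X.
    while j > 0 and classes[j] == EXTEND and classes[j - 1] not in (CR, LF):
        j -= 1
    left = classes[j]
    if left == ALETTER and right == ALETTER:  # WB5: ALetter x ALetter
        return False
    return True                      # WB999: otherwise break everywhere


def word_breaks(text):
    """Return the list of boundary offsets, including start and end."""
    if not text:
        return []
    classes = [wb_class(ch) for ch in text]
    inner = [i for i in range(1, len(text)) if is_break(classes, i)]
    return [0] + inner + [len(text)]  # WB1 and WB2


print(word_breaks("\r\u0308\n"))   # [0, 1, 2, 3]
print(word_breaks("a\u0308b"))     # [0, 3]
```

The first call reproduces the test line above: WB3a and WB3b fire before
WB4 is ever consulted, so U+0308 is not absorbed into the CR.  Had the
Extend been stripped globally first, the sequence would have become
CR LF, which WB3 keeps together, contradicting the test data.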