On Wed, 21 Dec 2016 15:24:21 -0800 Manish Goregaokar <man...@mozilla.com> wrote:
> Aside from that, WB4's[6] greediness is underspecified. In previous
> versions, the rule was <snip>
> However, now the rule is
>
> > X (Extend | Format | ZWJ)* → X
>
> The problem here is that ZWJ appears in the previous rule as well,
> WB3c[7]:
>
> > ZWJ × (Glue_After_Zwj | EBG)
>
> which says that we should not break between a ZWJ and a GAZ ("Glue
> After ZWJ") character.
>
> WB3c has precedence over WB4, which means that a sequence like
> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
> ZWJ is collapsed into the Emoji_Base. This is fine.
>
> However, more complicated sequences depend on the greediness of the
> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
> since we have a Extend/ZWJ sequence.
>
> WB4 can apply in multiple ways. If it is applied greedily, we get
> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
> characters). This does break since you don't break between Emoji_Base
> and EBG.
>
> However, we can apply it conservatively instead. We can get
> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
> collapse.

From your terminology, I think you have an error in your transformation
to a 'regular' expression. Why don't you have the same problem when you
determine word breaks in CR Extend LF? I'm guessing that you have some
mechanism that makes WB3 (CR × LF) redundant.

Rule WB3c does *not* transform to

ZWJ(...) × (Glue_After_Zwj | EBG)

Naively, I would say that WB4 can be reapplied to
`Emoji_Base(..) ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word
break.

Richard.
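
For concreteness, here is a minimal Python sketch of the naive reading described above. It is an illustration under stated assumptions, not a conforming UAX #29 implementation: it models only WB3, WB3a, WB3b, WB3c, WB4 and the default WB999, it represents each character by a property label spelled as in this thread, and the names `collapse_wb4` and `breaks` are made up for the example. Rules before WB4 are tested on the raw neighbours of a boundary, which is why WB3c (like WB3 for CR Extend LF) is not widened to skip over ignorable characters.

```python
# Property labels as used in this thread (assumed spellings, not an official API).
IGNORABLE = {"Extend", "Format", "ZWJ"}   # characters absorbed by WB4
HARD = {"CR", "LF", "Newline"}            # excluded from X in WB4


def collapse_wb4(props):
    """Apply WB4 until nothing changes: X (Extend | Format | ZWJ)* -> X.

    An ignorable character is dropped when something precedes it that is
    not CR, LF or Newline. Dropping one at a time in a single pass gives
    the same result as reapplying the rule to its own output.
    """
    out = []
    for p in props:
        if p in IGNORABLE and out and out[-1] not in HARD:
            continue  # absorbed into the preceding X
        out.append(p)
    return out


def breaks(props):
    """Return one bool per internal boundary; True means a break is allowed."""
    result = []
    for i in range(1, len(props)):
        left, right = props[i - 1], props[i]
        if left == "CR" and right == "LF":                          # WB3
            result.append(False)
        elif left in HARD:                                          # WB3a
            result.append(True)
        elif right in HARD:                                         # WB3b
            result.append(True)
        elif left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:  # WB3c
            result.append(False)
        elif right in IGNORABLE:                                    # WB4
            result.append(False)
        else:
            # Later rules would see the collapsed sequence; none of them
            # is modelled here, so the default WB999 applies: break.
            result.append(True)
    return result


seq = ["Emoji_Base", "Extend", "ZWJ", "Extend", "EBG"]
print(collapse_wb4(seq))
print(breaks(seq))
```

Under these assumptions, `collapse_wb4(seq)` reduces the sequence to `Emoji_Base EBG`, and `breaks(seq)` allows a break only at the last boundary, before EBG. `breaks(["Emoji_Base", "ZWJ", "EBG"])` allows no break, because WB3c sees ZWJ and EBG as raw neighbours, and `breaks(["CR", "Extend", "LF"])` allows a break on both sides of the Extend, by WB3a and WB3b.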