> GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant

You can also explicitly request a ligature with a ZWJ, so perhaps this rule should be something like
(Virama ZWJ? | ZWJ) × Extend* LinkingConsonant

-Manish

On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:

> 1. You make a good point about the GB9c. It should probably instead be
> something like:
>
> GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant
>
> Extend is broader than necessary, and there are a few items that have
> ccc!=0 but not gcb=extend. But all of those look to be degenerate cases.
>
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory
> <https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%5Cp%7Bccc!=0%7D-%5Cp%7Bgcb=extend%7D]&g=ccc+indicsyllabiccategory>
>
> Mark <https://twitter.com/mark_e_davis>
>
> On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <unicode@unicode.org> wrote:
>
>> Apart from the likely but unmandated consequence of making editing
>> Indic text more difficult (possibly contrary to the UK's Equality Act
>> 2010), there is another difficulty that will follow directly from the
>> currently proposed expansion of grapheme clusters
>> (https://www.unicode.org/reports/tr29/proposed.html).
>>
>> Unless I am missing something, text boundaries have hitherto been
>> cunningly crafted so that they are not changed by normalisation.
>> Have I missed something, or has there been a change in policy?
>>
>> For extended grapheme clusters, the relevant rules are proposed as:
>>
>> GB9: × (Extend | ZWJ | Virama)
>>
>> GB9c: (Virama | ZWJ ) × LinkingConsonant
>>
>> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
>> This would lead canonically equivalent text to have strikingly
>> different divisions:
>>
>> <consonant, nukta, virama, consonant> (no break)
>>
>> but
>>
>> <consonant, virama, nukta | consonant>
>>
>> There are other variations on this theme.
>> In Tai Tham, we have the following conflict:
>>
>> natural order, no break:
>>
>> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
>>
>> but normalised, there would be a break:
>>
>> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
>>
>> From reading the text, it seems that it is expected that the presence
>> or absence of a break should be fine-tuned by CLDR language-specific
>> rules. How is this expected to work, e.g. for Saurashtra in Tamil
>> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the
>> root locale now specify the default segmentation rule, rather than
>> UAX#29 plus the Unicode Character Database?
>>
>> Richard.
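
For illustration only, here is a rough Python sketch of the nukta/virama case discussed above. It is not the UAX #29 algorithm: the helper names (`is_linking_consonant`, `is_extend`, `gb9c_joins`) are made up for this example, LinkingConsonant is approximated by the Devanagari range KA..HA, and Extend is approximated by "nonzero ccc, other than virama". Under those assumptions it shows that the originally proposed GB9c gives different answers for two canonically equivalent strings, while the variant with `Extend*` gives the same answer for both.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (ccc=9)
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA (ccc=7)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
KA, SSA = "\u0915", "\u0937"


def is_linking_consonant(ch):
    # Assumption: only Devanagari KA..HA count as LinkingConsonant here.
    return "\u0915" <= ch <= "\u0939"


def is_extend(ch):
    # Assumption: a nonzero combining class stands in for gcb=Extend,
    # with virama kept out because the proposal gives it its own class.
    return unicodedata.combining(ch) != 0 and ch != VIRAMA


def gb9c_joins(text, i, skip_extend):
    """Return True if the toy GB9c forbids a break before text[i].

    skip_extend=False models (Virama | ZWJ) x LinkingConsonant;
    skip_extend=True models (Virama | ZWJ) x Extend* LinkingConsonant.
    """
    if not is_linking_consonant(text[i]):
        return False
    j = i - 1
    if skip_extend:
        # Walk back over any Extend characters before the consonant.
        while j >= 0 and is_extend(text[j]):
            j -= 1
    return j >= 0 and text[j] in (VIRAMA, ZWJ)


# <consonant, virama, nukta, consonant>, and its canonical reordering.
typed = KA + VIRAMA + NUKTA + SSA
normalized = unicodedata.normalize("NFD", typed)

for label, s in (("as typed", typed), ("normalised", normalized)):
    print(label,
          [f"U+{ord(c):04X}" for c in s],
          "without Extend*:", gb9c_joins(s, 3, skip_extend=False),
          "with Extend*:", gb9c_joins(s, 3, skip_extend=True))
```

Normalisation moves the nukta ahead of the virama because canonical ordering sorts adjacent combining marks by ascending ccc. Without `Extend*`, the toy rule then joins the final consonant only in the normalised string; with `Extend*`, it joins it in both.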