1. You make a good point about the GB9c. It should probably instead be something like:
GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant

Extend is broader than necessary, and there are a few items that have ccc!=0 but not gcb=extend. But all of those look to be degenerate cases.

https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory

Mark <https://twitter.com/mark_e_davis>

On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <[email protected]> wrote:

> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters
> (https://www.unicode.org/reports/tr29/proposed.html).
>
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
>
> For extended grapheme clusters, the relevant rules are proposed as:
>
> GB9: × (Extend | ZWJ | Virama)
>
> GB9c: (Virama | ZWJ ) × LinkingConsonant
>
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
>
> <consonant, nukta, virama, consonant> (no break)
>
> but
>
> <consonant, virama, nukta | consonant>
>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
>
> natural order, no break:
>
> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
>
> but normalised, there would be a break:
>
> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules. How is this expected to work, e.g. for Saurashtra in Tamil
> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
>
> Richard.
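
To make the nukta/virama case in the quoted message concrete, here is a minimal Python sketch. It is not the UAX #29 algorithm: `toy_class` and `breaks` are hypothetical helpers that model only GB9 and GB9c over a toy classification (Devanagari virama as Virama, other Mn/Me marks as Extend, Lo letters as LinkingConsonant). The `skip_extend` flag is one possible reading of the revised rule, in which Extend characters between the Virama and the LinkingConsonant are skipped; that reading is an assumption.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA (ccc=7)
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (ccc=9)
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def toy_class(ch):
    """Hypothetical, greatly simplified property lookup (demo only)."""
    if ch == VIRAMA:
        return "Virama"
    if ch == ZWJ:
        return "ZWJ"
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "Extend"
    if unicodedata.category(ch) == "Lo":
        return "LinkingConsonant"
    return "Other"


def breaks(text, skip_extend=False):
    """Return the indices i at which a break falls before text[i].

    Only GB9 and GB9c are modelled; every other position breaks.
    """
    classes = [toy_class(c) for c in text]
    result = []
    for i in range(1, len(text)):
        cur = classes[i]
        # GB9: no break before Extend, ZWJ or Virama.
        if cur in ("Extend", "ZWJ", "Virama"):
            continue
        # GB9c: no break between (Virama | ZWJ) and a LinkingConsonant.
        if cur == "LinkingConsonant":
            j = i - 1
            if skip_extend:
                # Assumed reading of the revision: skip intervening Extend.
                while j >= 0 and classes[j] == "Extend":
                    j -= 1
            if j >= 0 and classes[j] in ("Virama", "ZWJ"):
                continue
        result.append(i)
    return result


natural = KA + NUKTA + VIRAMA + KA   # already in canonical order
swapped = KA + VIRAMA + NUKTA + KA   # canonically equivalent to `natural`

print(unicodedata.normalize("NFD", swapped) == natural)
print(breaks(natural), breaks(swapped))
print(breaks(natural, skip_extend=True), breaks(swapped, skip_extend=True))
```

Under the rules as proposed, the two canonically equivalent strings get different break positions in this toy model; with the Extend-skipping reading they agree. Real property assignments (for example which characters count as LinkingConsonant) would come from the Unicode Character Database, not from the general-category shortcut used here.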

