The proposed rules do not distinguish among the different visual forms that a sequence of characters surrounding a virama can take, such as:

1. an explicit virama is visible, or
2. a half-form is visible, or
3. a ligature is created.

That follows the structure requested in http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct Forms in Devanagari <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G14632>) doesn't break a GC, nor do instances where a particular script always shows an explicit virama between two particular consonants. Every line of Figure 12-7. Consonant Forms in Devanagari and Oriya <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G59257> that has a virama would be a single GC (that is, all but the first line). [That is after correcting the rules as per Manish Goregaokar's feedback, thanks!]

The examples in "Annexure B" of 17200-text-seg-rec.pdf <http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf> clearly include #2 and #3, but don't have any examples of #1 (as far as I can tell from a quick scan). It would be very useful to have explicit examples that include #1, and that include scripts other than Devanagari (+swaran, others).

The online tool at http://unicode.org/cldr/utility/breaks.jsp can't be used for this until the Unicode 11 UCD is further along, but I have an implementation of the new rules, so I can take any particular list of words and generate the breaks. So if someone can supply examples from different scripts, or with different combinations of virama, ZWJ, ZWNJ, etc., I can push out the results to this list.

And yes, we do need review of these for Malayalam (+cibu, others).

If there are scripts for which the rules really don't work (or that need more research before UAX #29 is finalized in May), it is fairly straightforward to restrict the rule changes by modifying http://www.unicode.org/reports/tr29/proposed.html#Virama to either exclude particular scripts or include only particular scripts.

Mark <https://twitter.com/mark_e_davis>

On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode <[email protected]> wrote:

> On Sat, 9 Dec 2017 16:16:44 +0100
> Mark Davis ☕️ via Unicode <[email protected]> wrote:
>
> > 1. You make a good point about the GB9c. It should probably instead be
> > something like:
> >
> > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant
> >
> >
> > Extend is broader than necessary, and there are a few items that
> > have ccc!=0 but not gcb=extend. But all of those look to be
> > degenerate cases.
>
> Something *like*.
>
> Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I believe
> these both prevent a preceding candrakkala from extending an akshara -
> see TUS Section 12.9 about Table 12-33. I think Extend will have to be
> split between starters and non-starters.
>
> I believe there is a problem with the first two examples in Table
> 12-33. If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E MALAYALAM
> VOWEL SIGN AA> to the first two examples, yielding *പാലു്കാ and
> *എ്ന്നാകാ, one would have three Malayalam aksharas, not two extended
> grapheme clusters as the proposed rules would say. This is different to
> Tai Tham, where there would indeed just be two aksharas in each word,
> albeit odd-looking - ᨷᩤᩃᩩ᩠ᨠᩣ and ᩑ᩠ᨶ᩠ᨶᩣᨠᩣ. Who's checking the impact of
> these changes on Malayalam?
>
> Richard.
>
>
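
For anyone who wants to try word lists against the tentative GB9c formulation quoted above, here is a rough Python sketch. It is not the implementation mentioned in this thread and not the UCD property data: the three helper predicates are hypothetical stand-ins (Virama approximated by ccc=9, Extend by general category M* plus ZWJ/ZWNJ, LinkingConsonant by Devanagari KA..HA only), and every other segmentation rule is ignored.

```python
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"

def is_virama(ch):
    # Assumption: approximate Virama by canonical combining class 9.
    return unicodedata.combining(ch) == 9

def is_extend(ch):
    # Assumption: approximate Extend (and SpacingMark) by general
    # category M*, plus ZWJ and ZWNJ. The real property differs.
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch).startswith("M")

def is_linking_consonant(ch):
    # Assumption: Devanagari KA..HA only, purely for illustration.
    return "\u0915" <= ch <= "\u0939"

def clusters(text):
    """Split text into clusters using only 'no break before Extend'
    and the tentative GB9c: (Virama | ZWJ) x Extend* LinkingConsonant."""
    out = []
    for i, ch in enumerate(text):
        if i == 0:
            out.append(ch)
            continue
        join = is_extend(ch)
        if not join and is_linking_consonant(ch):
            # Walk back over the Extend run, stopping at the nearest
            # Virama or ZWJ (both of which are themselves Extend here).
            j = i - 1
            while (j >= 0 and is_extend(text[j])
                   and not (is_virama(text[j]) or text[j] == ZWJ)):
                j -= 1
            join = j >= 0 and (is_virama(text[j]) or text[j] == ZWJ)
        if join:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# KA + VIRAMA + ZWNJ + SSA: the ZWNJ does not split the cluster.
print(len(clusters("\u0915\u094D\u200C\u0937")))
# KA + SSA with no virama: two clusters.
print(len(clusters("\u0915\u0937")))
```

Because the sketch lets any Extend character sit between the virama and the consonant, it also joins across an intervening anusvara, which is the behaviour questioned in the quoted message.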

