On Thu, 17 Oct 2019 23:11:55 +0100 Richard Wordingham via Unicode <unicode@unicode.org> wrote:
> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9). Using the DUCET collation, the canonically equivalent strings
> รู้ <U+0E23 THAI CHARACTER RO RUA, U+0E39 THAI CHARACTER SARA UU,
> U+0E49 THAI CHARACTER MAI THO> and รู้ <U+0E23, U+0E49, U+0E39>
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into <U+0E23> and <U+0E39, U+0E49> and the
> second decomposes into <U+0E23, U+0E49> and <U+0E39>.

Correction: One has to take the collating elements in NFD order, so the
tone mark (secondary weight) and the vowel (primary weight) also form a
cluster, and the division into clusters is <U+0E23>, <U+0E49, U+0E39>.
This split respects canonical equivalence.

Replacement: Now, one form of typo one may see in Thai is where the
vowel is typed twice. Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid. Proof-reading will not spot the mistake if the font or
layout engine does not assist. Thus we can get <U+0E23, U+0E39,
U+0E39, U+0E49> (417,000 raw Google hits, the first 10 all good). That
splits into *three* collation grapheme clusters - <U+0E23>, <U+0E39>
and <U+0E39, U+0E49>. Its canonical equivalent <U+0E23, U+0E49,
U+0E39, U+0E39> splits differently: to form a sequence of collating
elements without skipping, starting at the U+0E49, one must take all
three marks. We therefore end up with only *two* collation grapheme
clusters, <U+0E23> and <U+0E49, U+0E39, U+0E39>.

> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.

Richard.
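
P.S. For anyone who wants to check that the two doubled-vowel spellings
really are canonically equivalent, here is a minimal sketch using
Python's standard unicodedata module. It only compares NFD forms and
prints the canonical combining classes; it does not attempt to compute
collation grapheme clusters, which would need the DUCET data. The
variable and function names are mine, chosen for illustration.

```python
import unicodedata

RO_RUA = "\u0e23"   # THAI CHARACTER RO RUA (base consonant)
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU (vowel below)
MAI_THO = "\u0e49"  # THAI CHARACTER MAI THO (tone mark above)


def nfd_codepoints(s):
    """Return the NFD form of s as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]


# The two spellings with the vowel typed twice.
vowel_first = RO_RUA + SARA_UU + SARA_UU + MAI_THO
tone_first = RO_RUA + MAI_THO + SARA_UU + SARA_UU

# Canonical combining classes decide the NFD order of the marks.
for name, ch in (("SARA UU", SARA_UU), ("MAI THO", MAI_THO)):
    print(name, "ccc =", unicodedata.combining(ch))

print(nfd_codepoints(vowel_first))
print(nfd_codepoints(tone_first))

# Equal NFD forms mean the two strings are canonically equivalent.
print(nfd_codepoints(vowel_first) == nfd_codepoints(tone_first))
```

Since the vowel has the lower combining class, both strings normalise
to the vowel-first order, which is why a process that clusters them
differently is treating canonically equivalent text differently.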