Peter Constable wrote:

> Your doc says,
>
> <quote, emphasis added>
> And [sara am] should be ordered as [sara aa] followed by [nikhahit]
> (**which is the logical sequence, despite the Unicode compatibility
> decomposition**).
> </quote>
>
> What do you mean here by "logical sequence"? That that's how
> it should be interpreted phonologically and for sorting
> purposes,

Yes.

> or that that is the correct encoded sequence for
> decomposed representations?

Well, it appears that sara am is rarely decomposed in practice (unless one applies NFKD or NFKC, as for IDNs). However, the spelling convention in Khmer, where the nikhahit looks much like it does in Thai and Lao, appears to be to place the nikhahit after the vowel mark (and there are no precomposed compatibility forms).

Ideally the <C, dep. vowel, nikhahit> sequence should be interpreted the same as <C, nikhahit, dep. vowel> for Thai, Lao, and Khmer (for their respective nikhahits). But all of the nikhahits have combining class 0, so that will not follow from Unicode equivalences. For collation, at least, my suggestion (in the referenced documents) is to treat them as equivalent for the orthographically used combinations in Thai, Lao, and Khmer.

> If the latter, that seems to me to be quite wrong: I would
> not expect *any* data that includes a decomposed
> representation of sara am to have the sequence < C, sara aa,
> nikkahit >: it would always be the other way around: < C,
> nikkahit, sara aa >.

Perhaps, for Thai and Lao (just because the Unicode decompositions are like that). But the expected sequence for the closely related Khmer script appears to have the nikhahit after the dependent vowel. Likewise for other Indic scripts, where the nikhahit-related characters are typographically clearly after the dependent vowel.

However, the CTT/DUCET currently give only level 2 weights to visargas and anusvaras, ignoring them at level 1. I don't know whether they should also be given level 1 weights for the other Indic scripts (as they should for Lao/Thai/Khmer). (See http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2716.doc.)

/kent k

PS While not related to Indic scripts (though it involves similar grouping, with a similar solution), I also submitted this contribution on Hangul collation to WG2:
http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2715.doc

> Of course, if the former, I would agree.
>
> Peter
>
> Peter Constable
> Globalization Infrastructure and Font Technologies
> Microsoft Windows Division
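
To make the points in the reply concrete, here is a minimal Python sketch using the standard `unicodedata` module. It checks that NFKD turns sara am into <nikhahit, sara aa>, that the Thai, Lao, and Khmer nikhahits all have combining class 0 (so normalization never reorders them), and then shows one possible collation pre-processing step that rewrites <nikhahit, aa> as <aa, nikhahit>. The `SWAP_AFTER` table and the `vowel_first` function are hypothetical names invented for this illustration; they are not taken from N2716, and the table covers only the Thai and Lao "am" decompositions, not the full set of orthographically used combinations.

```python
import unicodedata

THAI_NIKHAHIT, THAI_SARA_AA, THAI_SARA_AM = "\u0E4D", "\u0E32", "\u0E33"
LAO_NIGGAHITA, LAO_AA, LAO_AM = "\u0ECD", "\u0EB2", "\u0EB3"
KHMER_NIKAHIT = "\u17C6"

# Hypothetical table (an assumption for this sketch): each nikhahit mapped
# to the vowel it precedes in the Unicode compatibility decomposition.
SWAP_AFTER = {THAI_NIKHAHIT: THAI_SARA_AA, LAO_NIGGAHITA: LAO_AA}


def vowel_first(text: str) -> str:
    """Apply NFKD, then rewrite <nikhahit, aa> as <aa, nikhahit>.

    Text that already has the vowel first is left unchanged, so both
    spellings end up as the same string before collation keys are built.
    """
    chars = list(unicodedata.normalize("NFKD", text))
    i = 0
    while i < len(chars) - 1:
        if SWAP_AFTER.get(chars[i]) == chars[i + 1]:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


if __name__ == "__main__":
    ko_kai = "\u0E01"  # THAI CHARACTER KO KAI, used as the consonant C
    decomposed = unicodedata.normalize("NFKD", ko_kai + THAI_SARA_AM)
    print([f"U+{ord(c):04X}" for c in decomposed])
    for sign in (THAI_NIKHAHIT, LAO_NIGGAHITA, KHMER_NIKAHIT):
        print(unicodedata.name(sign), unicodedata.combining(sign))
    print([f"U+{ord(c):04X}" for c in vowel_first(ko_kai + THAI_SARA_AM)])
```

Note that NFKD also folds unrelated compatibility characters, so a real collation tailoring would more likely express this equivalence as contractions or reordering rules in the collation table itself than as a string rewrite.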