RE: sara am ordering (was RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo ?)

Kent Karlsson Thu, 01 Apr 2004 04:36:18 -0800

Peter Constable wrote:

> Your doc says,
> 
> <quote, emphasis added>
> And sara am (U+0E33) should be ordered as sara aa (U+0E32) followed
> by nikhahit (U+0E4D) (**which is the 
> logical sequence, despite the Unicode compatibility decomposition**).
> </quote>
> 
> What do you mean here by "logical sequence"? That that's how 
> it should be interpreted phonologically and for sorting 
> purposes,


Yes.

> or that that is the correct encoded sequence for 
> decomposed representations?

Well, it appears that sara am is rarely decomposed in practice
(unless one applies NFKD or NFKC, as is done for IDNs).
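
To illustrate the point about decomposition, here is a minimal Python
sketch using the standard unicodedata module. It shows that Thai sara am
(U+0E33) is left intact by canonical normalization and is split only by
compatibility normalization. The helper name code_points is hypothetical,
chosen for this example only.

```python
import unicodedata

SARA_AM = "\u0e33"  # THAI CHARACTER SARA AM


def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]


# Canonical decomposition: sara am has no canonical mapping, so it stays whole.
nfd = code_points(unicodedata.normalize("NFD", SARA_AM))

# Compatibility decomposition: nikhahit (U+0E4D) first, then sara aa (U+0E32).
nfkd = code_points(unicodedata.normalize("NFKD", SARA_AM))

print(nfd)
print(nfkd)
```

Note that the compatibility decomposition puts the nikhahit before the
vowel, which is the order under discussion in this thread.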

However, the spelling convention in Khmer, where the nikhahit
looks much like it does in Thai and Lao, appears to be to place
the nikhahit after the vowel mark (and there are no compatibility
precomposed forms). Ideally the <C, dep. vowel, nikhahit> sequence
should be interpreted the same as <C, nikhahit, dep. vowel> for Thai,
Lao, and Khmer (for their respective nikhahits). But all of the nikhahits
have combining class 0, so normalization never reorders them relative to
the vowel, and the equivalence does not follow from Unicode equivalences.
For collation, at least, my suggestion (in the referenced documents) is
to treat the two orders as equivalent for the orthographically used
combinations in Thai, Lao, and Khmer.
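
As a rough sketch of that suggestion (not the method of the referenced
documents, which would be done in the collation tables themselves), the
Python below folds both orders to a single one before comparison. The
function name fold_nikhahit_order and the PAIRS table are hypothetical,
and the table is deliberately incomplete: it lists only the "aa" vowel
for each script. Applying NFKD to the whole string is also an assumption
made to keep the example short.

```python
import unicodedata

# (nikhahit, vowel) pairs; a hypothetical, incomplete table for illustration.
PAIRS = {
    ("\u0e4d", "\u0e32"),  # Thai nikhahit + sara aa
    ("\u0ecd", "\u0eb2"),  # Lao niggahita + vowel sign aa
    ("\u17c6", "\u17b6"),  # Khmer nikahit + vowel sign aa
}


def fold_nikhahit_order(text):
    """Rewrite <nikhahit, vowel> as <vowel, nikhahit> for the listed pairs."""
    # NFKD first, so that precomposed sara am (U+0E33, U+0EB3) is split.
    chars = list(unicodedata.normalize("NFKD", text))
    i = 0
    while i + 1 < len(chars):
        if (chars[i], chars[i + 1]) in PAIRS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


# All three nikhahits have combining class 0, so normalization alone
# never reorders them relative to a vowel.
ccc = [unicodedata.combining(c) for c in "\u0e4d\u0ecd\u17c6"]

# Three spellings of Thai <ko kai, sara am> fold to the same string.
ko_kai = "\u0e01"
same = (
    fold_nikhahit_order(ko_kai + "\u0e33")
    == fold_nikhahit_order(ko_kai + "\u0e32\u0e4d")
    == fold_nikhahit_order(ko_kai + "\u0e4d\u0e32")
)
print(ccc, same)
```

The folded strings could then be passed to whatever collation routine is
in use, so that both encodings of a combination receive the same key.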

> If the latter, that seems to me to be quite wrong: I would 
> not expect *any* data that includes a decomposed 
> representation of sara am to have the sequence < C, sara aa, 
> nikkahit >: it would always be the other way around: < C, 
> nikkahit, sara aa >.

Perhaps, for Thai and Lao (just because the Unicode decompositions
are like that). But the expected sequence for the closely related Khmer
script appears to have the nikhahit after the dependent vowel...
Likewise for other Indic scripts, where the nikhahit-related characters
are typographically clearly after the dependent vowel. However, the
CTT/DUCET currently give only level 2 weights to visargas and
anusvaras, ignoring them at level 1. I don't know whether they should
also be given level 1 weights for the other Indic scripts (as they should
for Lao/Thai/Khmer). (See http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2716.doc.)

                /kent k

PS
Though it is not about Indic scripts (it involves a similar grouping problem,
with a similar solution), I also submitted this contribution on Hangul
collation to WG2:
 http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2715.doc



> Of course, if the former, I would agree.
> 
> 
> 
> Peter
>  
> Peter Constable
> Globalization Infrastructure and Font Technologies
> Microsoft Windows Division
> 
> 
>
