On 3/5/2012 2:32 PM, Denis Jacquerye wrote:
I guess it's less messy than other situations. I just couldn't help
wondering why combining letters with diacritics are being encoded but
letters with diacritics are out of the question.

Because the combining ones are *not* decomposed, and hence don't
have normalization issues. (At least as long as we don't start down the
inadvisable path of encoding decomposable ones...)

The precomposed base letters *are* decomposed, and have been so essentially
forever in the standard. Because of that, base+diacritic and <base, combining-diacritic>
*are* normalized together. And because the decomposed form is
already present (and the normalized form will always be that sequence), there
is nothing to gain by encoding precomposed versions of new base letters
of this sort.
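
A minimal sketch of the point about base letters, assuming Python's standard
unicodedata module. The helper names and the sample characters (U+00E9 and
a q followed by U+0303) are chosen here for illustration only.

```python
import unicodedata


def nfc(text):
    """Return the NFC (composed) normalization of text."""
    return unicodedata.normalize("NFC", text)


def nfd(text):
    """Return the NFD (decomposed) normalization of text."""
    return unicodedata.normalize("NFD", text)


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return nfd(a) == nfd(b)


# An existing precomposed letter and its <base, combining-diacritic> sequence.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"     # e + COMBINING ACUTE ACCENT

# A letter+diacritic pair with no precomposed code point (illustrative choice).
q_tilde = "q\u0303"      # q + COMBINING TILDE

if __name__ == "__main__":
    print(canonically_equivalent(precomposed, sequence))
    print([hex(ord(c)) for c in nfd(precomposed)])
    # NFC leaves the sequence alone: the decomposed form is the normalized form.
    print([hex(ord(c)) for c in nfc(q_tilde)])
```

The last case is the one that matters for new letters: the sequence is
already its own normalized form, so a precomposed code point would add nothing.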

Normalization was never designed to recurse through base letters used
as combining marks. And the very few instances of combining marks
which *do* have decompositions (see, e.g., U+0344 for a notorious
example) have made implementers' lives a misery. They create special-case
funkiness in normalization, testing, collation, ...
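
A small sketch of the contrast, again assuming Python's standard unicodedata
module. The helper canonical_decomposition is a hypothetical name, and
U+0364 is picked here only as one example of a combining letter that has no
decomposition.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition mapping of ch as a list of
    code points, or an empty list if the character has none.

    Compatibility mappings (tagged like '<compat> ...') are ignored.
    """
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []
    return [int(cp, 16) for cp in field.split()]


dialytika_tonos = "\u0344"   # COMBINING GREEK DIALYTIKA TONOS
combining_e = "\u0364"       # COMBINING LATIN SMALL LETTER E

if __name__ == "__main__":
    # A combining mark that decomposes into two other combining marks.
    print([hex(cp) for cp in canonical_decomposition(dialytika_tonos)])
    # A combining letter with no decomposition, hence no normalization issue.
    print(canonical_decomposition(combining_e))
    # Normalization never produces U+0344; it is replaced by its decomposition.
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", dialytika_tonos)])
```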

--Ken

