On 3/5/2012 2:32 PM, Denis Jacquerye wrote:
> I guess it's less messy than other situations. I just couldn't help wondering why combining letters with diacritics are being encoded but letters with diacritics are out of the question.
Because the combining ones are *not* decomposed, and hence don't have normalization issues. (At least as long as we don't start down the inadvisable path of encoding decomposable ones...)

The base letters *are* decomposed, and have been so forever in the standard, essentially. Because of that, base+diacritic and <base, combining-diacritic> *are* normalized together. And because the decomposed form is already present (and the normalized form will always be that), there is nothing to gain by encoding precomposed versions of new base letters of this sort.

Normalization was never designed to recurse through base letters used as combining marks. And the very few instances of combining marks which *do* have decompositions (see, e.g., U+0344 for a notorious example) have made implementers' lives a misery. They create special-case funkiness in normalization, testing, collation, ...

--Ken
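
For anyone who wants to check the points above against the character database, here is a minimal sketch using Python's standard unicodedata module. The helper names (canonical_decomposition, canonically_equivalent) and the sample characters other than U+0344 are my own choices for illustration, not anything from the thread, and the results depend on the Unicode version shipped with your Python.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Returns an empty list when ch has no decomposition or only a
    compatibility one (hypothetical helper, for illustration only).
    """
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings start with a tag such as "<compat>"; skip them.
    if not fields or fields[0].startswith("<"):
        return []
    return [int(f, 16) for f in fields]


def canonically_equivalent(a, b):
    """True if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A precomposed base letter: U+00E7 decomposes to <c, U+0327>, so the
# precomposed and decomposed spellings normalize together.
print(canonical_decomposition("\u00e7"))
print(canonically_equivalent("\u00e7", "c\u0327"))

# A base + mark sequence with no precomposed form (sample choice: q + tilde)
# is left as-is by NFC.
print(unicodedata.normalize("NFC", "q\u0303") == "q\u0303")

# Combining letters such as U+0363 and U+1DD7 carry no decomposition,
# so normalization leaves them untouched.
print(canonical_decomposition("\u0363"))
print(canonical_decomposition("\u1dd7"))

# U+0344 is one of the few combining marks that does decompose, and even
# NFC rewrites it to the two-mark sequence.
print(canonical_decomposition("\u0344"))
print(unicodedata.normalize("NFC", "\u0344") == "\u0308\u0301")
```

The U+0344 case is the odd one out: a normalizer has to expand a combining mark into two marks in every form, which is the sort of special case described above.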