John Cowan noted:

<quote>
Here's what happens exactly:

source       simple case folding   full case folding      tr/az case folding
dotted i     dotted i              dotted i               dotted i
dotless i    dotless i             dotless i              dotless i
dotted I     dotted I              dotted i + comb. dot   dotted i
dotless I    dotted i              dotted i               dotless i
</quote>

Add to that specification of the case *folding* (from CaseFolding.txt), the default case *mappings* (from UnicodeData.txt):

source        default lc mapping   default uc mapping
dotted i      dotted i             (dotless) I
dotless i     dotless i            (dotless) I
dotted I      dotted i             dotted I
(dotless) I   dotted i             (dotless) I

If you are case *folding* you are doing one thing; if you are case *mapping* you are doing another.

Case *folding* creates equivalence classes for different sequences.

Simple case folding, as defined above, creates the following equivalence classes, adding in the sequences involving use of the combining dot as well.

  A. { i, I }
  B. { dotless i }
  C. { dotted I }
  D. { <i, dot above>, <I, dot above> }
  E. { <dotless i, dot above> }
  F. { <dotted I, dot above> }

These 6 classes are distinguished. They do not conflate, although in class A and in class D, there are two sequences which do fold together.

Full case folding, as defined above, creates the following equivalence classes.

  A. { i, I }
  B. { dotless i }
  G. { dotted I, <i, dot above>, <I, dot above> }
  E. { <dotless i, dot above> }
  F. { <dotted I, dot above> }

In other words, there are now 5, not 6 equivalence classes, as the classes C and D from simple case folding have been conflated.

Turkic/Azeri case folding, as defined above, creates the following equivalence classes.

  H. { i, dotted I }
  I. { dotless i, I }
  J. { <i, dot above>, <dotted I, dot above> }
  K. { <dotless i, dot above>, <I, dot above> }

And now there are 4 *different* equivalence classes, which group together the sequences which make sense for Turkish/Azeri.

Note that none of the 3 sets of equivalence classes violates *canonical* equivalence, because none of the 8 sequences involved is canonically equivalent to any other.
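The class counts above (6, 5 and 4) can be checked mechanically. What follows is only a minimal sketch in Python: the three dictionaries are hypothetical tables transcribed by hand from the folding table quoted above, not data read from CaseFolding.txt, and the names fold and equivalence_classes are made up for the illustration. Python's built-in str.casefold() is deliberately not used, since it offers only one kind of folding.

```python
from collections import defaultdict

DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
SMALL_I = "i"          # U+0069
CAPITAL_I = "I"        # U+0049, the "dotless I" of the table
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I
DOTTED_I = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Hypothetical per-character folding tables, transcribed from the table
# above. Characters that are not listed fold to themselves.
SIMPLE = {CAPITAL_I: SMALL_I}
FULL = {CAPITAL_I: SMALL_I, DOTTED_I: SMALL_I + DOT_ABOVE}
TURKIC = {CAPITAL_I: DOTLESS_I, DOTTED_I: SMALL_I}

# The 8 sequences: each base letter alone, then followed by a combining dot.
BASES = [SMALL_I, CAPITAL_I, DOTLESS_I, DOTTED_I]
SEQUENCES = BASES + [base + DOT_ABOVE for base in BASES]


def fold(text, table):
    """Fold a string one character at a time using the given table."""
    return "".join(table.get(ch, ch) for ch in text)


def equivalence_classes(table):
    """Group the 8 sequences by their folded form."""
    groups = defaultdict(list)
    for seq in SEQUENCES:
        groups[fold(seq, table)].append(seq)
    return list(groups.values())


def describe(seq):
    """Render a sequence as code points, e.g. 'U+0069 U+0307'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in seq)


if __name__ == "__main__":
    for name, table in [("simple", SIMPLE), ("full", FULL), ("tr/az", TURKIC)]:
        classes = equivalence_classes(table)
        print(f"{name}: {len(classes)} classes")
        for members in classes:
            print("   {", ", ".join(describe(s) for s in members), "}")
```

Printing code points rather than the characters themselves is intentional: several of these sequences render identically in most fonts, which is part of the problem.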
In other words, no matter which of the 3 approaches you take to case folding, in no instance are you claiming that canonically equivalent sequences are to be interpreted differently.

Now let's look at what happens with case *mapping*, using the default mappings of UnicodeData.txt.

Lowercasing first:

  L. { i, I, dotted I }  --> i
  B. { dotless i }  --> dotless i
  M. { <i, dot above>, <I, dot above>, <dotted I, dot above> }  --> <i, dot above>
  E. { <dotless i, dot above> }  --> <dotless i, dot above>

Uppercasing next:

  N. { i, I, dotless i }  --> I
  C. { dotted I }  --> dotted I
  O. { <i, dot above>, <I, dot above>, <dotless i, dot above> }  --> <I, dot above>
  F. { <dotted I, dot above> }  --> <dotted I, dot above>

The classes of sequences that get conflated are different here. In particular, classes L, M, N, O conflate characters that are not conflated by the formal definition of case folding.

So, in particular, one should *not* expect the results of case mapping, followed by a binary comparison, to be the same as a formal case folding comparison. There will be differences. Any implementation that does not take this into account is still confused (aren't we all?) in its handling of these letters.

Now add to that the problem of which of the elements in the equivalence classes *look* the same, and you have the potential for even more confusion. In particular, in simple case folding, you have the equivalence classes:

  A. { i, I }
  E. { <dotless i, dot above> }

Members of class E are *not* equivalent to members of class A. But of course, <dotless i, dot above> *looks like* i and does *not* look like I. Add in the others, plus all the potential differences in how fonts may implement the soft-dotted property, and this entire area can lead to total confusion.

One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.
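The difference between "map, then compare" and "fold, then compare" can be made concrete. This is a rough sketch in Python under stated assumptions: the three dictionaries are hypothetical tables transcribed by hand from the default mappings and the simple folding described in this message (not read from UnicodeData.txt or CaseFolding.txt), and map_text and same_after are names made up for the illustration.

```python
DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
SMALL_I = "i"          # U+0069
CAPITAL_I = "I"        # U+0049
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I
DOTTED_I = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Hypothetical tables; characters that are not listed map to themselves.
DEFAULT_LOWER = {CAPITAL_I: SMALL_I, DOTTED_I: SMALL_I}
DEFAULT_UPPER = {SMALL_I: CAPITAL_I, DOTLESS_I: CAPITAL_I}
SIMPLE_FOLD = {CAPITAL_I: SMALL_I}


def map_text(text, table):
    """Apply a per-character mapping table to a string."""
    return "".join(table.get(ch, ch) for ch in text)


def same_after(a, b, table):
    """Binary comparison of two strings after applying the same table."""
    return map_text(a, table) == map_text(b, table)


if __name__ == "__main__":
    # Class L: lowercasing conflates I and dotted I ...
    print(same_after(CAPITAL_I, DOTTED_I, DEFAULT_LOWER))   # True
    # ... but simple case folding keeps them apart (classes A and C).
    print(same_after(CAPITAL_I, DOTTED_I, SIMPLE_FOLD))     # False

    # Class N: uppercasing conflates i and dotless i ...
    print(same_after(SMALL_I, DOTLESS_I, DEFAULT_UPPER))    # True
    # ... but simple case folding keeps them apart (classes A and B).
    print(same_after(SMALL_I, DOTLESS_I, SIMPLE_FOLD))      # False

    # <dotless i, dot above> looks like i but does not fold together with it.
    print(same_after(DOTLESS_I + DOT_ABOVE, SMALL_I, SIMPLE_FOLD))  # False
```

A caseless comparison built on lowercasing would therefore match strings that one built on simple case folding would not, and the same holds for uppercasing.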
If you subtract out all the superfluous combinations cited above with combining dots (for completeness), then the situation becomes much simpler and more comprehensible:

Simple case folding. [disallows string length change]

  A. { i, I }
  B. { dotless i }
  C. { dotted I }

Full case folding. [allows string length change]

  A. { i, I }
  B. { dotless i }
  G. { dotted I }  [represented in folded form as <i, dot above>]

Turkic/Azeri case folding.

  H. { i, dotted I }
  I. { dotless i, I }

Lowercasing:

  L. { i, I, dotted I }  --> i
  B. { dotless i }  --> dotless i

Uppercasing:

  N. { i, I, dotless i }  --> I
  C. { dotted I }  --> dotted I

Add in Turkic locale-specific special casing.

Lowercasing:

  H. { i, dotted I }  --> i
  I. { dotless i, I }  --> dotless i

Uppercasing:

  H. { i, dotted I }  --> dotted I
  I. { dotless i, I }  --> I

That is *still* complicated enough. But you could at least copy that out, paste it on the wall, and expect an engineer to get it right in an implementation.

By the way, the UTC has been over this stuff so many times that the topic is by now one that elicits groans of "Not those damn Turkish i's again!" when brought up in the meetings. It is very unlikely that the current specification is going to be changed again in any way. Nothing anyone could do could improve the situation. All it would accomplish would be to destabilize any implementation that people already have of this stuff.

Anyone who -- in Unicode data -- adds combining dots to i's deserves the trouble they will get into. And anyone who tries to represent dotted i's by putting combining dots on dotless i's also deserves the trouble they will get into. (The same will be true of j's, once the recently approved dotless j character is published.)

Also, beware of two of the big warnings provided in the Unicode Standard and the Unicode Character Database about this stuff:

  I. No casing operations are reversible.
  II. Casing operations ... do not preserve normalization form.
(This is true both of case mapping and of case folding.)

And, as the Turkish i's illustrate, case mappings are not one-to-one in a functional sense. A lowercasing may conflate two distinct uppercase characters into a single lowercase, and an uppercasing may conflate two distinct lowercase characters into a single uppercase.

Ignore these facts at your peril and at the peril of the customers who depend on your implementations.

--Ken