Re: Case mapping of dotless lowercase letters

Kenneth Whistler Tue, 16 Dec 2003 18:57:13 -0800

John Cowan noted:

<quote>
Here's what happens exactly:


 source         simple case folding     full case folding       tr/az case folding
 dotted i       dotted i                dotted i                dotted i
 dotless i      dotless i               dotless i               dotless i
 dotted I       dotted I                dotted i + comb. dot    dotted i
 dotless I      dotted i                dotted i                dotless i
</quote>

Add to that specification of the case *folding* (from
CaseFolding.txt), the default case *mappings* (from
UnicodeData.txt):

 source         default lc mapping      default uc mapping
 dotted i       dotted i                (dotless) I
 dotless i      dotless i               (dotless) I
 dotted I       dotted i                dotted I
 (dotless) I    dotted i                (dotless) I
 
If you are case *folding* you are doing one thing; if you are
case *mapping* you are doing another.

Case *folding* creates equivalence classes for different sequences.

Simple case folding, as defined above, creates the following 
equivalence classes, adding in the sequences involving use of
the combining dot as well.

   A. { i, I }
   B. { dotless i }
   C. { dotted I }
   D. { <i, dot above>, <I, dot above> }
   E. { <dotless i, dot above> }
   F. { <dotted I, dot above> }
   
These 6 classes stay distinct from one another. The only conflation
is within class A and within class D, each of which contains two
sequences that fold together.

Full case folding, as defined above, creates the following
equivalence classes.

   A. { i, I }
   B. { dotless i }
   G. { dotted I, <i, dot above>, <I, dot above> }
   E. { <dotless i, dot above> }
   F. { <dotted I, dot above> }
   
In other words, there are now 5, not 6 equivalence classes, as the
classes C and D from simple case folding have been conflated.

Turkic/Azeri case folding, as defined above, creates the following
equivalence classes.

   H. { i, dotted I }
   I. { dotless i, I }
   J. { <i, dot above>, <dotted I, dot above> }
   K. { <dotless i, dot above>, <I, dot above> }
   
And now there are 4 *different* equivalence classes, which group
together the sequences which make sense for Turkish/Azeri.

Note that none of the 3 sets of equivalence classes violates
*canonical* equivalence, because none of the 8 sequences involved
is canonically equivalent to any other. In other words, no matter
which of the 3 approaches you take to case folding, in no instance
are you claiming that canonically equivalent sequences are to be
interpreted differently.
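The three sets of equivalence classes above can be reproduced mechanically. What follows is only a sketch: the three folding tables are transcribed by hand from the quoted table (they are not read from CaseFolding.txt), and the names `BASES`, `fold` and `classes` are made up for the illustration.

```python
from collections import defaultdict

DOT = "\u0307"  # COMBINING DOT ABOVE

# The four base letters discussed in this thread.
BASES = {"i": "i", "I": "I", "dotless i": "\u0131", "dotted I": "\u0130"}

# Per-character folding tables, transcribed from the quoted table above
# (an assumption of this sketch, not parsed from CaseFolding.txt).
SIMPLE = {"I": "i"}
FULL = {"I": "i", "\u0130": "i" + DOT}
TURKIC = {"I": "\u0131", "\u0130": "i"}

def fold(text, table):
    """Fold each character through the table; unlisted characters are kept."""
    return "".join(table.get(ch, ch) for ch in text)

def classes(table):
    """Group the 8 sequences (4 letters, with and without the dot) by folded form."""
    groups = defaultdict(list)
    for name, ch in BASES.items():
        groups[fold(ch, table)].append(name)
        groups[fold(ch + DOT, table)].append(f"<{name}, dot above>")
    return sorted(groups.values())

for label, table in [("simple", SIMPLE), ("full", FULL), ("tr/az", TURKIC)]:
    found = classes(table)
    print(label, "folding:", len(found), "classes")
    for members in found:
        print("   {", ", ".join(members), "}")
```

Under those assumptions the script reports 6, 5 and 4 classes respectively, matching the counts given above.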

Now let's look at what happens with case *mapping*, using the
default mappings of UnicodeData.txt.

Lowercasing first:

   L. { i, I, dotted I } --> i
   B. { dotless i }      --> dotless i
   M. { <i, dot above>, <I, dot above>, <dotted I, dot above> }
                         --> <i, dot above>
   E. { <dotless i, dot above> } --> <dotless i, dot above>
   
Uppercasing next:

   N. { i, I, dotless i } --> I
   C. { dotted I }        --> dotted I
   O. { <i, dot above>, <I, dot above>, <dotless i, dot above> }
                         --> <I, dot above>
   F. { <dotted I, dot above> } --> <dotted I, dot above>
   
The classes of sequences that get conflated are different here. In
particular, classes L, M, N, O conflate characters that are not
conflated by the formal definition of case folding.

So, in particular, one should *not* expect the results of case
mapping, followed by a binary comparison, to be the same as
a formal case folding comparison. There will be differences.
Any implementation that does not take this into account is still
confused (aren't we all?) in its handling of these letters.
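A small Python check of that difference, offered as an illustration rather than a specification. It assumes Python 3, where `str.upper()` applies the default (locale-independent) uppercase mapping and `str.casefold()` applies full case folding. The two helper names are invented for the example.

```python
def equal_after_upper(a, b):
    """Caseless match by uppercasing both sides, then comparing binary."""
    return a.upper() == b.upper()

def equal_after_casefold(a, b):
    """Caseless match by case folding both sides, then comparing binary."""
    return a.casefold() == b.casefold()

dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Uppercasing puts dotless i and I in the same class (class N above).
print(equal_after_upper(dotless_i, "I"))     # True

# Case folding keeps them apart (classes A and B above).
print(equal_after_casefold(dotless_i, "I"))  # False
```

The same pair of strings therefore matches under one approach and not under the other, which is exactly the kind of difference described above.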

Now add to that the problem of which of the elements in the
equivalence classes *look* the same, and you have the potential
for even more confusion. In particular, in simple case folding,
you have the equivalence classes:

   A. { i, I }
   E. { <dotless i, dot above> }
  
Members of class E are *not* equivalent to members of class A.
But of course, <dotless i, dot above> *looks like* i and does
*not* look like I. Add in the others, plus all the potential
differences in how fonts may implement the soft-dotted
property, and this entire area can lead to total confusion.

One moral of the story is: DO NOT USE COMBINING DOTS WITH I's.

If you set aside the combinations with combining dots, which were
cited above only for completeness, then the situation
becomes much simpler and more comprehensible:

Simple case folding. [disallows string length change]

   A. { i, I }
   B. { dotless i }
   C. { dotted I }
   
Full case folding.   [allows string length change]

   A. { i, I }
   B. { dotless i }
   G. { dotted I }  [represented in folded form as <i, dot above>]
   
Turkic/Azeri case folding.

   H. { i, dotted I }
   I. { dotless i, I }

Lowercasing:

   L. { i, I, dotted I } --> i
   B. { dotless i }      --> dotless i
   
Uppercasing:

   N. { i, I, dotless i } --> I
   C. { dotted I }        --> dotted I

Add in Turkic locale-specific special casing.

Lowercasing:

   H. { i, dotted I }     --> i
   I. { dotless i, I }    --> dotless i
   
Uppercasing:

   H. { i, dotted I }     --> dotted I
   I. { dotless i, I }    --> I
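The Turkic tables just above could be sketched in Python as follows. This is a hedged illustration, not a complete locale implementation: `tr_lower` and `tr_upper` are hypothetical helper names, Python's own `str.lower()` and `str.upper()` are locale-independent, and the sketch deliberately ignores sequences with combining dots.

```python
# Turkic-specific pairs, taken from the lowercasing/uppercasing tables above.
TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # dotted I -> i, I -> dotless i
TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> dotted I, dotless i -> I

def tr_lower(text):
    """Lowercase with the Turkic i rules applied before the default mapping."""
    return text.translate(TR_LOWER).lower()

def tr_upper(text):
    """Uppercase with the Turkic i rules applied before the default mapping."""
    return text.translate(TR_UPPER).upper()

print(tr_upper("istanbul"))  # dotted capital I, then STANBUL
print(tr_lower("ISPARTA"))   # dotless small i, then sparta
```

Handling the four letters first and leaving everything else to the default mapping keeps the locale-specific part down to the two small tables that one could paste on the wall.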
   
That is *still* complicated enough. But you could at least copy that
out, paste it on the wall, and expect an engineer to get it right
in an implementation.

By the way, the UTC has been over this stuff so many times that the
topic is by now one that elicits groans of "Not those damn Turkish
i's again!" when brought up in the meetings. It is very unlikely that
the current specification is going to be changed again in any
way. Nothing anyone could do could improve the situation. All it
would accomplish would be to destabilize any implementation that people
already have of this stuff.

Anyone who -- in Unicode data -- adds combining dots to i's deserves
the trouble they will get into. And anyone who tries to represent
dotted i's by putting combining dots on dotless i's also deserves
the trouble they will get into. (The same will be true of j's, once
the recently approved dotless j character is published.)

Also, beware of two of the big warnings provided in the Unicode
Standard and the Unicode Character Database about this stuff:

I. No casing operations are reversible.

II. Casing operations ... do not preserve normalization form.
    (This is true both of case mapping and of case folding.)
    
And, as the Turkish i's illustrate, case mappings are not
one-to-one in a functional sense. A lowercasing may conflate
two distinct uppercase characters into a single lowercase,
and an uppercasing may conflate two distinct lowercase
characters into a single uppercase.
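The uppercasing half of that statement is easy to observe in Python 3. The sketch below relies only on the built-in `str.upper()` and `str.lower()`; the helper name `survives_round_trip` is invented here. The lowercasing half is not shown, because Python's `lower()` applies the full mapping to dotted I rather than the simple mapping tabulated earlier.

```python
def survives_round_trip(ch):
    """True if uppercasing and then lowercasing gives back the original."""
    return ch.upper().lower() == ch

dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Two distinct lowercase letters share a single uppercase form.
print("i".upper() == dotless_i.upper())  # True

# So the uppercasing cannot be undone for dotless i.
print(survives_round_trip("i"))          # True
print(survives_round_trip(dotless_i))    # False
```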

Ignore these facts at your peril and at the peril of the customers
who depend on your implementations.

--Ken
