Antoine,
Case folding is very useful for Turkish. For example, "Istanbul" is spelled
with an uppercase I WITH DOT ABOVE in Turkish. By case folding, both
versions are converted to "istanbul" for matching purposes.
Case folding also converts the Greek beta symbol to a small letter beta.
In essence, case folding is the equivalent of a shift to upper followed by
a shift to lower.
The I shifts are:
To upper:
0049 -> 0049
0069 -> 0049
0130 -> 0130
0131 -> 0049
To lower:
0049 -> 0069
0130 -> 0069
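To make the upper-then-lower equivalence concrete, here is a rough Python
sketch. The two tables hold only the mappings listed above. The names
TO_UPPER, TO_LOWER and fold_char are made up for illustration, and Python's
built-in str.upper() / str.lower() are deliberately not used, since they
follow later Unicode data and may not agree with these tables.

```python
# Mappings copied from the lists above; any code point not listed is
# assumed to map to itself (an assumption made for this sketch only).
TO_UPPER = {0x0049: 0x0049, 0x0069: 0x0049, 0x0130: 0x0130, 0x0131: 0x0049}
TO_LOWER = {0x0049: 0x0069, 0x0130: 0x0069}


def fold_char(cp: int) -> int:
    """Hypothetical helper: shift to upper, then shift to lower."""
    upper = TO_UPPER.get(cp, cp)
    return TO_LOWER.get(upper, upper)


# Show where each of the four I variants ends up.
for cp in (0x0049, 0x0069, 0x0130, 0x0131):
    print(f"{cp:04X} -> {fold_char(cp):04X}")
```

With these tables all four I variants end up at 0069.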
The only real difference is that all sigmas fold to the non-final sigma.
There is no need for the sigma adjustment since the text is for comparison
purposes only.
What I am suggesting is that removing the COMBINING DOT ABOVE after any i
will produce a better matching string. I can find no instance where
dropping it will cause false matches. Not dropping it will produce false
mismatches.
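A small Python sketch of what I am proposing, under some assumptions: the
FOLD table covers only the I variants discussed here, the function name
fold and its drop_dot switch are invented for the example, and the dot is
dropped only when it directly follows the folded i (a simplification).

```python
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# Illustrative folding table: only the I variants from this thread.
FOLD = {"I": "i", "\u0130": "i", "\u0131": "i"}


def fold(text: str, drop_dot: bool = True) -> str:
    """Fold the I variants; optionally drop a dot above right after an i."""
    out = []
    for ch in text:
        if drop_dot and ch == DOT_ABOVE and out and out[-1] == "i":
            continue  # skip the combining dot that follows a folded i
        out.append(FOLD.get(ch, ch))
    return "".join(out)


composed = "\u0130stanbul"     # capital I WITH DOT ABOVE, precomposed
decomposed = "I\u0307stanbul"  # capital I + COMBINING DOT ABOVE

# Dropping the dot: both spellings fold to the same string.
print(fold(composed) == fold(decomposed))
# Keeping the dot: the decomposed spelling no longer matches.
print(fold(composed, drop_dot=False) == fold(decomposed, drop_dot=False))
```

The first comparison is the match I want; the second is the false mismatch
you get when the dot is left in place.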
Carl
-----Original Message-----
From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 11:19 AM
To: Unicode List
Subject: RE: UCD 3.1, Final Beta - Case folding
-----Original Message-----
From: Antoine Leca [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 9:57 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding
>Carl W. Brown wrote:
>>
>> I noticed that there is no mention of the casing special case:
>>
>> # Lithuanian
>>
>> 0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase
>>
>> The case folding is locale-less so it seems to me that it is probably
>> better to remove the COMBINING DOT ABOVE after all 'i' / 'I' regardless
>> of locale to make it work for Lithuanian. I doubt that this will cause
>> serious problems with caseless compares for other locales.
>I think the 'I' above is a typo, isn't it? You meant 'j', don't you?
I do mean 'i' not 'j'.
>If not, please consider a Turkish text, fully decomposed: there, a
>dot_above U+0307 following an uppercase I U+0049 should certainly *not*
>be dropped.
This works for Turkish as well. Case folding folds dotted and dotless i
into 'i'.
0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I
By removing the COMBINING DOT ABOVE, the fully decomposed text will match
the composed text and therefore be a better representation of case folding.
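For anyone who wants to check the three entries quoted above, here is a
rough Python sketch that reads them into a lookup table. It assumes the
field layout visible in those lines (code; status; mapping; # name), and
the names ENTRIES and parse_folding are made up for the example.

```python
# The three folding entries quoted above, verbatim.
ENTRIES = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I
"""


def parse_folding(text: str) -> dict:
    """Hypothetical parser: map each source character to its folded form."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # discard the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code, _status, mapping = fields[:3]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table[chr(int(code, 16))] = folded
    return table


table = parse_folding(ENTRIES)
# Every entry folds to the same target, plain 'i'.
print(set(table.values()) == {"i"})
```

Since all three entries land on plain 'i', a leftover COMBINING DOT ABOVE
is the only thing that can still separate the decomposed spelling from the
composed one after folding.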
>Antoine