RE: UCD 3.1, Final Beta - Case folding

Carl W. Brown Tue, 06 Mar 2001 08:28:02 -0800
Antoine,

Case folding is very useful for Turkish.  For example, "Istanbul" is spelled
with an uppercase I WITH DOT ABOVE (U+0130) in Turkish.  By case folding,
both versions are converted to "istanbul" for matching purposes.

Case folding also converts the Greek beta symbol to a small letter beta.

In essence case folding is the equivalent of shift to upper followed by a
shift to lower.

The I shifts are

To upper:

0049 -> 0049
0069 -> 0049
0130 -> 0130
0131 -> 0049

To lower:

0049 -> 0069
0130 -> 0069

The only real difference is that all sigmas become the non-final sigma.
There is no need for the final-sigma adjustment since the text is for
comparison purposes only.
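
As a rough sketch of the "upper, then lower" idea, here is a small Python
illustration.  The names TO_UPPER, TO_LOWER and fold are made up for this
example, the tables hold only the I mappings listed above, and every other
character falls back to Python's own per-character str.upper / str.lower,
which is an assumption and not the UCD case folding data itself.

```python
# Hypothetical helper tables: only the I mappings listed in the message.
TO_UPPER = {"\u0049": "\u0049", "\u0069": "\u0049",
            "\u0130": "\u0130", "\u0131": "\u0049"}
TO_LOWER = {"\u0049": "\u0069", "\u0130": "\u0069"}


def fold(text):
    """Shift to upper, then to lower, one character at a time."""
    # Assumption: characters outside the tables use Python's built-in casing.
    upper = "".join(TO_UPPER.get(ch, ch.upper()) for ch in text)
    # Lowering character by character never sees word context, so every
    # sigma comes out as the non-final form.
    return "".join(TO_LOWER.get(ch, ch.lower()) for ch in upper)


print(fold("Istanbul") == fold("\u0130stanbul"))   # both give "istanbul"
print(fold("\u03c2") == fold("\u03a3"))            # final sigma vs capital
```

Because the result is only used for comparison, it does not matter that the
folded string is not a well-spelled lowercase form.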

What I am suggesting is that removing the COMBINING DOT ABOVE after any i
will produce a better matching string.  I can find no instance where
dropping it will cause false matches.  Not dropping it will produce false
mismatches.

Carl



-----Original Message-----
From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 11:19 AM
To: Unicode List
Subject: RE: UCD 3.1, Final Beta - Case folding




-----Original Message-----
From: Antoine Leca [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 9:57 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding


>Carl W. Brown wrote:
>>
>> I noticed that there is no mention of the casing special case:
>>
>> # Lithuanian
>>
>> 0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or
>> titlecase
>>
>> The case folding is locale-less, so it seems to me that it is probably
>> better to remove the COMBINING DOT ABOVE after all 'i' / 'I' regardless
>> of locale to make it work for Lithuanian.  I doubt that this will cause
>> serious problems with caseless compares for other locales.

>I think the 'I' above is a typo, isn't it? You meant 'j', don't you?

I do mean 'i' not 'j'.

>If not, please consider a Turkish text, fully decomposed: there, a
>dot_above U+0307 following an uppercase I U+0049 should certainly *not*
>be dropped.

This works for Turkish as well.  Case folding folds dotted and dotless i
into 'i'.

0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I

By removing the COMBINING DOT ABOVE, the fully decomposed text will match
the composed text and therefore be a better representation of case folding.
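
To make the composed versus decomposed point concrete, here is a minimal
Python sketch.  The names I_FOLDS and fold_for_match are invented for this
example; the table holds only the three entries quoted above, and all other
characters fall back to str.lower as a stand-in for the full folding data.
It handles only a dot that directly follows the i, with no other combining
marks in between.

```python
import unicodedata

# The three fold entries quoted above (code point -> folded code point).
I_FOLDS = {"\u0049": "\u0069", "\u0130": "\u0069", "\u0131": "\u0069"}
COMBINING_DOT_ABOVE = "\u0307"


def fold_for_match(text):
    """Fold for comparison, dropping U+0307 directly after a folded i."""
    out = []
    for ch in text:
        if ch == COMBINING_DOT_ABOVE and out and out[-1] == "i":
            continue  # the proposed rule: remove the dot after any i
        # Assumption: str.lower approximates folding for everything else.
        out.append(I_FOLDS.get(ch, ch.lower()))
    return "".join(out)


composed = "\u0130stanbul"                            # U+0130 + "stanbul"
decomposed = unicodedata.normalize("NFD", composed)   # U+0049 U+0307 + ...
print(fold_for_match(composed) == fold_for_match(decomposed))
```

Without the dot-dropping step the decomposed form would fold to "i" followed
by U+0307 and fail to match the composed form.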


>Antoine