Antoine,

One difference between upper/lower case shifting and case folding is that case folding 
is locale-less.

Folding gives the same result as an upper case then lower case shift in a locale that 
has no special casing rules, such as English or French.
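To make the comparison concrete, here is a minimal sketch in Python, whose built-in str.casefold(), str.upper() and str.lower() are all locale-independent. The helper name round_trip and the sample words are my own choices for illustration. The two approaches agree on these samples; I am not claiming they agree on every string.

```python
def round_trip(s: str) -> str:
    """Upper case then lower case shift, using locale-independent rules."""
    return s.upper().lower()

# Sample words chosen for illustration only.
samples = ["Straße", "HELLO", "École"]

for word in samples:
    # Show the folded form next to the round-trip form.
    print(word, word.casefold(), round_trip(word))
```

Note that the sharp-s expands to "ss" in both columns, so "Straße" and "STRASSE" compare equal after either operation.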

You cannot just remove accents, especially in a locale-less function.  Sometimes the 
accent makes it a separate letter.  Removing the ring above the A in Danish probably 
would not create too many mismatches, but it would mess up sorting sequences (A with 
ring above is the last letter in the alphabet).  Your real problem would probably be 
languages like Vietnamese, which has many short words that are distinguished only by 
tone marks or by the use of different vowels.  These vowels are represented by the 
same base letter with different accent marks.
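A rough sketch of what goes wrong, assuming accent removal is done the usual way: decompose to NFD and drop the nonspacing marks. The helper strip_marks is a name I made up for this illustration, and the word list is just a handful of Vietnamese syllables that differ only in tone mark.

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Six distinct Vietnamese words that differ only in tone mark.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]

# After stripping, every one of them collapses to the same string.
collapsed = {strip_marks(w) for w in words}
print(collapsed)

# The Danish letter loses its ring and becomes a plain "a".
print(strip_marks("å"))
```

Six different words become indistinguishable, which is far worse than the occasional Danish mismatch.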

Yes, case shifting destroys the Turkish and Azeri ı/I and i/İ relationship.
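Here is a small sketch of the loss, using Python's locale-independent str.upper() and str.lower() as a stand-in for a shift with no Turkish rules. The variable names are mine.

```python
DOTLESS_I = "\u0131"        # ı LATIN SMALL LETTER DOTLESS I
CAPITAL_DOTTED_I = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Without Turkish rules, dotless ı upper-cases to plain I ...
upper_of_dotless = DOTLESS_I.upper()

# ... and plain I lower-cases to dotted i, so the round trip changes the letter.
round_trip_of_dotless = DOTLESS_I.upper().lower()

# Capital dotted İ lower-cases to i followed by COMBINING DOT ABOVE.
lower_of_capital_dotted = CAPITAL_DOTTED_I.lower()

print(upper_of_dotless, round_trip_of_dotless, len(lower_of_capital_dotted))
```

Once ı has passed through I it comes back as i, so the two Turkish pairs are merged.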

The case that I was referring to was the Lithuanian lower case dotted i followed by a 
COMBINING DOT ABOVE, which becomes a plain (dotless) upper case I when shifted.  In 
other words, the "two dot" lower case i becomes a standard upper case I.  A round trip 
upper/lower case shift in the "lt" locale will remove the COMBINING DOT ABOVE after 
the i.  This is like changing the German sharp-s to "ss" so that it will match "SS" 
shifted to lower case.
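Python has no locale-aware case shift, so the following is only a simplified sketch of the Lithuanian upper case rule as I described it above. The function lt_upper_sketch is hypothetical: it handles only a COMBINING DOT ABOVE that directly follows i or j, and ignores the other conditions a real "lt" implementation would have to check.

```python
COMBINING_DOT_ABOVE = "\u0307"

def lt_upper_sketch(s: str) -> str:
    """Simplified, hypothetical Lithuanian-style upper casing.

    Drops a COMBINING DOT ABOVE that directly follows i or j,
    then applies the default upper case shift.
    """
    kept = []
    prev = ""
    for ch in s:
        if ch == COMBINING_DOT_ABOVE and prev in ("i", "j"):
            prev = ch  # skip the dot; do not keep it in the output
            continue
        kept.append(ch)
        prev = ch
    return "".join(kept).upper()

dotted = "i" + COMBINING_DOT_ABOVE

default_upper = dotted.upper()         # default rules keep the dot
lt_upper = lt_upper_sketch(dotted)     # sketch of the "lt" rule drops it
round_trip = lt_upper.lower()          # the dot does not come back

print(len(default_upper), lt_upper, round_trip)
```

The round trip yields a bare i, which is what lets it match text that never had the extra dot.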

Carl

 





-----Original Message-----
From: Antoine Leca [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 06, 2001 8:02 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding



Carl W. Brown wrote:
> 
> From: Antoine Leca [mailto:[EMAIL PROTECTED]]
> 
> >Carl W. Brown wrote:
> >>
> >> The case folding is locale-less so it seems to me that it is probably
> >> better to remove the COMBINING DOT ABOVE after all 'i' / 'I'
> >> regardless of locale
> >> to make it work for Lithuanian.  I doubt that this will cause serious
> >> problems with caseless compares for other locales.
> 
> >please consider a Turkish text, fully decomposed: there, a dot_above
> >U+0307 following an uppercase I U+0049 should certainly *not* be dropped.
> 
> This works for Turkish as well.  Case folding folds dotted and dotless i
> into 'i'.

This is where I do not understand.

You are saying that for a Turkish user, the result of the caseless comparison
will be that ı/I and i/İ will be fully intermixed.

My understanding was that they expect all the ı/I (regardless of the case)
to come before all the i/İ. Did I miss something?

Or, viewed from another angle, I was not sure that İstanbul should match
Istanbul in a _Turkish_ caseless search.

OTOH, I am neither a Turkish expert nor an i18n expert, so perhaps caseless
comparisons should ignore all accents and the like (i.e. grouping c and č,
и and й, etc. Perhaps I am overemphasizing, but I hope you get the idea).


Antoine
