RE: UCD 3.1, Final Beta - Case folding

2001-03-06 Thread Carl W. Brown

Antoine,

Case folding is very useful for Turkish.  For example, "Istanbul" is spelled
with an uppercase I WITH DOT ABOVE in Turkish.  By case folding, both versions
are converted to "istanbul" for matching purposes.

Case folding also converts the Greek beta symbol to a small letter beta.

In essence, case folding is the equivalent of a shift to upper case followed
by a shift to lower case.

The I shifts are

To upper:

0049 - 0049
0069 - 0049
0130 - 0130
0131 - 0049

To lower:

0049 - 0069
0130 - 0069

The only real difference is that all sigmas are the non-final sigma.  There
is no need for the sigma adjustment since the text is for comparison purposes
only.
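
As a rough illustration, the I tables above can be composed in a few lines of
Python.  This is only a sketch: the names TO_UPPER, TO_LOWER and fold are made
up for this example, the two tables hold just the entries listed in this
message, and every other character falls back on Python's built-in str.upper
and str.lower, which is an assumption of the sketch and not part of the UCD
data.

```python
# Only the I entries listed in this message; everything else is delegated to
# Python's own case mappings (an assumption made for this sketch).
TO_UPPER = {
    "\u0049": "\u0049",  # I -> I
    "\u0069": "\u0049",  # i -> I
    "\u0130": "\u0130",  # I WITH DOT ABOVE -> itself
    "\u0131": "\u0049",  # DOTLESS I -> I
}
TO_LOWER = {
    "\u0049": "\u0069",  # I -> i
    "\u0130": "\u0069",  # I WITH DOT ABOVE -> i
}


def fold_char(ch):
    """Shift one character to upper case, then shift the result to lower case."""
    upper = TO_UPPER.get(ch, ch.upper())
    # ch.upper() may expand to several characters, so lower each one in turn.
    return "".join(TO_LOWER.get(u, u.lower()) for u in upper)


def fold(text):
    """Locale-less 'upper then lower' fold, applied character by character."""
    return "".join(fold_char(ch) for ch in text)


print(fold("Istanbul") == fold("\u0130stanbul"))
```

Because the fold works one character at a time, a final sigma is never seen in
context and comes out as the non-final sigma, which is all that a comparison
needs.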

What I am suggesting is that removing the COMBINING DOT ABOVE after any i
will produce a better matching string.  I can find no instance where
dropping it will cause false matches.  Not dropping it will produce false
mismatches.
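
A minimal sketch of that suggestion, for illustration only.  It uses Python's
built-in str.casefold, which follows whatever Unicode version the interpreter
ships with and so may not agree with the 3.1 beta tables discussed here.  The
function name fold_drop_dot is hypothetical.

```python
import re

COMBINING_DOT_ABOVE = "\u0307"


def fold_drop_dot(text):
    """Case fold, then drop a COMBINING DOT ABOVE that directly follows an i."""
    folded = text.casefold()
    # After folding, any 'I' has already become 'i', so one pattern covers both.
    return re.sub("(?<=i)" + COMBINING_DOT_ABOVE, "", folded)


composed = "\u0130stanbul"     # precomposed I WITH DOT ABOVE
decomposed = "I\u0307stanbul"  # I followed by COMBINING DOT ABOVE
print(fold_drop_dot(composed) == fold_drop_dot(decomposed))
```

Note that the dot is removed only after i.  A dot following any other letter,
such as j, is left in place.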

Carl



-Original Message-
From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 11:19 AM
To: Unicode List
Subject: RE: UCD 3.1, Final Beta - Case folding




-Original Message-
From: Antoine Leca [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 05, 2001 9:57 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding


Carl W. Brown wrote:

 I noticed that there is no mention of the casing special case:

 # Lithuanian

 0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase

 The case folding is locale-less, so it seems to me that it is probably
 better to remove the COMBINING DOT ABOVE after all 'i' / 'I' regardless of
 locale to make it work for Lithuanian.  I doubt that this will cause serious
 problems with caseless compares for other locales.

I think the 'I' above is a typo, isn't it? You meant 'j', didn't you?

I do mean 'i' not 'j'.

If not, please consider a Turkish text, fully decomposed: there, a dot_above
U+0307 following an uppercase I U+0049 should certainly *not* be dropped.

This works for Turkish as well.  Case folding folds dotted and dotless i
into 'i'.

0049; C; 0069; # LATIN CAPITAL LETTER I
0130; I; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0131; I; 0069; # LATIN SMALL LETTER DOTLESS I

By removing the COMBINING DOT ABOVE, the fully decomposed text will match
the composed text and therefore be a better representation of case folding.


Antoine




RE: UCD 3.1, Final Beta - Case folding

2001-03-06 Thread Carl W. Brown

Antoine,

One difference between upper/lower case shifting and case folding is that case folding 
is locale-less.

This is the same as the upper case then lower case shift in a locale that has no 
special locale rules such as English or French.

You cannot just remove accents, especially in a locale-less function.  Sometimes the 
accent makes it a separate letter.  Removing the ring above the A in Danish would 
probably not create too many mismatches, but it would mess up sorting sequences (A with 
ring above is the last letter in the alphabet).  Your real problem would 
probably be languages like Vietnamese, which has many short words that are 
distinguished by tone marks or the use of different vowels.  These vowels are 
represented by the same letter with different accent marks.
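
To make the Vietnamese point concrete, here is a small sketch of what blanket 
accent removal does.  The helper strip_marks is hypothetical and the word list is 
just an illustration.  It decomposes the text and throws away every combining mark.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop every combining mark (a blunt accent stripper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Six distinct Vietnamese syllables that differ only in their tone marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
print({strip_marks(w) for w in words})

# The Danish ring above goes the same way.
print(strip_marks("\u00c5"))
```

All six words end up as the same string, which is exactly the kind of false match 
a locale-less function should not create.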

Yes, case shifting destroys the Turkish and Azeri ı/I and i/İ relationship.

The case that I was referring to was the Lithuanian lower case dotted i followed by a 
COMBINING DOT ABOVE, which becomes a simple dotless upper case I when shifted.  The 
two-dot lower case i becomes a standard dotless upper case I.  A round trip upper/lower 
case shift in the "lt" locale will remove the COMBINING DOT ABOVE after the i.  This 
is like changing the German sharp-s to "ss" so that it will match "SS" shifted to 
lower case.
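
A sketch of that round trip, under stated assumptions: Python's standard library has 
no locale-specific casing, so lt_upper below is a hypothetical stand-in that applies 
only the one "lt" rule quoted in this thread (remove DOT ABOVE after i when shifting 
to upper case).  The sharp-s lines use Python's built-in mappings.

```python
import re

COMBINING_DOT_ABOVE = "\u0307"


def lt_upper(text):
    """Hypothetical 'lt' upper-casing: drop DOT ABOVE after i, then upper-case."""
    without_dot = re.sub("(?<=i)" + COMBINING_DOT_ABOVE, "", text)
    return without_dot.upper()


def lt_round_trip(text):
    """Shift to upper case with the 'lt' rule, then back to lower case."""
    return lt_upper(text).lower()


# The i followed by COMBINING DOT ABOVE and the plain i meet after the round trip.
print(lt_round_trip("i\u0307") == lt_round_trip("i"))

# The German analogy: sharp-s and "SS" meet at "ss".
print("\u00df".upper().lower() == "SS".lower())
```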

Carl

 





-Original Message-
From: Antoine Leca [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 06, 2001 8:02 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding


Carl W. Brown wrote:
 
 From: Antoine Leca [mailto:[EMAIL PROTECTED]]
 
 Carl W. Brown wrote:
 
  The case folding is locale-less, so it seems to me that it is probably
  better to remove the COMBINING DOT ABOVE after all 'i' / 'I'
  regardless of locale
  to make it work for Lithuanian.  I doubt that this will cause serious
  problems with caseless compares for other locales.
 
 please consider a Turkish text, fully decomposed: there, a dot_above
 U+0307 following an uppercase I U+0049 should certainly *not* be dropped.
 
 This works for Turkish as well.  Case folding folds dotted and dotless i
 into 'i'.

This is where I do not understand.

You are saying that for some Turk, the result of the caseless comparison
will be that ı/I and i/İ will be fully intermixed.

My understanding was that they expect all the ı/I (regardless of case)
to come before all the i/İ. Did I miss something?

Or, viewed from another angle, I was not sure that İstanbul should match
Istanbul in a _Turkish_ caseless search.

OTOH, I am neither a Turkish expert nor an i18n expert, so perhaps caseless
comparisons should ignore all accents and the like (i.e. grouping c and č,
и and й, etc. Perhaps I am overemphasizing, but I hope you will get the idea).


Antoine




Re: UCD 3.1, Final Beta - Case folding

2001-03-06 Thread Antoine Leca

Carl W. Brown wrote:
 
 One difference between upper/lower case shifting and case folding is that case
 folding is locale-less.

Yes, this is something I overlooked.
Thanks for having the patience to teach it to me.

 
 You can not just remove accents especially in a locale-less function.

That was my understanding, and this is the primary reason I answered your
post: I do not see the rationale to remove the  ̇ after i and I, but not
after j.


 The case that I was referring to was the Lithuanian lower case dotted i followed
 by a COMBINING DOT ABOVE which becomes a simple dotless upper case I when shifted.

I understand the rationale when it follows the i, but I fail to follow when it comes
to the I.
I am sorry to be so dumb, but you should take into account that I do not implement the
algorithm and I am just analysing the problem. There are probably very good reasons
to include the I as well, but they presently escape me.

As far as I know, no Lithuanian material is expected to contain the sequence "İ"
(\u0049\u0307). Or am I overlooking something obvious?


Antoine