Re: locale-aware string comparisons

2013-01-19 Thread Richard Wordingham
On 29 December 2012, James Cloos asked:
 
>> Given (just) the data in 10646, Unicode and cldr, are there any
>> locales where a case-insensitive match should be different than a
>> case-preserving match of the results of lower-casing the two
>> strings?

On Mon, 31 Dec 2012 23:29:48, "Whistler, Ken" wrote:

> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR
> may have to jump in here, but while locales clearly *are* in the
> scope of LDML and CLDR, there is currently little if anything they
> have to say about specific case mapping rules.

Mark Davis has answered this in part.  However, there is one set of
differences that has not been mentioned at all: digraphs treated as
letters, e.g. in Welsh and Danish.  The key problem with these,
especially with "ng" in Welsh (where g < ng < h), is that sometimes the
sequence is a digraph and sometimes it is not.  In camel case words (a
good example for Welsh is a Scottish surname such as McHenry, where 'ch'
would be a digraph in Welsh but obviously is not in this name), digraphs
do not (exceptions, anyone?) straddle the case-marked boundaries.
Accordingly, in Welsh we have 'ce' < 'ci' < 'ch', 'Ce' < 'Ci' < 'Ch',
'CE' < 'CI' < 'CH', but 'cE' < 'cH' < 'cI'.  A solution, if you care
greatly about correctness (CLDR does not), is to preprocess sequences of
lower case followed by upper case by inserting CGJ, i.e. U+034F
COMBINING GRAPHEME JOINER, between them.  As far as I am aware, this only
affects sequences of general category Ll followed by Lu.  (I haven't
checked CLDR for special collation rules for any sequences of Ll followed
by Lu, so do check before using my proposed solution.)
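
The preprocessing step described above could be sketched roughly as
follows.  This is only an illustration: the function name
mark_case_boundaries is made up here, and it assumes that the collator
applied afterwards treats CGJ as blocking the digraph contraction.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def mark_case_boundaries(text: str) -> str:
    """Insert CGJ between a lowercase letter (Ll) and a following
    uppercase letter (Lu), leaving all other sequences untouched."""
    out = []
    prev_cat = None
    for ch in text:
        cat = unicodedata.category(ch)
        if prev_cat == "Ll" and cat == "Lu":
            out.append(CGJ)
        out.append(ch)
        prev_cat = cat
    return "".join(out)


# Show the result with non-ASCII characters escaped so the CGJ is visible.
print(ascii(mark_case_boundaries("McHenry")))
```

Only the boundary between 'c' and 'H' receives a CGJ; all-lowercase,
all-uppercase and title-case sequences pass through unchanged.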

For most languages, the practical problems are that keyboards do not
provide CGJ and that old rendering systems misrender it.

Richard.



Re: locale-aware string comparisons

2013-01-02 Thread Mark Davis ☕
Agreed.

FYI, for those interested, here is the data file I generated with the
approaches A, B, C as discussed.

https://docs.google.com/a/google.com/spreadsheet/pub?key=0AqRLrRqNEKv-dGk0RHVoQWN6OGw1TVFNOWRaMEJfWEE&gid=0


Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*


On Wed, Jan 2, 2013 at 11:07 AM, Shawn Steele wrote:

> I'd try to avoid making a dependency where case mapping needs to be the
> same as case insensitive comparisons.
>
> I'd either always case fold then compare, or always compare case
> insensitive.
>
> -Shawn
>
> -Original Message-
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
> Behalf Of James Cloos
> Sent: Tuesday, January 1, 2013 5:43 PM
> To: Mark Davis ☕
> Cc: Whistler, Ken; unicode@unicode.org
> Subject: Re: locale-aware string comparisons
>
> >>>>> "MD" == Mark Davis ☕  writes:
>
> MD> All of these are different, all of them still have over 200
> MD> differences from either compare(lower(x),lower(y)) or compare(upper
> MD> (x),upper(y))
>
> What about, then:
>
>   compare(lower(x),lower(y)) || compare(upper(x),upper(y))
>
> Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL:
>
>   LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y)
>
> Would that cover all of the outliers?
>
> -JimC
> --
> James Cloos  OpenPGP: 1024D/ED7DAEA6


RE: locale-aware string comparisons

2013-01-02 Thread Shawn Steele
I'd try to avoid making a dependency where case mapping needs to be the same as 
case insensitive comparisons.

I'd either always case fold then compare, or always compare case insensitive.
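
A minimal sketch of the "case fold, then compare" option, assuming
Python's str.casefold() as the folding step and following the shape of
the canonical caseless matching definition in the Unicode Standard
(normalize, fold, normalize again).  The helper names are illustrative
only.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Return NFD(casefold(NFD(s))), a key for canonical caseless matching."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings without going through upper() or lower()."""
    return caseless_key(a) == caseless_key(b)


print(caseless_equal("Stra\u00DFe", "STRASSE"))
print(caseless_equal("\u0130", "I\u0307"))
```

The comparison never depends on the case mapping functions, which is
the dependency being warned against above.  It is locale-independent, so
it does not apply the Turkish or Lithuanian tailorings.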

-Shawn

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of James Cloos
Sent: Tuesday, January 1, 2013 5:43 PM
To: Mark Davis ☕
Cc: Whistler, Ken; unicode@unicode.org
Subject: Re: locale-aware string comparisons

>>>>> "MD" == Mark Davis ☕  writes:

MD> All of these are different, all of them still have over 200 
MD> differences from either compare(lower(x),lower(y)) or compare(upper
MD> (x),upper(y))

What about, then:

  compare(lower(x),lower(y)) || compare(upper(x),upper(y))

Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL:

  LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y)

Would that cover all of the outliers?

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6







Re: locale-aware string comparisons

2013-01-01 Thread James Cloos
>>>>> "MD" == Mark Davis ☕  writes:

MD> All of these are different, all of them still have over 200
MD> differences from either compare(lower(x),lower(y)) or compare(upper
MD> (x),upper(y))

What about, then:

  compare(lower(x),lower(y)) || compare(upper(x),upper(y))

Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL:

  LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y)

Would that cover all of the outliers?
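
For plain equality testing, the combined test can be tried against case
folding with a few lines of Python.  This is a sketch using Python's
locale-independent str.lower(), str.upper() and str.casefold(), not the
collation-based comparisons Mark counted, so it says nothing about his
200+ differences; or_rule and folded are names invented for the
illustration.

```python
def or_rule(x: str, y: str) -> bool:
    """The proposed test: equal after lowercasing OR equal after uppercasing."""
    return x.lower() == y.lower() or x.upper() == y.upper()


def folded(x: str, y: str) -> bool:
    """Equality after Unicode case folding."""
    return x.casefold() == y.casefold()


pairs = [
    ("\u00DF", "SS"),   # LATIN SMALL LETTER SHARP S vs "SS"
    ("\u1E9E", "ss"),   # LATIN CAPITAL LETTER SHARP S vs "ss"
]
for x, y in pairs:
    print(ascii(x), ascii(y), or_rule(x, y), folded(x, y))
```

The first pair is caught by the uppercase half of the test.  The second
pair is not caught by either half, although both strings fold to "ss",
so at least for equality the combined test does not cover everything
that case folding does.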

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6



Re: locale-aware string comparisons

2013-01-01 Thread Mark Davis ☕
> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR

James,
Even without locale differences, the situation is a bit tricky. Assuming
that str_tolower() and str_toupper() were straightforwardly defined in
terms of the (full) Unicode case mappings, there is still the issue that
the DUCET does not define a caseless compare. It puts case together with
other variants into a set of "Level 3" data. There are 3 approaches one can
take with a strcasecmp() straightforwardly based on LDML. I generated some
numbers for these with a quick test program, but note that they use the
CLDR root locale, which has a few changes from DUCET.

A. Define it to be just comparing after Unicode case folding.

B. Use DUCET and only compare according to Level 1 & 2. That ignores case,
but also some other features.

C. Use the case level as defined in LDML, plus Levels 1 & 2.

All of these are different, and all of them still have over 200 differences
from either compare(lower(x),lower(y)) or compare(upper(x),upper(y)). These
are mostly because of the special weighting of compatibility variants, or of
the Greek iota subscript. Example:

s < ſ, but upper( s ) = upper( ſ )
// LATIN SMALL LETTER S vs LATIN SMALL LETTER LONG S
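
The long s example, and a similar one for the iota subscript, can be
checked with binary string equality in Python.  This is only a sketch:
it uses Python's built-in full case mappings rather than a collator, so
it illustrates why the lowercase and uppercase comparisons disagree, not
the collation levels in approaches B and C.  The helper name case_views
is made up.

```python
def case_views(a: str, b: str) -> dict:
    """Report whether a and b are equal after each kind of conversion."""
    return {
        "upper": a.upper() == b.upper(),
        "lower": a.lower() == b.lower(),
        "fold": a.casefold() == b.casefold(),
    }


# LATIN SMALL LETTER S vs LATIN SMALL LETTER LONG S
print(case_views("s", "\u017F"))

# GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI vs alpha followed by iota
print(case_views("\u1FB3", "\u03B1\u03B9"))
```

In both pairs the strings match after uppercasing and after case
folding but not after lowercasing.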




Mark 
*— Il meglio è l’inimico del bene —*


On Mon, Dec 31, 2012 at 3:29 PM, Whistler, Ken  wrote:

> Well, in answering the question which was actually posed here:
>
> 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because
> 10646 does not define case mapping at all.
>
> 2. The Unicode Standard *does* define case mapping, of course, as well as
> case folding. The relevant details are in Section 3.13 of the standard,
> supported by various data files in the Unicode Character Database. TUS 6.2,
> Section 3.13, p. 117, does define toUpperCase(X) and toLowerCase(X), but
> those are string mapping operations, not directly comparable to Linux (and
> in general Unix) toupper() and tolower(), which are character mapping
> functions. The closer correlates to Linux toupper() and tolower() are
> Unicode's definitions of Uppercase_Mapping(C) and Lowercase_Mapping(C).
> However, there is a significant difference lurking, in that the Unicode
> case mapping definitions are not locale-sensitive. The full case mappings
> do include two conditional sets of mappings (from SpecialCasing.txt) for
> Lithuanian and for Turkish and Azeri, mostly affecting the behavior of the
> dot on "i", but the use of those conditional mappings depends on the
> availability of explicit language context.
>
> This contrasts with the Linux (and in general Unix) toupper() and
> tolower() functions, which in principle, at least, are locale-sensitive,
> depending on the current locale setting, and in particular on whether the
> LC_CTYPE category in the locale has a non-null list of mappings for toupper
> and/or tolower in it.
>
> Perhaps even more importantly, the Unicode Standard does not state
> anything regarding the details of the behavior of the APIs strcasecmp() or
> tolower() or toupper() in libc. Those are the concerns of the C and POSIX
> specs, not the Unicode Standard. Nor could the Unicode Standard really get
> involved in this, precisely because that behavior involves locales, and
> locales are outside the scope of the Unicode Standard.
>
> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may
> have to jump in here, but while locales clearly *are* in the scope of LDML
> and CLDR, there is currently little if anything they have to say about
> specific case mapping rules.
>
> As regards the particulars of the question, I suspect that it would depend
> in part on how strcasecmp(), str_tolower() and str_toupper() are
> implemented (I am assuming string conversion APIs here based on the
> tolower() and toupper() APIs), but there probably *are* instances where the
> results would diverge. The most likely source of trouble would be Turkish
> case mapping. In particular, if you compare U+0130 LATIN CAPITAL LETTER I
> WITH DOT ABOVE to a canonically equivalent sequence of <U+0049, U+0307>,
> there may be conundrums. If strcasecmp() is implemented based on Turkish
> case folding, then strcasecmp( U+0130, <U+0049, U+0307> ) == 0. If
> str_tolower() is based on Turkish case mapping, then str_tolower( U+0130 )
> == <U+0069>, so strcmp( str_tolower( U+0130 ), str_tolower(
> <U+0049, U+0307> ) ) == 0, *but* str_toupper( U+0130 ) == U+0130 and
> str_toupper( <U+0049, U+0307> ) == <U+0049, U+0307>, so strcmp(
> str_toupper( U+0130 ), str_toupper( <U+0049, U+0307> ) ) != 0. The two
> uppercased versions are *canonically* equivalent, but you wouldn't expect a
> strcmp() operation to be checking normalization of strings. So unless the
> implementations of str_tolower() and str_toupper() were doing canonical
> normalization as well as case mapping, you could indeed find some odd edge
> cases for Turkish casing, at least.
>
> --Ken

RE: locale-aware string comparisons

2012-12-31 Thread Whistler, Ken
Well, in answering the question which was actually posed here:

1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 
does not define case mapping at all.

2. The Unicode Standard *does* define case mapping, of course, as well as case 
folding. The relevant details are in Section 3.13 of the standard, supported by 
various data files in the Unicode Character Database. TUS 6.2, Section 3.13, p. 
117, does define toUpperCase(X) and toLowerCase(X), but those are string 
mapping operations, not directly comparable to Linux (and in general Unix) 
toupper() and tolower(), which are character mapping functions. The closer 
correlates to Linux toupper() and tolower() are Unicode's definitions of 
Uppercase_Mapping(C) and Lowercase_Mapping(C). However, there is a significant 
difference lurking, in that the Unicode case mapping definitions are not 
locale-sensitive. The full case mappings do include two conditional sets of 
mappings (from SpecialCasing.txt) for Lithuanian and for Turkish and Azeri, 
mostly affecting the behavior of the dot on "i", but the use of those 
conditional mappings depends on the availability of explicit language context.
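
The difference between a string mapping and a character mapping shows
up as soon as a mapping changes the length of the text.  A small sketch,
assuming Python's str.upper() and str.lower(), which apply the
locale-independent full mappings; mapping_lengths is a hypothetical
helper.

```python
def mapping_lengths(ch: str) -> tuple:
    """Return (code points in upper(ch), code points in lower(ch))."""
    return (len(ch.upper()), len(ch.lower()))


samples = [
    "\u00DF",  # LATIN SMALL LETTER SHARP S
    "\uFB01",  # LATIN SMALL LIGATURE FI
    "\u0130",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
]
for ch in samples:
    up, lo = mapping_lengths(ch)
    print(f"U+{ord(ch):04X}: upper -> {up} code point(s), lower -> {lo}")
```

A function with the shape of toupper() or tolower(), which takes one
character and returns one character, cannot express any mapping whose
result is longer than its input.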

This contrasts with the Linux (and in general Unix) toupper() and tolower() 
functions, which in principle, at least, are locale-sensitive, depending on the 
current locale setting, and in particular on whether the LC_CTYPE category in 
the locale has a non-null list of mappings for toupper and/or tolower in it.

Perhaps even more importantly, the Unicode Standard does not state anything 
regarding the details of the behavior of the APIs strcasecmp() or tolower() or 
toupper() in libc. Those are the concerns of the C and POSIX specs, not the 
Unicode Standard. Nor could the Unicode Standard really get involved in this, 
precisely because that behavior involves locales, and locales are outside the 
scope of the Unicode Standard.

3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may have 
to jump in here, but while locales clearly *are* in the scope of LDML and CLDR, 
there is currently little if anything they have to say about specific case 
mapping rules.

As regards the particulars of the question, I suspect that it would depend in 
part on how strcasecmp(), str_tolower() and str_toupper() are implemented (I am 
assuming string conversion APIs here based on the tolower() and toupper() 
APIs), but there probably *are* instances where the results would diverge. The 
most likely source of trouble would be Turkish case mapping. In particular, if 
you compare U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE to a canonically 
equivalent sequence of <U+0049, U+0307>, there may be conundrums. If 
strcasecmp() is implemented based on Turkish case folding, then strcasecmp( 
U+0130, <U+0049, U+0307> ) == 0. If str_tolower() is based on Turkish case 
mapping, then str_tolower( U+0130 ) == <U+0069>, so strcmp( str_tolower( 
U+0130 ), str_tolower( <U+0049, U+0307> ) ) == 0, *but* str_toupper( U+0130 ) 
== U+0130 and str_toupper( <U+0049, U+0307> ) == <U+0049, U+0307>, so 
strcmp( str_toupper( U+0130 ), str_toupper( <U+0049, U+0307> ) ) != 0. The two 
uppercased versions are *canonically* equivalent, but you wouldn't expect a 
strcmp() operation to be checking normalization of strings. So unless the 
implementations of str_tolower() and str_toupper() were doing canonical 
normalization as well as case mapping, you could indeed find some odd edge 
cases for Turkish casing, at least.
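
The uppercase half of this edge case can be reproduced even without
Turkish tailoring.  The sketch below assumes Python's locale-independent
str.lower() and str.upper(), so the lowercase results differ from the
Turkish ones described above, but the role of normalization is the
same.  The helper same_after is invented for the illustration.

```python
import unicodedata

PRECOMPOSED = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DECOMPOSED = "I\u0307"   # LATIN CAPITAL LETTER I + COMBINING DOT ABOVE


def same_after(mapping, a: str, b: str, normalize: bool = False) -> bool:
    """Binary-compare mapping(a) with mapping(b), optionally after NFC."""
    a, b = mapping(a), mapping(b)
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


print(same_after(str.lower, PRECOMPOSED, DECOMPOSED))
print(same_after(str.upper, PRECOMPOSED, DECOMPOSED))
print(same_after(str.upper, PRECOMPOSED, DECOMPOSED, normalize=True))
```

The lowercased strings compare equal, the uppercased strings do not,
and the uppercased strings compare equal again once both are normalized
before the binary comparison.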

--Ken

> Given (just) the data in 10646, Unicode and cldr, are there any locales
> where a case-insensitive match should be different than a case-preserving
> match of the results of lower-casing the two strings?
> 
> I.e., in terms of locale-aware versions of the typical libc functions,
> should strcasecmp(s1,s2) ever generate different results than
> strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> (By mentioning strcmp() et al, I do not exclude mb or w versions of
> those functions.)
> 
> And to be clear, the question isn't about any specific, existing
> implementation but only about what the 10646, unicode and cldr suite
> of standards have to say on the matter.
> 
> Thanks,
> 
> -JimC
> --
> James Cloos  OpenPGP: 1024D/ED7DAEA6





Re: locale-aware string comparisons

2012-12-29 Thread Philippe Verdy
Case-insensitive searches should not use tolower() or toupper() to convert
strings before comparing them. Yes, cases where the results differ do
exist. They arise because case variants do not always come in simple pairs,
and because conversion to lowercase or uppercase can drop distinctions other
than case (e.g. the final sigma in Greek, some rules for the German Eszett,
or the long s and its ligatures). It would be safer to use case folding,
which does not enforce conversion to lowercase and preserves other
semantics.
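
The final sigma case can be illustrated with a short Python sketch. It
assumes Python's str.lower(), which applies the Final_Sigma rule when
lowercasing, and str.casefold(), which folds both sigma forms to the
same letter. The two helper functions are named only for this example.

```python
CAPITALS = "\u039F\u03A3"    # capital omicron, capital sigma
NON_FINAL = "\u03BF\u03C3"   # small omicron, small (non-final) sigma


def lower_equal(a: str, b: str) -> bool:
    """Compare after lowercasing, as strcmp(tolower(), tolower()) would."""
    return a.lower() == b.lower()


def fold_equal(a: str, b: str) -> bool:
    """Compare after Unicode case folding."""
    return a.casefold() == b.casefold()


print(ascii(CAPITALS.lower()), ascii(NON_FINAL.lower()))
print(lower_equal(CAPITALS, NON_FINAL), fold_equal(CAPITALS, NON_FINAL))
```

Lowercasing the capitals yields a final sigma (U+03C2), so the
lowercased strings differ, while the folded strings are identical.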


2012/12/29 James Cloos 

> Given (just) the data in 10646, Unicode and cldr, are there any locales
> where a case-insensitive match should be different than a case-preserving
> match of the results of lower-casing the two strings?
>
> I.e., in terms of locale-aware versions of the typical libc functions,
> should strcasecmp(s1,s2) ever generate different results than
> strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> (By mentioning strcmp() et al, I do not exclude mb or w versions of
> those functions.)
>
> And to be clear, the question isn't about any specific, existing
> implementation but only about what the 10646, unicode and cldr suite
> of standards have to say on the matter.
>
> Thanks,
>
> -JimC
> --
> James Cloos  OpenPGP: 1024D/ED7DAEA6
>
>


locale-aware string comparisons

2012-12-29 Thread James Cloos
Given (just) the data in 10646, Unicode and cldr, are there any locales
where a case-insensitive match should be different than a case-preserving
match of the results of lower-casing the two strings?

I.e., in terms of locale-aware versions of the typical libc functions,
should strcasecmp(s1,s2) ever generate different results than
strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
(By mentioning strcmp() et al, I do not exclude mb or w versions of
those functions.)

And to be clear, the question isn't about any specific, existing
implementation but only about what the 10646, unicode and cldr suite
of standards have to say on the matter.

Thanks,

-JimC
-- 
James Cloos  OpenPGP: 1024D/ED7DAEA6