> So, without some kind of case translation dictionary that can be
> trusted on the particular strings we want to test, can we assume
> that it's not actually a solvable problem? (because, like divide by
> zero, the question isn't valid to start with)

Here's the dictionary: http://unicode.org/Public/UNIDATA/SpecialCasing.txt

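The file defines three case mappings: Lowercase_Mapping, Titlecase_Mapping
and Uppercase_Mapping. The fourth case type, Case_Folding, is defined in the
companion file http://unicode.org/Public/UNIDATA/CaseFolding.txt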

The Unicode Consortium defines a default caseless matching algorithm, and
it's the Case_Folding form we want for an equivalence test. The very first
entry in SpecialCasing.txt is the infamous LATIN SMALL LETTER SHARP S, which
has a case folding form of "ss".

You then take normalization into account, where characters are transformed
into their decomposed forms, so e-acute becomes a separate "e" followed by
the combining acute accent.
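
To make the decomposition concrete, here is a minimal sketch using
java.text.Normalizer from the standard library. The class name NfdDemo and
the helper nfd() are just illustrative names of mine.

```java
import java.text.Normalizer;

class NfdDemo {
    // Canonical decomposition (Normalization Form D) of the input string.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";         // e-acute as a single code point
        String decomposed = nfd(composed);  // "e" followed by U+0301 COMBINING ACUTE ACCENT

        System.out.println(composed.length() + " -> " + decomposed.length()); // prints "1 -> 2"
        System.out.println(decomposed.equals("e\u0301"));                      // prints "true"
    }
}
```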

The default canonical caseless match algorithm is:

  NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

(I'm summarising section 3.13 of the Unicode Standard 5.2)
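
A rough sketch of that formula in Java follows. Big assumption: java.lang.String
has no toCasefold(), so approxCasefold() below stands in for it by upper-casing
and then lower-casing with Locale.ROOT. That handles the sharp s case but is
only an approximation of real case folding; as far as I know ICU4J's
UCharacter.foldCase is the place to look for the real thing. The class and
method names are made up for the example.

```java
import java.text.Normalizer;
import java.util.Locale;

class CaselessMatch {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // ASSUMPTION: approximation of toCasefold(), not the Unicode definition.
    // Upper-casing first expands sharp s to "SS"; lower-casing then gives "ss".
    static String approxCasefold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // NFD(toCasefold(NFD(X))), with the approximate folding above.
    static String key(String s) {
        return nfd(approxCasefold(nfd(s)));
    }

    static boolean matches(String x, String y) {
        return key(x).equals(key(y));
    }

    public static void main(String[] args) {
        System.out.println(matches("Stra\u00DFe", "STRASSE"));      // prints "true"
        System.out.println(matches("\u00C9cole", "e\u0301cole"));   // prints "true"
        System.out.println(matches("resume", "r\u00E9sum\u00E9"));  // prints "false"
    }
}
```

Note that accents still count here: this is an equivalence test that ignores
case and composed/decomposed spelling, nothing more.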

Note that SpecialCasing.txt defines rules for different languages, including
context-sensitive rules such as what happens when one letter follows another.
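
The Turkish dotted and dotless i is the usual illustration of a
language-specific rule. A small sketch using String.toLowerCase(Locale) from
the standard library (LocaleCasingDemo and lower() are names I picked for
the example):

```java
import java.util.Locale;

class LocaleCasingDemo {
    // Lower-case a string using the casing rules of the given language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // English: capital I lower-cases to the ordinary dotted i.
        System.out.println(lower("TITLE", "en").equals("title"));        // prints "true"
        // Turkish: capital I lower-cases to U+0131 LATIN SMALL LETTER DOTLESS I.
        System.out.println(lower("TITLE", "tr").equals("t\u0131tle"));   // prints "true"
    }
}
```

This is also why calling toLowerCase() with the default locale can be a
surprise on a machine configured for Turkish.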

Sorting (collation) is more complicated. Would you believe that in some
languages people expect Z to come after A? Rules for collation are defined
in the separate Unicode Technical Standard #10.
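
As a small illustration of why collation differs from comparing code points,
here is a sketch with java.text.Collator. I've used the English collator only
because its ordering is easy to check by eye; CollationDemo and sortFor() are
names invented for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollationDemo {
    // Return a sorted copy of the words using the collation rules for the locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented characters so that composed and decomposed
        // spellings are collated the same way.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("zebra", "\u00E9clair", "apple");

        // Plain String ordering compares UTF-16 code units, so e-acute lands after z.
        List<String> byCodeUnit = new ArrayList<>(words);
        byCodeUnit.sort(null);
        System.out.println(byCodeUnit);                     // apple, zebra, then the e-acute word

        // The collator puts e-acute next to e, ahead of z.
        System.out.println(sortFor(Locale.ENGLISH, words)); // apple, the e-acute word, then zebra
    }
}
```

Passing a different Locale to sortFor() picks up that language's rules,
to whatever extent the JDK ships collation data for it.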

I would never want to implement these algorithms myself so it's handy that
Java does it for us in the java.text classes (which I'm guessing are derived
from the icu-project).

I'm not an expert by any means - you can learn a lot just by browsing the
first couple of chapters of the (surprisingly readable) Unicode Standard:
http://www.unicode.org/standard/standard.html
