On Tue, May 22, 2012 at 2:22 PM, Richard Wordingham < [email protected]> wrote:
> > > > I can dig up the ICU code that computes the
> > > > collation case bits for a string.
>
> It would be helpful. I can't see well enough how the data gets in.
>

I found the code that computes the case bits (2 bits for lower/mixed/upper) for building ICU tailorings. Search for "getCaseBits" in

Java: main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java
<http://bugs.icu-project.org/trac/browser/icu4j/trunk/main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java>
C++: source/i18n/ucol_bld.cpp
<http://bugs.icu-project.org/trac/browser/icu/trunk/source/i18n/ucol_bld.cpp>

Sadly, this code looks fishy. I just submitted http://bugs.icu-project.org/trac/ticket/9337

It is also clear that the CLDR UCA/DUCET table used in ICU (FractionalUCA.txt) is built with different code for the case bits that works for supplementary characters. For example Deseret small/capital letter long i have the correct case bits in our version of the DUCET but both get "lower case" bits when tailoring them.

> &™=™ seems to change U+2122 TRADE MARK SIGN from <compat> lowercase
> tertiary weight tagged as lower case to <compat> lowercase
> tertiary weight tagged as upper case! As a consequence, when
> CaseFirst=uppercase is selected, it suddenly sorts before the 2-letter
> string 'TM'! This seems to be because its decomposition mapping as
> <TM> is examined.
>

Yes, the first step in getCaseBits() is to normalize to NFKD. However, this sets the case bits, not the tertiary weight. They are separate in our implementation.

With default collation options, the case bits get ignored (masked away).
With "case level on", they get moved into a separate level between secondary & tertiary.
With "case first" they are retained in the tertiary-weight byte so that case differences trump other tertiary differences.

> On the other hand, &\ua7f8=\ua7f8 has no effect on the sorting of
> U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE, which continues to be
> sorted as lower case.

Apparently true, but I don't understand why. I would have to try this in the debugger. The current getCaseBits() should get the "upper case" bits from the NFKD version U+0126.

> I am beginning to believe that it is impossible for ICU users to tailor
> U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE to be upper case!
>

You cannot explicitly determine the case bits, only the relative tertiary weights. The case bits are computed.

markus
--
Google Internationalization Engineering
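
P.S. To make the NFKD point concrete, here is a rough Java sketch of deriving a lower/mixed/upper classification from the NFKD form of a string. It is not the ICU getCaseBits() code: the class, enum and method names are made up for illustration, it uses only java.text.Normalizer and java.lang.Character, and treating strings with no cased letters as "lower" is an assumption. It iterates by code point, so supplementary characters such as the Deseret letters are classified as well.

```java
import java.text.Normalizer;

// Illustrative sketch only; names are hypothetical and this is not ICU's implementation.
class CaseBitsSketch {
    enum CaseBits { LOWER, MIXED, UPPER }

    // Classify a string by the case of the letters in its NFKD form.
    static CaseBits caseBits(String s) {
        String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
        boolean sawUpper = false;
        boolean sawLower = false;
        // Walk by code point so that supplementary characters are not split
        // into surrogate halves (which have no case of their own).
        for (int i = 0; i < nfkd.length(); ) {
            int cp = nfkd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isUpperCase(cp)) {
                sawUpper = true;
            } else if (Character.isLowerCase(cp)) {
                sawLower = true;
            }
        }
        if (sawUpper && sawLower) {
            return CaseBits.MIXED;
        }
        // Assumption: strings without any cased letters count as LOWER.
        return sawUpper ? CaseBits.UPPER : CaseBits.LOWER;
    }

    public static void main(String[] args) {
        // U+2122 TRADE MARK SIGN has the compatibility decomposition "TM".
        System.out.println(caseBits("\u2122"));
        // U+A7F8 decomposes to U+0126 if the runtime's Unicode data includes it.
        System.out.println(caseBits("\uA7F8"));
        // U+10400 DESERET CAPITAL LETTER LONG I, a supplementary character.
        System.out.println(caseBits(new String(Character.toChars(0x10400))));
    }
}
```

With this approach U+2122 comes out as upper case simply because its decomposition "TM" consists of capital letters, which matches the behavior described above.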

