On Wed, 23 May 2012 10:35:46 -0700 Markus Scherer <[email protected]> wrote:
> On Tue, May 22, 2012 at 2:22 PM, Richard Wordingham <[email protected]> wrote:
>
> I found the code that computes the case bits (2 bits for
> lower/mixed/upper) for building ICU tailorings. Search for
> "getCaseBits" in
> Java:
> main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java
> <http://bugs.icu-project.org/trac/browser/icu4j/trunk/main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java>
> C++:
> source/i18n/ucol_bld.cpp
> <http://bugs.icu-project.org/trac/browser/icu/trunk/source/i18n/ucol_bld.cpp>
>
> Sadly, this code looks fishy. I just submitted
> http://bugs.icu-project.org/trac/ticket/9337

While we're picking on that poor routine - it looks as though it could
come unstuck with kana in the supplementary planes - the Kana
Supplement, and possibly also the Enclosed Ideographic Supplement. Do
you want a comment on that added to the ticket, or does that issue
deserve a whole ticket to itself?

Comment 2 in http://bugs.icu-project.org/trac/ticket/9337 seems to be
the answer to my opening question - the case for caseFirst and
caseLevel tailorings is defined, in the absence of non-parametric
tailorings, by FractionalUCA.txt.

Is there a definition of the precise relationship between DUCET and
FractionalUCA.txt, or does FractionalUCA.txt define the relationship?
I presume FractionalUCA.txt takes precedence over UCA_Rules.txt. They
do differ - the file FractionalUCA.txt assigns <U+0FB2, U+034F, U+0F71>
and <U+0FB2, U+0F71> the same 3-level weights, but UCA_Rules.txt
assigns them a tertiary difference. I've reported that in formal
Unicode feedback.

> It is also clear that the CLDR UCA/DUCET table used in ICU
> (FractionalUCA.txt) is built with different code for the case bits
> that works for supplementary characters.

A further wrinkle is that case seems more a property of collation
elements than of characters. I haven't checked that one can read back
from case assignments in FractionalUCA.txt to DUCET. (In general, there
need not be an element-to-element mapping between collation *elements*
for equivalent UCA-compliant collations.)

At present, the primarily non-ignorable collation elements of a
character of general category Lt are an uppercase collation element
followed by a lowercase collation element. As you've said, no mixed
case in the root locale.

Richard.
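
To illustrate the supplementary-plane concern raised above, here is a
minimal sketch. It is a hypothetical illustration, not the ICU
getCaseBits code: it assumes a failure mode in which kana are detected
by testing individual UTF-16 code units against BMP ranges. The
function names and range constants are invented for the example. A Kana
Supplement character such as U+1B000 is stored in UTF-16 as a surrogate
pair, so neither code unit falls in the BMP kana range.

```python
# Hypothetical illustration only; this is NOT the ICU getCaseBits code.

# BMP Hiragana and Katakana blocks (U+3040..U+30FF).
BMP_KANA = range(0x3040, 0x3100)
# Kana Supplement block (U+1B000..U+1B0FF), in a supplementary plane.
KANA_SUPPLEMENT = range(0x1B000, 0x1B100)


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def has_kana_by_code_unit(text):
    """Assumed failure mode: test each 16-bit unit against BMP ranges only."""
    return any(unit in BMP_KANA for unit in utf16_code_units(text))


def has_kana_by_code_point(text):
    """Test whole code points, including the Kana Supplement."""
    return any(ord(ch) in BMP_KANA or ord(ch) in KANA_SUPPLEMENT
               for ch in text)


if __name__ == "__main__":
    archaic_e = "\U0001B000"  # a Kana Supplement code point
    print([hex(u) for u in utf16_code_units(archaic_e)])
    print(has_kana_by_code_unit(archaic_e), has_kana_by_code_point(archaic_e))
```

Whether the real routine fails in this way is exactly what would need
checking against the source; the sketch only shows why a per-code-unit
test cannot see characters outside the BMP.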

