Re: CaseFirst and CaseLevel Tailorings of UCA and LDML

Richard Wordingham Wed, 23 May 2012 14:10:52 -0700

On Wed, 23 May 2012 10:35:46 -0700
Markus Scherer <[email protected]> wrote:

> On Tue, May 22, 2012 at 2:22 PM, Richard Wordingham <
> [email protected]> wrote:

> I found the code that computes the case bits (2 bits for
> lower/mixed/upper) for building ICU tailorings. Search for
> "getCaseBits" in

> Java:
> main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java
> <http://bugs.icu-project.org/trac/browser/icu4j/trunk/main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java>

> C++:
> source/i18n/ucol_bld.cpp
> <http://bugs.icu-project.org/trac/browser/icu/trunk/source/i18n/ucol_bld.cpp>

> Sadly, this code looks fishy. I just submitted
> http://bugs.icu-project.org/trac/ticket/9337

While we're picking on that poor routine - it looks as though it could
come unstuck with kana in the supplementary planes - the Kana
Supplement, and possibly also the Enclosed Ideographic Supplement.  Do
you want a comment on that added to the ticket, or does that issue
deserve a whole ticket to itself?
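
A minimal sketch of the kind of failure meant here, assuming (purely for
illustration) a kana test that inspects one UTF-16 code unit at a time
against the BMP Hiragana and Katakana blocks. The helper names are
hypothetical and this is not the logic of getCaseBits itself; it only
shows that supplementary-plane kana arrive as surrogate pairs, which such
a test never matches.

```python
# Hypothetical BMP-only kana check; NOT ICU's actual getCaseBits code.
KANA_BMP_RANGES = [(0x3040, 0x309F), (0x30A0, 0x30FF)]  # Hiragana, Katakana


def is_kana_bmp_only(code_unit):
    """Return True if a single UTF-16 code unit lies in a BMP kana block."""
    return any(lo <= code_unit <= hi for lo, hi in KANA_BMP_RANGES)


def utf16_code_units(ch):
    """Split a character into its UTF-16 code units."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = {
    "U+30A2 (Katakana block, BMP)": "\u30A2",
    "U+1B000 (Kana Supplement)": "\U0001B000",
    "U+1F200 (Enclosed Ideographic Supplement)": "\U0001F200",
}

for label, ch in samples.items():
    units = utf16_code_units(ch)
    seen_as_kana = any(is_kana_bmp_only(u) for u in units)
    print(label, [hex(u) for u in units], "recognised as kana:", seen_as_kana)
```

Only the BMP character is recognised; the two supplementary characters
are each seen as a pair of surrogates.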

Comment 2 in http://bugs.icu-project.org/trac/ticket/9337 seems to be
the answer to my opening question - the case for caseFirst and
caseLevel tailorings is defined, in the absence of non-parametric
tailorings, by FractionalUCA.txt.  Is there a definition of the precise
relationship between DUCET and FractionalUCA.txt, or does
FractionalUCA.txt define the relationship? I presume FractionalUCA.txt
takes precedence over UCA_Rules.txt.  They do differ - the file
FractionalUCA.txt assigns <U+0FB2, U+034F, U+0F71> and <U+0FB2, U+0F71>
the same 3-level weights, but UCA_Rules.txt assigns them a tertiary
difference.  I've reported that in formal Unicode feedback.
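
For reference, the two sequences can be inspected with Python's
unicodedata module. This is only a sketch: it does not reproduce the
weights from either file, it merely confirms that the sequences are not
canonically equivalent, so nothing in normalization forces them to
collate identically. The variable names are illustrative.

```python
import unicodedata

with_cgj = "\u0FB2\u034F\u0F71"    # <U+0FB2, U+034F, U+0F71>
without_cgj = "\u0FB2\u0F71"       # <U+0FB2, U+0F71>


def nfd(text):
    """Canonical decomposition, the form the UCA works from."""
    return unicodedata.normalize("NFD", text)


# If the NFD forms differed only by reordering, the sequences would be
# canonically equivalent and would have to sort identically.
canonically_equivalent = nfd(with_cgj) == nfd(without_cgj)

# Canonical combining classes of each code point in the longer sequence.
combining_classes = [(hex(ord(c)), unicodedata.combining(c)) for c in with_cgj]

print("canonically equivalent:", canonically_equivalent)
print("combining classes:", combining_classes)
```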

> It is also clear that the CLDR UCA/DUCET table used in ICU
> (FractionalUCA.txt) is built with different code for the case bits
> that works for supplementary characters.

A further wrinkle is that case seems more a property of collation
elements than of characters.  I haven't checked that one can read back
from case assignments in FractionalUCA.txt to DUCET.  (In general,
there need not be an element-to-element mapping between collation
*elements* for equivalent UCA-compliant collations.)  At present, the
primarily non-ignorable collation elements of a character of general
category Lt are an uppercase collation element followed by a lowercase
collation element.  As you've said, no mixed case in the root locale.
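
The uppercase-then-lowercase pattern for Lt characters can be illustrated
with the four Latin titlecase digraphs. This sketch uses compatibility
decomposition (NFKD) as a rough stand-in for the sequence of collation
elements; that is an assumption made for illustration, since it reads
neither DUCET nor FractionalUCA.txt. The function name is hypothetical.

```python
import unicodedata

# The Latin titlecase digraphs, all of general category Lt.
TITLECASE_DIGRAPHS = ["\u01C5", "\u01C8", "\u01CB", "\u01F2"]


def letter_case_pattern(ch):
    """General categories of the letters in the NFKD form of ch."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return [
        unicodedata.category(c)
        for c in decomposed
        if unicodedata.category(c).startswith("L")  # skip combining marks
    ]


for ch in TITLECASE_DIGRAPHS:
    print(
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        letter_case_pattern(ch),
    )
```

Each digraph decomposes into an uppercase letter (Lu) followed by a
lowercase letter (Ll), which matches the per-element view of case
described above.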

Richard.
