On Thu, 17 May 2012 21:32:19 -0700 Markus Scherer <markus....@gmail.com> wrote:
> Ok, but assuming we didn't add 0FB2+0F71, why can't we add the
> contraction 0FB2+0F81 and have the 0334 and any other non-starter be
> handled via discontiguous matching?

Time for me to make a pronouncement on collation in FCD from my ivory tower.

First, I need some notation. For a string S, uni(S) is the single character canonically equivalent to it. If there are multiple such characters, uni(S) is selected arbitrarily but deterministically, e.g. the first such character in code point order. If there is no such character, the notation uni(S) is invalid. I am also assuming that the set of contractions for use with normalisation is automatically subjected to canonical closure.

Up to UCA 6.1.0 (UTS#10 Version 24), there are two modes of contraction identification - contiguous and discontiguous. When working with FCD strings rather than NFD strings (i.e. with normalisation switched off), there are therefore various types of contractions.

2-element contractions in FCD can be split into contiguous and discontiguous contractions. Given an NFD contraction A+B+C, uni(<A,B>)+C is a discontiguous FCD contraction. If there is also an NFD contraction A+B, then A+uni(<B,C>) is also a *discontiguous* FCD contraction. However, if there is no NFD contraction A+B, then A+uni(<B,C>) is a *contiguous* FCD contraction. It can only be applied to a subsequence <A, uni(<B,C>)>, never to a subsequence <A, X, uni(<B,C>)>.

For example, in DUCET 6.1.0 (and earlier), there is an NFD contraction 0FB2+0F71+0F80, but no contraction 0FB2+0F71. Consequently, 0FB2+uni(<0F71,0F80>), i.e. 0FB2+0F81, although listed in the DUCET 6.1.0 file allkeys.txt, is only a *contiguous* FCD contraction. Therefore it has no effect on the collation of <0FB2, 0334, 0F81>.

Blocking also changes subtly when one proceeds from NFD to FCD. In NFD, B blocks C if and only if ccc(B) = ccc(C), ccc(B) = 0 or ccc(C) = 0.
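As a minimal sketch of the NFD rule just stated: the function name blocks_nfd is purely illustrative, and the ccc values are assumed to come from Python's standard unicodedata module, which follows whatever Unicode version the interpreter ships rather than 6.1.0 specifically.

```python
import unicodedata


def blocks_nfd(b: str, c: str) -> bool:
    """Return True if character b blocks character c under the NFD rule:
    ccc(b) == ccc(c), or ccc(b) == 0, or ccc(c) == 0.

    Hypothetical helper for illustration; only meaningful when <b, c>
    occurs in an NFD string.
    """
    ccc_b = unicodedata.combining(b)
    ccc_c = unicodedata.combining(c)
    return ccc_b == ccc_c or ccc_b == 0 or ccc_c == 0


# U+0F72 and U+0F80 share a combining class, so the first blocks the second.
same_class = blocks_nfd("\u0f72", "\u0f80")

# U+0334 and U+0F71 have different non-zero classes, so no blocking occurs.
different_class = blocks_nfd("\u0334", "\u0f71")
```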
Equivalently, B does *not* block C if and only if B and C are distinct and <B,C> and <C,B> are canonically equivalent. For FCD, we must use the latter definition. Additionally, the concept is only defined if <B,C> is FCD. Determining this in the general case is not quick. However, for Unicode 6.1.0 this can be greatly simplified by replacing the ccc look-up function by eccc where:

    eccc(uni(<0F71,x>)) = ccc(x)
    eccc(x) = ccc(x) otherwise

and then using the first form of the NFD definition of blocking.

This simplification could be defeated by the addition of new non-singleton decompositions for characters with non-zero ccc, which Unicode is about to promise not to add, or the addition of new characters with the same canonical combining class as U+0F71.

The nasty complication is determining what the contractions for FCD processing are from the contractions for NFD processing. Sometimes a finite set for NFD processing expands to an infinite set for FCD processing.

Richard.