On Thu, 17 May 2012 21:32:19 -0700 Markus Scherer <markus....@gmail.com> wrote:
> On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham
> <richard.wording...@ntlworld.com> wrote:
>
> > As I've already said, DUCET 6.1.0 omits a contraction for
> > 0FB2+0F71, and so
> > CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71),
> > and a strictly non-normalising tailoring therefore needs a
> > contraction for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip
> > the 0F80 from 0F81 and (ii) prevent the contraction 0FB2+0F81.

> Ok, but assuming we didn't add 0FB2+0F71, why can't we add the
> contraction 0FB2+0F81 and have the 0334 and any other non-starter be
> handled via discontiguous matching?

Because then we wouldn't have DUCET 6.1.0, but instead would probably
have DUCET 6.2.0.

> And assuming we do add 0FB2+0F71 as requested in L2/12-131R, do we
> need infinite overlap contractions? See this
> spreadsheet<https://docs.google.com/spreadsheet/pub?key=0Ag3w_MjvUEoRdFVabUR5elltX3pObXNYRnV5VWNiRGc&output=html>.

I've started the process of requesting the four 'overlap'
contractions. I believe we won't need an infinity of overlap
contractions if we add 0FB2+0F71. But we're then talking about DUCET
6.2.0, which doesn't yet exist.

> lccc(0F73) = ccc(0F71) = 129
>
> rccc(0F73) = ccc(0F72) = 130
>
> The DUCET has the contraction 0F71+0F72, and we should find a
> discontiguous match on <0F71, 0F71, 0F71, 0F72> skipping the two
> middle 0F71. That string is equivalent to the FCD-passing string
> <0F71, 0F71, 0F73> but there is no 0F72 in sight there to complete
> the match if we don't modify the string.

But if we have the implementation-generated contractions for 0F71+0F73
and 0F71+0F73+0F72 (and the other pairs based on pairs of vowels from
0F72, 0F74 and 0F80), and 0F73 (and the other long vowels) are not
blocked by 0F71, we're OK from UCA 6.1.0 back at least as far as UCA
4.1.0. (A collation has to cite a UCA/DUCET version to be fully
specified!)
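
The combining-class values and the equivalence quoted above can be
checked with a short script. This is only a minimal sketch in Python
using the standard unicodedata module. The names lccc and rccc follow
the notation in the quoted text, passes_fcd is a hypothetical helper
written for this illustration, and the FCD test is the usual "lead ccc
must not be lower than the previous trail ccc, unless it is zero"
check, assumed rather than taken from any particular collation
implementation.

```python
import unicodedata


def lccc(ch: str) -> int:
    """ccc of the first code point of ch's canonical decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])


def rccc(ch: str) -> int:
    """ccc of the last code point of ch's canonical decomposition."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])


def passes_fcd(s: str) -> bool:
    """Illustrative FCD check (an assumption of this sketch): fail when a
    character's non-zero lead ccc is lower than the previous trail ccc."""
    prev = 0
    for ch in s:
        lead = lccc(ch)
        if lead != 0 and prev > lead:
            return False
        prev = rccc(ch)
    return True


# The two strings discussed in the quoted paragraph.
spelled_out = "\u0F71\u0F71\u0F71\u0F72"
with_composite = "\u0F71\u0F71\u0F73"

print(lccc("\u0F73"), rccc("\u0F73"))
print(passes_fcd(spelled_out), passes_fcd(with_composite))
print(unicodedata.normalize("NFD", spelled_out)
      == unicodedata.normalize("NFD", with_composite))
```

Both strings pass this FCD check and have the same NFD form, yet only
the first contains a literal 0F72 for a discontiguous match on
0F71+0F72 to find, which is the difficulty described above.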

Now, these contractions are for non-normalised operation, so the lack
of 0F71+0F71 is probably legal beyond UCA 6.1.0; non-normalised
collations have to work for FCD, they don't have to be well-formed.

> If we cannot find a way to handle this with a finite (actually, small)
> amount of data, then we either have to decompose those three Tibetan
> composite vowels before they reach the core collation code, or,
> frankly, we just document a limitation for ICU and point to the fact
> that the use of these three characters is
> "discouraged"<http://unicode.org/charts/PDF/U0F00.pdf> and they don't
> occur in any normalized text (e.g., NFC).
>
> The more I think about these the more I believe I could live with
> such a limitation. If we could get our code to support all of UCA,
> provide a dozen runtime attributes, compare strings and return two
> kinds of sort keys, be fast, and deliver correct results on all FCD
> input except if these three characters are involved, I would be quite
> happy.

Solve the Danish blemish before dismissing Tibetan. The solution to
both might be to decompose certain (generally collation-dependent)
characters on FCD input.

DUCET 6.2.0 will also need infinitely many contractions if another
combining character is added with CCC equal to 129.

Richard.