On Tue, 15 May 2012 21:33:03 -0700 Markus Scherer <markus....@gmail.com> wrote:
> On Tue, May 15, 2012 at 4:42 PM, Richard Wordingham <
> richard.wording...@ntlworld.com> wrote:
>
> > I am puzzled as to how an implementation can compliantly implement
> > the tailoring of normalisation in the UCA.
>
> I think you mean something like "implement tailorings where
> contractions overlap with decomposition mappings" rather than
> tailoring of normalization.

No.  Brute force use of NFD solves most problems.

> > Can an implementation be said to compliantly implement the
> > tailoring of normalisation if nominally turning it off actually
> > has no effect?  If it can, my puzzlement goes away.
>
> The definition of a tailoring is not the problem. It is supposed to
> work in the expected way with a compliant implementation, regardless
> of how the implementation achieves that.

Section 5.1 of the UCA says that one may have a parametric
normalisation tailoring.  Unfortunately, it is not clear to me how one
demonstrates that a normalisation tailoring of 'off' has or has not
been implemented correctly.  Possibly it is any (necessarily
non-'Unicode compliant') collation that correctly sorts NFD (or is it
FCD?) strings but fails for some other strings.  In which case, is it
necessary for it to fail for at least some strings?  Obviously there
are inequivalent collations that achieve these effects.  (For example,
my blocking test assumes that the string it is working on is in NFD.
If it did not, then the results given an arbitrary string would be
different.)

Now, the concept of a parametric normalisation tailoring could be a
confusion with the concept of having a function interface that
requires that input strings (as strings of codepoints, rather than as
text) be in a suitable format.

> > Does anyone believe they have a compliant normalisation tailoring
> > of DUCET?  Does it work for FCD strings?  Unless I'm very much
> > mistaken, ICU doesn't
> > (http://bugs.icu-project.org/trac/ticket/9323).
>
> I think this might be a duplicate of
> http://bugs.icu-project.org/trac/ticket/8052

Not quite.  I believe pure Tibetan script can be sorted out by adding a
finite number of 'contractions' (I am not sure whether they are valid
for discontiguous contraction).  Ticket 8052 needs a contraction
<U+0FB2 U+0334 U+0F81>, and likewise one for each substring that
behaves like U+0334.  That is an infinite set, so it needs an
algorithmic solution rather than just a bigger table.  (I don't dispute
that solving 8052 is likely to solve 9323.)

However, it would surprise me if the collation behaviour of
<U+0FB2 U+0334 U+0F81> were changed.  In so far as it is linguistically
meaningful, it is an error in DUCET that it doesn't sort the same as
<U+0FB2 U+0F81 U+0334>.  (Of course, Tibetan collation in DUCET is
already very wrong for Tibetan script languages.)

> Maybe I should even modify the ICU normalization FCD code (outside
> collation) to always decompose the Tibetan composite vowels.

Certainly the safest method!  As the precomposed vowels are deprecated,
it even has merit independent of getting collation to work.  Another
alternative would be to process part of a composite character when
forming the collation, but that gets messy.

Richard.
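
For anyone wanting to check the two sequences discussed above, here is
a minimal Python sketch using only the standard unicodedata module.
The helper names is_nfd and is_fcd are hypothetical, chosen for this
illustration; they are not ICU or UCA APIs.  The FCD test is the usual
lead/trail combining-class check, which I am assuming here rather than
quoting from any particular implementation.

```python
import unicodedata


def nfd(s: str) -> str:
    """Full canonical decomposition with canonical reordering."""
    return unicodedata.normalize("NFD", s)


def is_nfd(s: str) -> bool:
    """True if s is already in NFD."""
    return nfd(s) == s


def is_fcd(s: str) -> bool:
    """Assumed FCD check (hypothetical helper, not an ICU function).

    Each code point is decomposed on its own.  The string fails if the
    last code point of one decomposition has a higher combining class
    than a non-zero leading combining class of the next decomposition.
    """
    prev_trail = 0
    for ch in s:
        d = nfd(ch)
        lead = unicodedata.combining(d[0])
        trail = unicodedata.combining(d[-1])
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


a = "\u0fb2\u0334\u0f81"  # <U+0FB2 U+0334 U+0F81>
b = "\u0fb2\u0f81\u0334"  # <U+0FB2 U+0F81 U+0334>

# Both sequences have the same NFD form, so they are canonically equivalent.
print(nfd(a) == nfd(b))
# The first sequence passes the FCD check although it is not in NFD.
print(is_fcd(a), is_nfd(a))
# The second sequence fails the FCD check.
print(is_fcd(b))
```

The point of the sketch is that <U+0FB2 U+0334 U+0F81> passes an FCD
check without being in NFD, so a collator that skips normalisation for
FCD input sees the undecomposed U+0F81.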