> > Hi,
> >
> > I'm currently experimenting with various trade-offs for Unicode
> > normalisation code. Any comments on these (particularly of the "that's
> > insane, here's why, stop now!" variety) would be welcome.

> You might want to look at, if not even use, the ICU open-source
> implementation:
>
> http://oss.software.ibm.com/icu/
> http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

I did, but when I started this I was more interested in simply comparing
various optimisations as a study of the related techniques. However I
recently hit a practical need for such code for another task, and while
it's nice that I've already got a bunch of the "work" code done as "fun"
code, maybe I should just use ICU...

> > The second is an optimisation of both speed and size, with the
> > disadvantage that data cannot be shared between NFC and NFD operations
> > (which is perhaps a reasonable trade in the case of web code which might
> > only need NFC code to be linked). In this version decompositions of
> > stable codepoints are omitted from the decomposition data. For example,
> > since following the decomposition <U+0104> -> <U+0041, U+0328> there can
> > be no character that is unblocked from the U+0041 that will combine with
> > it, there is no circumstance in which they will not be recombined to
> > U+0104, and hence dropping that decomposition from the data will not
> > affect NFC (the relevant data would still have to be in the composition
> > table, as the sequence <U+0041, U+0328> might occur in the source code).

> Sounds possible and clever. As far as I remember, ICU uses the
> normalization quick check flags (Unicode properties) to determine much of
> this, and should achieve the same in most cases.

The above would supplement use of quick check - indeed it would be a way of
implementing the concept of "stable codepoints" that the UTR suggests using
with quick check.
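
Roughly what I have in mind, as a sketch only: ccc() and nfc_qc() below are
made-up names standing in for whatever table lookup the implementation ends
up using (they are not ICU's API).

    #include <cstddef>

    enum QcResult { QC_YES, QC_NO, QC_MAYBE };

    // Hypothetical UCD property lookups:
    //   ccc(c)    - Canonical_Combining_Class of codepoint c
    //   nfc_qc(c) - NFC_Quick_Check value of codepoint c
    extern unsigned char ccc(unsigned int c);
    extern QcResult nfc_qc(unsigned int c);

    // NFC quick-check scan along the lines of UAX #15.  A codepoint with
    // ccc == 0 and nfc_qc == QC_YES is "stable": nothing that follows can
    // interact with anything before it, so it bounds the span that would
    // need re-normalising after a QC_MAYBE, and (as argued above) its own
    // decomposition need never be consulted when producing NFC.
    QcResult quick_check_nfc(const unsigned int *text, std::size_t len)
    {
        unsigned char last_ccc = 0;
        QcResult result = QC_YES;

        for (std::size_t i = 0; i < len; ++i) {
            unsigned int c = text[i];
            unsigned char cc = ccc(c);

            // Combining marks out of canonical order: definitely not NFC.
            if (cc != 0 && last_ccc > cc)
                return QC_NO;

            switch (nfc_qc(c)) {
            case QC_NO:
                return QC_NO;       // definitely not NFC
            case QC_MAYBE:
                result = QC_MAYBE;  // only a full normalisation pass can decide
                break;
            default:                // QC_YES: nothing to do
                break;
            }
            last_ccc = cc;
        }
        return result;
    }

A string that comes back QC_MAYBE should then only need re-normalising
between the nearest stable codepoints on either side of the offending
characters, and it's the same notion of stability that makes it safe to
drop those codepoints' decompositions from NFC-only data.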

