Hi,
I'm currently experimenting with various trade-offs for Unicode normalisation code. 
Any comments on these (particularly of the "that's insane, here's why, stop now!" 
variety) would be welcome.

The first is an optimisation of speed over size. Rather than perform the decomposition 
as a recursive operation, the necessary data is stored to do it in a single pass. For 
example, rather than compute <U+212B> -> <U+00C5> -> <U+0041, U+030A> recursively, one 
can store the data to compute <U+212B> -> <U+0041, U+030A> directly. This reduces the 
amount of work needed to decompose each character, and has the further benefit that if 
there are no trailing combining characters (that is, if the next character is a starter) 
then no re-ordering is required.
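
As a rough illustration, here is a minimal sketch in Python (made-up table and 
function names, not the actual implementation): the fully recursive decomposition 
is stored directly, so lookup needs only one pass per character.

    # Instead of U+212B -> U+00C5 -> U+0041, U+030A the table stores the
    # final fully decomposed form, so no recursion is needed at run time.
    FULL_DECOMP = {
        0x212B: (0x0041, 0x030A),  # ANGSTROM SIGN
        0x00C5: (0x0041, 0x030A),  # LATIN CAPITAL LETTER A WITH RING ABOVE
    }

    def decompose(text):
        """Single-pass canonical decomposition using the precomputed table."""
        out = []
        for ch in text:
            out.extend(FULL_DECOMP.get(ord(ch), (ord(ch),)))
        # Canonical re-ordering is only needed where non-starters are adjacent;
        # if each decomposition is followed by a starter it can be skipped.
        return out

    assert decompose("\u212B") == [0x0041, 0x030A]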

The second is an optimisation of both speed and size, with the disadvantage that the 
data cannot be shared between NFC and NFD operations (which is perhaps a reasonable 
trade-off for web code that might only need the NFC code to be linked). In this version, 
decompositions of stable codepoints are omitted from the decomposition data. For 
example, following the decomposition <U+0104> -> <U+0041, U+0328>, no character that is 
unblocked from the U+0041 can combine with it, so there is no circumstance in which the 
pair will not be recombined to U+0104; dropping that decomposition from the data 
therefore does not affect NFC. (The relevant data would still have to be in the 
composition table, as the sequence <U+0041, U+0328> might occur in the source code.)
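
To make that concrete, a minimal sketch in Python (hypothetical table names, not 
real implementation data): the stable decomposition of U+0104 is left out of the 
NFC-only decomposition table, while the composition table still carries the pair.

    # NFC-only decomposition data: U+0104 is deliberately absent because its
    # decomposition is stable and would always be recombined anyway.
    NFC_DECOMP = {
        0x212B: (0x0041, 0x030A),  # singleton: still needed so NFC gives U+00C5
    }

    # The composition table still needs the pair, since <U+0041, U+0328> may
    # appear in the input and must compose to U+0104.
    COMPOSE = {
        (0x0041, 0x0328): 0x0104,  # A + COMBINING OGONEK
        (0x0041, 0x030A): 0x00C5,  # A + COMBINING RING ABOVE
    }

With the data laid out this way, NFC of <U+0104> simply passes the character through 
untouched, while NFC of <U+0041, U+0328> still composes via the second table; only 
NFD would miss the dropped entry, which is why the two forms can no longer share data.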




