Jill Ramonsky wrote:
> In my experience, there is a performance hit.
>
> I had to write an API for my employer last year to handle some
> aspects of Unicode. We normalised everything to NFD, not NFC
> (but that's easier, not harder). Nonetheless, all the string
> handling routines were not allowed to assume that the input was
> in NFD, but they had to guarantee that the output was. These
> routines, therefore, had to do a "convert to NFD" on every
> input, even if the input were already in NFD. This did have a
> significant performance hit, since we were handling (Unicode)
> strings throughout the app.
>
> I think that next time I write a similar API, I will deal with
> (string+bool) pairs, instead of plain strings, with the bool
> meaning "already normalised". This would definitely speed
> things up. Of course, for any strings coming in from "outside",
> I'd still have to assume they were not normalised, just in case.
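A minimal Python sketch of that (string+bool) idea, just to make it
concrete (the class and method names below are illustrative, not
part of Jill's actual API):

    import unicodedata

    class NString:
        # A (string + bool) pair: the flag records whether the text
        # is already known to be normalised (here, to NFD).
        def __init__(self, text, normalised=False):
            # Strings coming in from "outside" default to "not yet
            # normalised", just in case.
            self.text = text
            self.normalised = normalised

        def nfd(self):
            # Convert only when the flag says we have to; after that,
            # the string can be reused without re-normalising.
            if not self.normalised:
                self.text = unicodedata.normalize('NFD', self.text)
                self.normalised = True
            return self.text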
You could have split the NFD process into two separate steps:

1) decomposition per se;
2) reordering of combining classes.

You could have performed step 1 (which is presumably much heavier
than step 2) only on strings coming from "outside", and step 2 on
every pass.

As a further enhancement, step 2 could be invoked only for
operations which could produce non-canonical order: e.g. when
concatenating strings, but not when trimming them.

To gain even more speed, you could implement an ad-hoc version of
step 2 which only reorders out-of-order characters adjacent to a
specified location in the string (e.g. the joining point of a
concatenation operation). A rough sketch of both versions is in
the P.S. below.

Just my 0.02 euros.

_ Marco
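P.S. Here is a minimal Python sketch of what step 2 and its ad-hoc
variant might look like (the function names are made up, and both
functions assume their input is already fully decomposed):

    import unicodedata

    def canonical_reorder(s):
        # Step 2 on its own: stable-sort each run of combining marks
        # (canonical combining class > 0) by combining class, leaving
        # the starters (class 0) where they are.
        chars = list(s)
        i = 0
        while i < len(chars):
            if unicodedata.combining(chars[i]) == 0:
                i += 1
                continue
            j = i
            while j < len(chars) and unicodedata.combining(chars[j]) != 0:
                j += 1
            # sorted() is stable, which canonical ordering requires
            chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
            i = j
        return ''.join(chars)

    def reorder_at_join(left, right):
        # Ad-hoc step 2 for concatenation: if both halves are already
        # in canonical order, only the run of combining marks that
        # straddles the joining point can be out of order, so it is
        # enough to reorder that one run.
        s = left + right
        i = len(left)
        while i > 0 and unicodedata.combining(s[i - 1]) != 0:
            i -= 1
        j = len(left)
        while j < len(s) and unicodedata.combining(s[j]) != 0:
            j += 1
        return s[:i] + ''.join(sorted(s[i:j],
                                      key=unicodedata.combining)) + s[j:]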