There's one other issue that should be considered at some stage: normalization, and the 
fact that a single "character" can be constructed from several code points 
(a base letter plus combining marks such as acute accents).
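
To make the point concrete, here is a minimal sketch in Python (used only for illustration, since its standard `unicodedata` module exposes the normalization forms directly; it is not the language discussed in this thread). It shows "é" written once as a single code point and once as "e" plus a combining acute, and how NFC and NFD map one spelling onto the other. The variable names are made up for the example.

```python
import unicodedata

# The same user-perceived character, spelled two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)

# Without normalization the two strings differ in length and compare unequal.
same_without_normalization = (composed == decomposed)

# NFC composes where possible; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed), same_without_normalization)
print(nfc == composed, nfd == decomposed)
```

Comparing strings only after bringing both to the same normalization form is what makes the two spellings compare equal.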

This is my next little project. I may build on Steve's work, but that isn't 
necessary; dchar is enough as a base, I guess.


Hi Denis, you might want to consider helping us out.

We have a feature-complete implementation of Unicode normalization, case folding, and concatenation. It passes all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. The code was part of a bigger project that we have stopped working on.

We feel that the Unicode normalization part might be useful to others, so we are considering releasing it under an open source license. Before we can do so, we have to clean things up a bit. Some open issues are:

a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, and some bigger issues such as more efficient data storage).

b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises).

c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation).

Since we have stopped working on the bigger project, we haven’t made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
