Well, officially the final bell has rung, marking the end of GSoC.
Meaning it's about time to show the project to the community.
This time around I sadly have some unresolved issues. Some of these are
my fault, others are well-known bugs in Phobos or the compiler.
Still, there is a lot of cool stuff in there that I'd love to tell you about:
- all functions isXXX and toUpper/toLower of the old std.uni interface
suddenly became faster and/or smarter
- an icmp function that does proper case-insensitive string comparison
and matches e.g. German ß (Sulzbacher form) as equal to 'ss' (full
casefolding rules)
- performance maniacs can use the faster/simpler one: sicmp, which maps only
1:1 codepoints (simple casefolding rules)
- extended grapheme cluster support: a decode operation (decodeGrapheme)
& a slightly simpler one, a la std.utf.stride, that only gets the length in
code units (graphemeStride)
- normalization: currently only NFD & NFKD, which have some issues, see below
(and I still need to triple-check the correctness). NFC & NFKC are coming
soon
- decomposition (composition is coming): either Canonical or
Compatibility; it also yields a Grapheme holding the decomposed codepoints
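
To make the list above a bit more concrete, here is a minimal sketch of
the comparison, grapheme and normalization calls. It assumes the names
and signatures under which this work later shipped in Phobos as std.uni
(icmp, sicmp, graphemeStride, decodeGrapheme, normalize!NFD); the 2012
beta may differ in details, and with the standalone uni.d the import
would be "import uni;" instead.

```d
import std.uni; // assumption: the API as later merged into Phobos

void main()
{
    // Full case folding: 'ß' folds to "ss", so icmp reports equality.
    assert(icmp("Rußland", "Russland") == 0);

    // Simple (1:1) folding cannot expand 'ß', so sicmp sees a difference,
    // but it still handles ordinary one-to-one case pairs.
    assert(sicmp("ß", "ss") != 0);
    assert(sicmp("Август", "авгусТ") == 0);

    // 'A' followed by U+030A COMBINING RING ABOVE is one grapheme cluster.
    string city = "A\u030Arhus";

    // graphemeStride only reports the cluster length in code units:
    // 1 for 'A' plus 2 for U+030A in UTF-8.
    size_t len = graphemeStride(city, 0);
    assert(len == 3);
    assert(city[len .. $] == "rhus");

    // decodeGrapheme pops the whole cluster off the front of the input.
    string rest = city;
    Grapheme g = decodeGrapheme(rest);
    assert(g.length == 2); // two codepoints in the cluster
    assert(g[0] == 'A');
    assert(rest == "rhus");

    // NFD splits precomposed U+00C5 into 'A' + U+030A.
    assert(normalize!NFD("\u00C5rhus") == "A\u030Arhus");
}
```

The icmp/sicmp pair shows the trade-off mentioned above: sicmp skips the
1:N expansions, which is what makes it cheaper.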
And last but not least, library users get access to all the power
toys used to construct the above algorithms:
1) codepoint sets with full & fast set ops
2) highly customizable multi-stage lookup table (aka Trie) with
easy helpers to construct optimal multi-level dchar-->bool tables
3) a ton of predefined Unicode sets: see general property, block or
script
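
A rough sketch of how these building blocks fit together. It assumes the
names from the version of this work that later landed in Phobos
(CodepointSet, the "unicode" property lookup, codepointSetTrie); the
2012 beta may spell some of them differently, and the (13, 8) level
split is just one valid choice, not necessarily the optimal one.

```d
import std.uni; // assumption: the API as later merged into Phobos

void main()
{
    // Sets are built from [start, end) codepoint intervals.
    auto letters = CodepointSet('a', 'z' + 1, 'A', 'Z' + 1);
    auto digits  = CodepointSet('0', '9' + 1);

    // Set union and difference.
    auto alnum = letters | digits;
    auto lower = letters - CodepointSet('A', 'Z' + 1);
    assert(alnum['q'] && alnum['7'] && !alnum['_']);
    assert(lower['q'] && !lower['Q']);

    // Predefined sets are looked up by property, block or script name.
    auto cyrillic = unicode.Cyrillic;
    assert(cyrillic['я'] && !cyrillic['z']);

    // Freeze a set into a 2-level dchar --> bool lookup table.
    // The bit widths of the levels must add up to 21.
    auto trie = codepointSetTrie!(13, 8)(alnum | cyrillic);
    assert(trie['q'] && trie['я'] && !trie['_']);
}
```

The set is the convenient form for composing and editing; the trie is
the form to keep around for hot lookup paths.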
Caveats:
- the NFC & NFKC normalization forms are in the works, I'll try to get them
done sometime later this week.
- more than that, normalization depends on a patched Phobos and still
often fails due to the bug
http://d.puremagic.com/issues/show_bug.cgi?id=4584.
Patched Phobos is here:
https://github.com/blackwhale/phobos/tree/stable-sort
- no 64-bit currently. Somehow I managed to break my _fresh_ 64-bit
installation of dmd (it fails both on Phobos unit tests & anything in my
project), thus x64 lacks the bulk of the generated tables and is unsupported
right now. Any help is appreciated.
Grab sources + tests, benchmarks, tools and sample data from:
https://github.com/blackwhale/gsoc-bench-2012/zipball/beta
And the sketchy DDoc:
http://blackwhale.github.com/phobos/std_uni.html
The first step to usage is "import uni;" instead of "import std.uni;" and
adding uni.d to your command line.
Note: icmp may conflict with its brain-dead twin from std.algorithm (or
was that std.string?); use the usual tricks to disambiguate as necessary.
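
One of the usual tricks is a renamed import, so that every call names its
module explicitly. A small sketch, written against std.uni as it exists
in Phobos today; with the standalone uni.d from this post the import
would read "import u = uni;" instead (the alias name "u" is arbitrary).

```d
import std.string;  // may bring its own icmp into scope
import u = std.uni; // renamed import: symbols are reachable only as u.xxx

void main()
{
    // Qualified through the alias, so there is no ambiguity about
    // which icmp gets called.
    assert(u.icmp("Rußland", "RUSSLAND") == 0);
}
```

A selective import ("import std.uni : icmp;") or a fully qualified call
works just as well.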
I'd enjoy some feedback, as way back in 2010 I recall a lot of
Unicode-aware people longing for grapheme support. A short list of Ali
Çehreli, Fawzi Mohamed and Michel Fortin comes to mind; maybe others will
chime in.
P.S. Consider it as "ready for comments" as opposed to "ready for review".
P.P.S. Volunteers who'd like to test x64 are welcome to run
rdmd gen_uni.d
and report back (maybe it's my local setup problem).
--
Olshansky Dmitry