Well, officially the final bell has rung, marking the end of GSOC.

Meaning it's about time to show the project to the community.
This time around I sadly have some unresolved issues. Some of these are my fault; others are well-known bugs in Phobos or the compiler.

Still, there is a lot of cool stuff in there that I'd love to tell you about:

- all functions isXXX and toUpper/toLower of the old std.uni interface suddenly became faster and/or smarter

- an icmp function that does proper case-insensitive string comparison and matches e.g. German ß (Sulzbacher form) as equal to 'ss' (full casefolding rules)

- performance maniacs can use the faster/simpler one: sicmp, which maps only 1:1 codepoints (simple casefolding rules)

- extended grapheme cluster support: a decode operation (decodeGrapheme) and a slightly simpler one, a la std.utf.stride, that only returns the length in code units (graphemeStride)

- normalization: currently only NFD & NFKD, and these have some issues, see below (I also still need to triple-check the correctness). NFC & NFKC are coming soon

- decomposition (composition is coming): either Canonical or Compatibility; it also yields a Grapheme with the decomposed codepoints
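
To make the list above a bit more concrete, here is a minimal sketch of the comparison, grapheme and normalization calls. It is written against the std.uni that ships with a recent Phobos, on the assumption that the names listed above (icmp, sicmp, graphemeStride, decodeGrapheme) keep the same meaning there; it has not been checked against the beta, where the import would be "import uni;" and normalization may still fail as described under Caveats.

```d
import std.uni; // assumption: with the beta this would be "import uni;"

void main()
{
    // Full case folding: 'ß' folds to "ss", so the strings compare equal.
    assert(icmp("Rußland", "Russland") == 0);
    // Simple (1:1) folding cannot map 'ß' to two codepoints.
    assert(sicmp("Rußland", "Russland") != 0);
    // When only 1:1 mappings are involved, sicmp is enough.
    assert(sicmp("Август", "авгусТ") == 0);

    // 'e' + U+0301 (combining acute) is one grapheme cluster:
    // 1 + 2 UTF-8 code units.
    string s = "e\u0301x";
    assert(graphemeStride(s, 0) == 3);

    // decodeGrapheme consumes the cluster from the front of the input.
    auto g = decodeGrapheme(s);
    assert(g.length == 2); // two codepoints in the cluster
    assert(s == "x");

    // NFD splits precomposed U+00E9 into 'e' + U+0301.
    assert(normalize!NFD("\u00E9") == "e\u0301");
}
```

Note that decodeGrapheme needs an lvalue, since it advances the input past the cluster it returns.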

And last but not least, library users get access to all the power toys used to construct the above algorithms:
    1) codepoint sets with full & fast set ops
    2) highly customizable multi-stage lookup table (aka Trie) with easy helpers to construct optimal multi-level dchar-->bool tables
    3) a ton of predefined Unicode sets: see general property, block or script
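
A rough sketch of how these three pieces fit together. The names used here (unicode.Cyrillic, unicode.Armenian, unicode.ASCII, toTrie) are taken from the std.uni in a recent Phobos and are an assumption as far as the beta is concerned; the DDoc linked below is the place to check what the beta actually exposes.

```d
import std.uni; // assumption: with the beta this would be "import uni;"

void main()
{
    // Predefined sets, looked up by script/block name.
    auto set = unicode.Cyrillic | unicode.Armenian; // union
    auto ascii = unicode.ASCII;

    // Set ops: intersection, union, difference.
    assert((set & ascii).empty); // the scripts share nothing with ASCII
    auto both = set | ascii;
    assert(both['\u042F']);      // CYRILLIC CAPITAL LETTER YA
    assert(both['a']);
    assert(!(both - ascii)['a']);

    // Build a 2-level dchar-->bool lookup table from the set.
    auto trie = toTrie!2(both);
    assert(trie['\u042F']);
    assert(!trie['\u00A5']);     // YEN SIGN is in neither set
}
```

Membership tests on the set itself are fine for occasional checks; the trie is the thing to build once and reuse when the lookup sits in a hot loop.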

Caveats:
- NFC & NFKC normalization are in the works; I'll try to get them done sometime later this week.

- more than that, normalization depends on a patched Phobos and still often fails due to this bug: http://d.puremagic.com/issues/show_bug.cgi?id=4584

Patched Phobos is here: https://github.com/blackwhale/phobos/tree/stable-sort

- no 64bit currently. Somehow I managed to break my _fresh_ 64bit installation of dmd (it fails both on Phobos unit tests & anything in my project), thus x64 lacks the bulk of the generated tables and is unsupported right now. Any help is appreciated.

Grab sources + tests, benchmarks, tools and sample data from:
https://github.com/blackwhale/gsoc-bench-2012/zipball/beta

And the sketchy DDoc:
http://blackwhale.github.com/phobos/std_uni.html

To start using it, write "import uni;" instead of "import std.uni;" and add uni.d to your command line.

Note: icmp may conflict with its brain dead twin from std.algorithm (or was that std.string?); use the usual tricks to disambiguate as necessary.
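
For anyone unsure what "the usual tricks" are in D: a renamed selective import, or a static import plus the fully qualified name. A small sketch follows; std.uni stands in for the "uni" module here (an assumption made so the snippet compiles without uni.d), and the alias name uniIcmp is made up for illustration.

```d
// Assumption: std.uni stands in for the beta's "uni" module.
import std.uni : uniIcmp = icmp; // trick 1: renamed selective import
static import std.uni;           // trick 2: only fully qualified access

void main()
{
    // The alias cannot clash with another module's icmp.
    assert(uniIcmp("Rußland", "Russland") == 0);
    // The fully qualified name is always unambiguous.
    assert(std.uni.icmp("Rußland", "Russland") == 0);
}
```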

I'd enjoy some feedback, as way back in 2010 I recall a lot of Unicode-aware people longing for grapheme support. A short list of Ali Çehreli, Fawzi Mohamed and Michel Fortin comes to mind; maybe others will chime in.

P.S. Consider it as "ready for comments" as opposed to "ready for review".

P.P.S. Volunteers who'd like to test x64 are welcome to run
 rdmd gen_uni.d
and report back (maybe it's my local setup problem).


--
Olshansky Dmitry
