28-Apr-2013 20:56, Jesse Phillips wrote:
This is a replacement module for the current std.uni by Dmitry
Olshansky. The std.uni module provides an implementation of fundamental
Unicode algorithms and data structures.

To use this module, install the 2.063 beta, import uni; rather than
std.uni, and compile the two source files uni.d and unicode_tables.d
along with your code.

Docs:
http://blackwhale.github.io/phobos/uni.html

Source:
https://github.com/blackwhale/gsoc-bench-2012

DMD Beta:
http://forum.dlang.org/post/517c8552.7040...@digitalmars.com

It should be noted that inclusion into Phobos may require addressing
inter-dependencies, see "Reducing the inter-dependencies"
http://forum.dlang.org/post/kl8hn8$bm3$1...@digitalmars.com

We have only one week left for review, so I'd like to sort out the last issues before we get to the voting.

First, to fill you in on the latest developments.
With a bunch of ugly hacks I've managed to integrate the new std.uni into my Phobos fork, and it now passes the unittests for me (on win32 at least).

See it hanging there and waiting to be destroyed by the pull tester:
https://github.com/D-Programming-Language/phobos/pull/1289

Remaining issues that I'm aware of:
- proper toLower/toUpper (the current one is a simplified codepoint-for-codepoint mapping)
- clean up the debris after crash-landing back into Phobos, revert some unrelated changes, etc.
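
To illustrate why a codepoint-for-codepoint table is only a simplification, here is a small sketch in Python (chosen only because its str.upper() applies the untailored full Unicode mapping; it is not the std.uni implementation). The helper name expands_on_upper is made up for this example. Some characters map to more than one code point, which a 1:1 table cannot express:

```python
# Characters whose full Unicode upper-case mapping is longer than one
# code point, so a one-code-point-in, one-code-point-out table cannot
# represent them.
SAMPLES = ["\u00df", "\ufb01"]  # LATIN SMALL LETTER SHARP S, LATIN SMALL LIGATURE FI


def expands_on_upper(ch):
    """Return True if upper-casing a single code point yields several."""
    return len(ch.upper()) > 1


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(c) for c in s)


for ch in SAMPLES:
    print(code_points(ch), "->", code_points(ch.upper()),
          "(expands)" if expands_on_upper(ch) else "(1:1)")
```

A proper toUpper therefore has to be able to return a string longer than its input.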

Please take the time to make that list grow, especially w.r.t. interface choices and the code itself.

Separately, I'd also need to remove the rudimentary versions of the same data structures used in std.regex and rewire it to use the new std.uni.

There are a few bugs and issues uncovered during integration that I'd like to get feedback on.

std.string has a bogus test for toLower:
Of the very few tests being done, two cover a very special corner case around \u0130, which is capital I with a dot above, and expect it to be lowercased to a plain i. But it's *not* supposed to be: this conversion is specific to the Turkish locale (i.e. a tailoring). What should happen is unfolding it to the 2-codepoint sequence 'i' + 'combining dot above' (this is in the works).

I just hope nobody depends on these particular conversions, and I wonder who put them there in the first place.
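
As a cross-check of the untailored behaviour described above, a minimal Python sketch (Python's str.lower() applies the locale-independent full mapping; this is an outside reference point, not the std.uni code, and the helper name describe is made up):

```python
import unicodedata


def describe(s):
    """List each code point of s as 'U+XXXX NAME'."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c)) for c in s]


dotted_capital_i = "\u0130"         # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted_capital_i.lower()  # no locale or tailoring involved

for line in describe(lowered):
    print(line)
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```

The result is two code points, not the single 'i' that the existing std.string test expects.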

std.json is another thing: 0x7F is, for some reason, specifically tested as being accepted as part of a string literal. Yet the ECMAScript docs clearly state that Unicode control characters are to be stripped even before lexing (ignored even in literals).
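
As a reference point for that discussion, here is a small Python sketch of the "unescaped" rule from the JSON grammar in RFC 4627 (%x20-21 / %x23-5B / %x5D-10FFFF). The function name may_appear_unescaped is made up for illustration, and the sketch says nothing about what std.json does internally; it only restates which code points that grammar allows raw inside a string:

```python
import unicodedata


def may_appear_unescaped(cp):
    """RFC 4627 'unescaped' production: %x20-21 / %x23-5B / %x5D-10FFFF."""
    return (0x20 <= cp <= 0x21) or (0x23 <= cp <= 0x5B) or (0x5D <= cp <= 0x10FFFF)


for cp in (0x1F, 0x22, 0x5C, 0x7F):
    print("U+%04X category=%s unescaped_ok=%s"
          % (cp, unicodedata.category(chr(cp)), may_appear_unescaped(cp)))
```

Under that rule U+007F (general category Cc) may appear unescaped, while U+0000 to U+001F, the quotation mark and the backslash may not. Whether std.json should follow the JSON grammar or the ECMAScript behaviour here is the open question.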

P.S. Someday I need to track down and file about 2 (or 3?) distinct compiler bugs (fwd-ref hell, private alias hijacking) that I worked around while getting there.
Another one has a fix already (thanks, Kenji):
http://d.puremagic.com/issues/show_bug.cgi?id=10067

--
Dmitry Olshansky
