On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
On 1/17/11 10:55 AM, spir wrote:
On 01/15/2011 12:21 AM, Michel Fortin wrote:
Also, it'd really help this discussion to have some hard numbers about
the cost of decoding graphemes.

Text has a perf module that provides such numbers for the different
stages of Text object construction. The measured algos are not yet
stabilised, so the numbers change regularly, but in the right
direction ;-)
You can try the current version at
https://bitbucket.org/denispir/denispir-d/src (the perf module is called
chrono.d)

For information, the cost of full text construction (decoding,
normalisation with both decomposition and ordering, and piling) was
recently about 5 times that of decoding alone, the heavy part (~70%)
being piling. But Stephan just informed me about a new gain in piling
that I have not yet tested.
This performance places our library in between Windows native tools and
ICU in terms of speed, which is imo rather good for a brand new tool
written in a still unstable language.
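
To make the three stages named above concrete, here is a minimal sketch
in Python rather than D. It is not the Text library's code: the function
names are hypothetical, and "piling" is assumed here to mean grouping
each base code point with the combining marks that follow it, which is
simpler than full grapheme segmentation.

```python
import unicodedata


def decode(raw):
    """Stage 1: decode UTF-8 bytes into a string of code points."""
    return raw.decode("utf-8")


def normalise(code_points):
    """Stage 2: canonical decomposition plus canonical ordering (NFD)."""
    return unicodedata.normalize("NFD", code_points)


def pile(code_points):
    """Stage 3: group each base code point with its trailing marks.

    Assumption: a mark is any code point with a non-zero canonical
    combining class. Real grapheme rules (UAX #29) cover more cases.
    """
    piles = []
    for cp in code_points:
        if piles and unicodedata.combining(cp) != 0:
            piles[-1] += cp   # attach the mark to the current pile
        else:
            piles.append(cp)  # start a new pile
    return piles


def build_text(raw):
    """Full construction: decode, then normalise, then pile."""
    return pile(normalise(decode(raw)))
```

With this sketch, a precomposed "é" and an "e" followed by a combining
acute accent end up as the same single pile, which is what makes
comparison by pile meaningful.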

I have carefully read your arguments that Text's approach of
systematically "piling" and normalising source texts is not the right
one from an efficiency point of view, even for strict use cases of
universal text manipulation (because the relative space cost would
indirectly cause a time cost due to cache effects). Instead, you state
that we should "pile" and/or normalise on the fly. But I am, like you,
rather doubtful on this point as long as no numbers are available.
So, let us produce some benchmark results on both approaches if you like.
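
A benchmark of the two approaches could be sketched as follows, in
Python for illustration only. All names (iter_piles, EagerText,
count_on_the_fly) are hypothetical and are not part of Text; the sample
string and repeat counts are arbitrary, and no timings are claimed here.
Note that in this sketch only the piling is lazy: normalisation still
runs over the whole string first.

```python
import timeit
import unicodedata


def iter_piles(code_points):
    """On the fly: yield piles one at a time from NFD-normalised input."""
    current = ""
    for cp in unicodedata.normalize("NFD", code_points):
        if current and unicodedata.combining(cp) != 0:
            current += cp          # mark joins the pile being built
        else:
            if current:
                yield current      # previous pile is complete
            current = cp
    if current:
        yield current


class EagerText:
    """Eager: pile once at construction and keep the piles in memory."""

    def __init__(self, code_points):
        self.piles = list(iter_piles(code_points))

    def count(self, pile):
        return self.piles.count(pile)


def count_on_the_fly(code_points, pile):
    """Re-pile the source on every call instead of storing the piles."""
    return sum(1 for p in iter_piles(code_points) if p == pile)


if __name__ == "__main__":
    sample = "re\u0301sume\u0301 " * 2000
    target = "e\u0301"
    eager = EagerText(sample)
    print("eager build:", timeit.timeit(lambda: EagerText(sample), number=5))
    print("eager count:", timeit.timeit(lambda: eager.count(target), number=5))
    print("on the fly :", timeit.timeit(
        lambda: count_on_the_fly(sample, target), number=5))
```

The eager variant pays the construction cost and the extra memory once,
then answers each query from the stored piles; the on-the-fly variant
stores nothing but repeats the piling work for every query.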

Congrats on this great work. The initial numbers are in keeping with my
expectation; UTF adds for certain primitives up to 3x overhead compared
to ASCII, and I expect combining character handling to bring about as
much on top of that.

Your work and Steve's won't go to waste; one way or another we need to
add grapheme-based processing to D. I think it would be great if later
on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly the
interface, when you have time for that (no hurry), at
https://bitbucket.org/denispir/denispir-d/src

It is actually a bit more than just a string type considering true
characters as natural elements.
* It is a textual type providing a client interface of common text
  manipulation methods similar to ones in common high-level languages
  (including the fact that a character is a singleton string).
* The repo also holds the main module (unicodedata) of Text's sister lib
  (dunicode), providing access to various Unicode algos and data.
(We are about to merge the 2 libs into a new repository.)

Denis
_________________
vita es estrany
spir.wikidot.com
