On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

> D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.

It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.

There is no easy notion of "character" in Unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?
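To make the point concrete, here is a minimal sketch using "é", which can be spelled either as one precomposed code point or as 'e' followed by a combining accent. It assumes a Phobos that provides std.utf.byDchar and std.uni.byGrapheme; the helper names countCodePoints and countGraphemes are made up for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Hypothetical helpers, named only for this example.
size_t countCodePoints(string s) { return s.byDchar.walkLength; }
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

unittest
{
    string precomposed = "\u00E9";  // U+00E9, "é" as a single code point
    string decomposed = "e\u0301";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // The two spellings disagree at the code point level...
    assert(countCodePoints(precomposed) == 1);
    assert(countCodePoints(decomposed) == 2);

    // ...but both are one user-perceived character.
    assert(countGraphemes(precomposed) == 1);
    assert(countGraphemes(decomposed) == 1);
}
```

Something like `dmd -unittest -main -run file.d` should run the checks. Code that counts or slices by dchar treats the second spelling as two "characters", which is the same trap UCS-2 code fell into one level down.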

Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:

        string str = "hello";
        foreach (cu; str) {}            // code unit iteration
        foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
        foreach (gr; str.graphemes) {}  // grapheme iteration, bidirectional range of graphemes

That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.
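As a rough sketch of what the explicit levels could look like, the snippet below defines codePoints and graphemes as thin wrappers. Those two names come from the proposal and are not Phobos functions; here they are assumed to forward to std.utf.byDchar and std.uni.byGrapheme. As far as I know those are forward ranges rather than bidirectional ones, so this only illustrates the shape of the interface, not the full requirement.

```d
import std.range : ElementType, walkLength;
import std.uni : byGrapheme, Grapheme;
import std.utf : byDchar;

// Hypothetical wrappers matching the names used in the proposal.
auto codePoints(string s) { return s.byDchar; }
auto graphemes(string s) { return s.byGrapheme; }

// Each level has its own element type.
static assert(is(ElementType!(typeof("".codePoints)) == dchar));
static assert(is(ElementType!(typeof("".graphemes)) == Grapheme));

unittest
{
    string str = "noe\u0308l"; // "noël" with a combining diaeresis (U+0308)

    size_t units, points, clusters;
    foreach (cu; str) ++units;               // code units (char)
    foreach (cp; str.codePoints) ++points;   // code points (dchar)
    foreach (gr; str.graphemes) ++clusters;  // graphemes

    assert(units == 6);    // U+0308 takes two UTF-8 code units
    assert(points == 5);
    assert(clusters == 4);
}
```

The caller picks the level by name, and the plain array stays what it is: an array of code units.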

Here's a nice reference about Unicode graphemes, word segmentation, and related algorithms:
<http://unicode.org/reports/tr29/>

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
