Re: std.algorithm.remove and principle of least astonishment

Michel Fortin Sun, 21 Nov 2010 17:15:33 -0800

On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

D strings exhibit no such problems. They expose their implementation - array of code units. Having that available is often handy. They also obey a formal interface - bidirectional ranges.

It's convenient that char[] and wchar[] expose a dchar bidirectional range interface... but only when a dchar bidirectional range is what you want to use. If you want to iterate over code units (lower-level representation), or graphemes (upper-level representation), then it gets in your way.

There is no easy notion of "character" in Unicode. A code point is *not* a character. One character can span multiple code points. I fear treating dchars as "the default character unit" is repeating the same kind of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and treating each 2-byte code unit as a character. I mean, what's the point of working with the intermediary representation (code points) when it doesn't represent a character?
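To make the "one character can span multiple code points" point concrete, here is a minimal sketch. It assumes a present-day Phobos in which `std.uni.byGrapheme` is available (it was not part of the library being discussed in this thread); the helper names `countCodePoints` and `countGraphemes` are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named here for illustration only.
// walkLength on a string walks it as a range of dchar (code points).
size_t countCodePoints(string s) { return s.walkLength; }

// byGrapheme (assumes a present-day Phobos) yields user-perceived characters.
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // "é" written two ways: precomposed U+00E9, and 'e' followed by
    // the combining acute accent U+0301.
    string precomposed = "\u00E9";
    string decomposed  = "e\u0301";

    // The code point counts differ...
    assert(countCodePoints(precomposed) == 1);
    assert(countCodePoints(decomposed) == 2);

    // ...but both are a single user-perceived character.
    assert(countGraphemes(precomposed) == 1);
    assert(countGraphemes(decomposed) == 1);
}
```

An algorithm that works dchar by dchar would treat the decomposed form as two elements and could separate the accent from its base letter.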

Instead, I think it'd be better that the level one wants to work at be made explicit. If one wants to work with code points, he just rolls a code-point bidirectional range on top of the string. If one wants to work with graphemes (user-perceived characters), he just rolls a grapheme bidirectional range on top of the string. In other words:


        string str = "hello";
        foreach (cu; str) {}            // code unit iteration
        foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
        foreach (gr; str.graphemes) {}  // grapheme iteration, bidirectional range of graphemes

That'd be much cleaner than having some sort of hybrid code-point/code-unit array/range.
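For comparison, the three explicit levels can be approximated with range adaptors from a present-day Phobos. This is only a sketch under that assumption: `byCodeUnit`, `byDchar` and `byGrapheme` are not the `codePoints` and `graphemes` properties proposed above, and the `countLevels` helper is a made-up name.

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: counts the elements seen at each level.
// Returns [code units, code points, graphemes].
size_t[3] countLevels(string str)
{
    size_t units, points, graphemes;
    foreach (cu; str.byCodeUnit) ++units;     // code unit iteration (char)
    foreach (cp; str.byDchar)    ++points;    // code point iteration (dchar)
    foreach (gr; str.byGrapheme) ++graphemes; // grapheme iteration (Grapheme)
    return [units, points, graphemes];
}

void main()
{
    // "noël" spelled with 'e' followed by the combining diaeresis U+0308,
    // which takes two UTF-8 code units.
    string str = "noe\u0308l";
    auto counts = countLevels(str);

    assert(counts[0] == 6); // code units
    assert(counts[1] == 5); // code points
    assert(counts[2] == 4); // graphemes
}
```

Each loop names the level it works at, so the same string yields three different element counts with no ambiguity about which one is meant.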

Here's a nice reference about Unicode graphemes, word segmentation, and related algorithms.

<http://unicode.org/reports/tr29/>

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
