bearophile wrote:
Walter Bright:
1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices.

2. doing the operations in (1) using code points and having to continually
 decode the strings would result in disastrously slow code.

In my original post I forgot another difference from arrays: 5b) a
method like ".unit()" that indexes code units, so "foo".unit(1) is
always O(1). Lower-level code can use this method the way [] is used for arrays.

This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code point; in fact, I've never needed to index by code point.

(Though it is sometimes necessary to step through by code point, that's different from indexing by code point.)
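
To make the indexing versus stepping distinction concrete, here is a minimal sketch in D. The helper name countCodePoints is made up for illustration; it relies only on the built-in behavior that s[i] and s.length work on UTF-8 code units, while foreach with a dchar loop variable decodes code points as it goes.

```d
// Hypothetical helper: counts code points by stepping through the string.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // foreach with dchar decodes UTF-8 on the fly
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";          // U+00E9 takes two UTF-8 code units

    // .length and [i] are ordinary O(1) array operations on code units.
    assert(s.length == 6);
    assert(s[0] == 'h');
    assert(s[1] == 0xC3);             // first byte of the two-byte sequence

    // Stepping sees five code points, without ever indexing by code point.
    assert(countCodePoints(s) == 5);
}
```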


3. the user can always layer a code point interface over the strings, but
going the other way is not so practical.

This is true. But it makes the string usage unnecessarily low-level and
hard...

I don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer.
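
As a rough illustration of point 3, a code point interface can be layered over a plain string in a few lines. The function codePointAt below is a hypothetical helper, not something in Phobos; it uses std.utf.decode, which advances a code unit index past one decoded code point. Note that it has to walk from the start, so it is O(n) where s[i] is O(1).

```d
import std.exception : enforce;
import std.utf : decode;

// Hypothetical helper: returns the n-th code point of s (0-based).
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;                      // code unit index, advanced by decode
    dchar c;
    foreach (_; 0 .. n + 1)
    {
        enforce(i < s.length, "code point index out of range");
        c = decode(s, i);              // throws UTFException on invalid UTF-8
    }
    return c;
}

void main()
{
    string s = "日本語";                // three code points, nine code units
    assert(s.length == 9);
    assert(codePointAt(s, 0) == '日');
    assert(codePointAt(s, 2) == '語');
}
```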


A better design in a smart systems language such as D is to give strings a
default high-level "interface" that sees strings as what they are at a high
level, and add a second, lower-level interface for when you need faster
lower-level fiddling (so they have [] that returns code points and unit()
that returns code units).

I have some moderate experience with using UTF. First there's the D JavaScript engine, which is fully UTF'd. The D string design fits in with it perfectly. Then there are chunks of C++ ASCII-only code I've translated to D, and it then worked with UTF-8 without further modification.

Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
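
A small sketch of why ASCII-oriented code tends to keep working on UTF-8 input: every byte of a multibyte UTF-8 sequence is 0x80 or above, so a comparison against an ASCII character can never match in the middle of a character. The routine firstField is invented for this example and is not taken from any of the code mentioned above.

```d
// Hypothetical ASCII-style routine: returns the text before the first ','.
// It only ever compares individual code units against an ASCII character.
string firstField(string line)
{
    foreach (i; 0 .. line.length)
        if (line[i] == ',')
            return line[0 .. i];       // slice ends on a character boundary
    return line;                       // no comma found
}

void main()
{
    assert(firstField("abc,def") == "abc");
    // Multibyte characters pass through untouched, with no decoding.
    assert(firstField("Zürich,café") == "Zürich");
}
```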
