Re: Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu Wed, 03 Feb 2010 23:10:16 -0800

Ali Çehreli wrote:

Andrei Alexandrescu wrote:
 > It's no secret that string et al. are not a magic recipe for writing
 > correct Unicode code. However, things are pretty good and could be
 > further improved by operating the following changes in std.array and
 > std.range:
 >
 > - make front() and back() for UTF-8 and UTF-16 automatically decode the
 > first and last Unicode character

They would yield dchar, right? Wouldn't that cause trouble in templated code?

Yes, dchar. There was some figuring out in parts of Phobos, but the gains are well worth it.

The simplifications are enormous. Until now, Phobos didn't hit the nail on the head with simple encoding/decoding/transcoding primitives. There were many attempts in std.utf, std.encoding, and std.string - all very clunky to use. Now I can just write s.front to get the first dchar of any string, and s.popFront to drop it. Very simple!
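To make that concrete, here is a minimal sketch of what s.front and s.popFront look like on a UTF-8 string. It assumes a present-day Phobos, where (as far as I know) these primitives live in std.range.primitives rather than std.array; the sample string is only an illustration.

```d
import std.range.primitives : ElementType, empty, front, popFront;

// For a UTF-8 string, the range element type is dchar, not char.
static assert(is(ElementType!string == dchar));

void main()
{
    string s = "ağ";        // 3 UTF-8 code units, 2 code points
    assert(s.length == 3);

    assert(s.front == 'a'); // front decodes the first code point
    s.popFront();           // drops one whole code point (1 code unit here)

    assert(s.front == 'ğ');
    s.popFront();           // drops both code units of 'ğ'
    assert(s.empty);
}
```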

 > - make popFront() and popBack() skip one entire Unicode character
 > (instead of just one code unit)

That's perfectly fine, because the opposite operations do "encode":

    string s = "ağ";
    assert(s.length == 3);
    s ~= 'ş';
    assert(s.length == 5);

 > - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings

Ok.

 > - change hasLength to return false for UTF-8 and UTF-16 strings

I don't understand that one. Strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when a by-1 change is desired.

hasLength is a property used by range algorithms to tell them that a range stores the length with a particular meaning (the number of elements). It is perfectly fine that strings don't obey hasLength but do expose .length - it's just that it has different semantics.
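A small sketch of how a range algorithm might rely on that distinction, assuming the proposed behavior (which, to my knowledge, is what current Phobos implements). The helper elementCount is hypothetical and exists only for this illustration.

```d
import std.range.primitives : hasLength, isRandomAccessRange, walkLength;

// Under the proposal, narrow strings are not random-access and do not
// advertise a length in the "number of elements" sense.
static assert(!hasLength!string);
static assert(!hasLength!wstring);
static assert( hasLength!dstring);
static assert(!isRandomAccessRange!string);
static assert( isRandomAccessRange!dstring);

// Hypothetical helper: trust .length only when hasLength says it
// really is the element count; otherwise count by iterating.
size_t elementCount(R)(R r)
{
    static if (hasLength!R)
        return r.length;
    else
        return r.walkLength;
}

void main()
{
    assert("ağ".length == 3);          // code units
    assert(elementCount("ağ") == 2);   // code points
    assert(elementCount("ağ"d) == 2);  // dstring: length is the count
}
```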

 > (b) Operate the change and mention that in range algorithms you should
 > check hasLength and only then use "length" under the assumption that it
 > really means "elements count".

The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;)

Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar?

I think we could dedicate a special function for that. In fact, I think it already exists - it's called stride().
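For reference, a minimal sketch of the two existing std.utf primitives mentioned here, assuming their current signatures: stride(s, i) reports how many code units the code point at index i occupies, and decode(s, i) returns the code point and advances i by reference.

```d
import std.utf : decode, stride;

void main()
{
    string s = "ağş";

    // stride only measures; it does not decode or advance anything.
    assert(stride(s, 0) == 1);  // 'a' is 1 code unit
    assert(stride(s, 1) == 2);  // 'ğ' is 2 code units

    // decode returns the code point and moves the index past it.
    size_t i = 0;
    dchar c = decode(s, i);
    assert(c == 'a' && i == 1);
    c = decode(s, i);
    assert(c == 'ğ' && i == 3);
}
```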



Andrei
