On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

> It's no secret that string et al. are not a magic recipe for writing correct Unicode code.
>
> [...]
>
> What would you do? Any ideas are welcome.

UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 string and want to search for an occurrence of that string in another UTF-8 string, you don't have to decode each multi-byte code point: a binary comparison is enough. If you're counting the number of code points, then all you need is to count the code units that are not continuation bytes, that is, those whose two most significant bits are not 10. If on the other hand you're applying a character-by-character transformation, then you need to fully decode each character, unless you're only interested in transforming characters from the lower non-multibyte subrange of the encoding (which happens quite often).
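To make those three cases concrete, here is a rough sketch working directly on code units. The function names (containsUtf8, countCodePoints, upcaseAscii) are made up for illustration, and it assumes the input is valid UTF-8.

```d
import std.algorithm.searching : canFind;
import std.string : representation;

// Substring search on the raw bytes, with no decoding. This relies on valid
// UTF-8 being self-synchronizing: an encoded needle cannot match starting
// in the middle of another encoded character.
bool containsUtf8(const(char)[] haystack, const(char)[] needle)
{
    return haystack.representation.canFind(needle.representation);
}

// Count code points by counting every code unit that is not a
// continuation byte (continuation bytes look like 10xxxxxx).
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (ubyte unit; s.representation)
        if ((unit & 0xC0) != 0x80)
            ++n;
    return n;
}

// Uppercase only the ASCII letters, in place. Every unit of a multi-byte
// sequence is >= 0x80, so it never falls in 'a' .. 'z' and is left alone.
void upcaseAscii(char[] s)
{
    foreach (ref char c; s) // iterating as char visits code units, not code points
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
}
```

None of these functions ever builds a dchar, which is the point: the cost of decoding is paid only by code that actually needs code points.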

So I don't think there's a one-size-fits-all way to iterate over string arrays. Fully decoding each code point is clearly the most costly method; it shouldn't be required when it's not necessary.

I think we need to be able to represent char[] and wchar[] as a range of dchar to deal with cases where you want to iterate over Unicode code points, but I'd let the programmer ultimately decide what to do.
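As a sketch of what such an opt-in view can look like: foreach with a dchar loop variable already decodes a char[] or wchar[] on the fly, and I'm assuming here the lazy std.utf.byDchar adapter from later Phobos releases (it is not something available at the time of this message). The helper names codePoints and decodeAll are hypothetical.

```d
import std.range : walkLength;
import std.utf : byDchar;

// Number of code points in a char[] or wchar[], obtained by walking the
// decoded view. The underlying array is left untouched.
size_t codePoints(S)(S s)
{
    return s.byDchar.walkLength;
}

// Eagerly decode a UTF-8 array into code points using the language's
// built-in decoding foreach.
dchar[] decodeAll(const(char)[] s)
{
    dchar[] result;
    foreach (dchar c; s)
        result ~= c;
    return result;
}
```

The array keeps its .length in code units, and the programmer asks for the code point view only where it matters.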

As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * array[0].sizeof), not the number of characters. So it doesn't make much sense to disable .length.
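For example, a low-level copy along these lines needs the size in bytes and nothing else. This is only a sketch; byteLength and copyInto are hypothetical helpers, not anything from Phobos.

```d
import core.stdc.string : memcpy;

// Size of the array's contents in bytes: code units times unit size.
size_t byteLength(T)(T[] a)
{
    return a.length * T.sizeof;
}

// Copy the raw code units into a byte buffer and return the number of
// bytes written. No decoding is involved at any point.
size_t copyInto(T)(T[] src, ubyte[] buffer)
{
    immutable n = byteLength(src);
    assert(n <= buffer.length, "buffer too small");
    memcpy(buffer.ptr, src.ptr, n);
    return n;
}
```

With the five-character string "héllo" this reports 6 bytes as a string, 10 as a wstring and 20 as a dstring, and none of those numbers can be derived from the character count alone.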

So my answer would be mostly to leave things as they are.

Perhaps the char[] and wchar[] as dchar ranges could be aliased to string and wstring, but that'd definitely be a blow to the philosophy of strings as simple arrays. You'd also still need to be able to access the actual array underneath. And will all the implicit conversions still work? I'm really not sure it's worth it, but perhaps.

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
