Re: Making all strings UTF ranges has some risk of WTF

Michel Fortin Wed, 03 Feb 2010 20:20:17 -0800

On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

It's no secret that string et al. are not a magic recipe for writing correct Unicode code.
[...]

What would you do? Any ideas are welcome.

UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 string and want to search for an occurrence of that string in another UTF-8 string, you don't have to decode each multi-byte code point: a binary comparison is enough. If you're counting the number of code points, then all you need is to count the code units that start a code point, that is, every code unit except the continuation bytes of the form 10xxxxxx. If on the other hand you're applying a character-by-character transformation, then you need to fully decode each character, unless you're only interested in transforming characters from the lower non-multibyte subrange of the encoding (which happens quite often).
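To make the three cases concrete, here is a minimal sketch in D. The function names (containsUtf8, countCodePoints, asciiToUpper) are made up for illustration and are not part of Phobos; the sketch also assumes the input is valid UTF-8 and does no validation.

```d
import std.algorithm.searching : canFind;
import std.string : representation;

// Substring search on raw code units. UTF-8 is self-synchronizing, so a
// byte-wise match can only start on a code point boundary.
bool containsUtf8(const(char)[] haystack, const(char)[] needle)
{
    return haystack.representation.canFind(needle.representation);
}

// Count code points without decoding: every code unit that is not a
// continuation byte (10xxxxxx) starts exactly one code point.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)            // char loop variable: no decoding
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Transform only the single-byte (ASCII) subrange. Multi-byte sequences
// are copied through untouched because all their bytes are >= 0x80.
char[] asciiToUpper(const(char)[] s)
{
    auto result = s.dup;
    foreach (ref char c; result)
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - 'a' + 'A');
    return result;
}
```

With the string "caf\u00E9" (five code units, four code points), countCodePoints returns 4 and asciiToUpper leaves the two bytes of the accented letter as they are. None of the three functions ever builds a dchar.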

Clearly, I don't think there's a one-size-fits-all way to iterate over string arrays. Fully decoding each code point is clearly the most costly method; it shouldn't be required when it's not necessary.

I think we need to be able to represent char[] and wchar[] as a range of dchar to deal with cases where you want to iterate over Unicode code points, but I'd let the programmer ultimately decide what to do.
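As a rough illustration of the opt-in decoding described above, the sketch below relies on the language's foreach: declaring the loop variable as dchar asks the compiler to decode, while the array itself stays a plain char[] or wchar[]. The helper names codePointsOf and codePointsOf16 are hypothetical.

```d
// Collect the code points of a UTF-8 array. The dchar loop variable makes
// foreach decode multi-byte sequences on the fly.
dchar[] codePointsOf(const(char)[] utf8)
{
    dchar[] result;
    foreach (dchar c; utf8)
        result ~= c;
    return result;
}

// Same idea for UTF-16: a surrogate pair is combined into a single dchar.
dchar[] codePointsOf16(const(wchar)[] utf16)
{
    dchar[] result;
    foreach (dchar c; utf16)
        result ~= c;
    return result;
}
```

The caller chooses the view: the same array can be walked as code units (char or wchar loop variable) or as code points (dchar loop variable), so decoding is paid for only where it is asked for.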

As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length.
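The buffer case might look like the following sketch. In actual D syntax the element size is spelled wchar.sizeof (or array[0].sizeof); the function names byteLength16 and copyToBuffer are invented for the example.

```d
// Size in bytes of a UTF-16 array: code units times the size of one unit.
size_t byteLength16(const(wchar)[] s)
{
    return s.length * wchar.sizeof;
}

// Copy the raw storage of a UTF-16 array into a freshly allocated byte
// buffer. Only .length is needed; no code point is ever decoded.
ubyte[] copyToBuffer(const(wchar)[] s)
{
    auto buffer = new ubyte[byteLength16(s)];
    auto raw = cast(const(ubyte)[]) s;   // reinterpret code units as bytes
    buffer[] = raw[];                    // element-wise copy, equal lengths
    return buffer;
}
```

For a two-unit wstring the buffer is four bytes long regardless of how many characters those units encode, which is exactly the number a low-level copy has to know.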


So my answer would be mostly to leave things as they are.

Perhaps the char[] and wchar[] as dchar ranges could be aliased to string and wstring, but that'd definitely be a blow to the philosophy of strings as simple arrays. You'd also still need to be able to access the actual array underneath. And will all the implicit conversions still work? I'm really not sure it's worth it, but perhaps.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
