On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisp...@gmx.com> said:

> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting property of UTF encodings: to search for a substring in a well-formed UTF sequence, you do not need to decode the larger string; comparing the UTF-x code units of the substring with the UTF-x code units of the larger string is sufficient.
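
For instance, something along these lines (a quick sketch; the strings are
illustrative, and representation() is just a way to get at the raw code
units):

    import std.algorithm.searching : find;
    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        string haystack = "la crème brûlée";
        string needle   = "brûlée";

        // representation() exposes the raw UTF-8 code units as ubyte[].
        // UTF-8 is self-synchronizing: a well-formed needle can only match
        // at a real sequence boundary, so byte-wise search is correct.
        auto hit = find(haystack.representation, needle.representation);
        writeln(hit.length ? "found" : "not found"); // "found"
    }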

Similarly, if you're searching for the 'ê' code point in a UTF-8 string, the most efficient way is to search the string for the two-byte UTF-8 sequence that encodes 'ê' (in other words, to convert 'ê' to a string first). Decoding the whole string is wasteful.
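
A sketch of what I mean (illustrative string; nothing in s gets decoded
along the way):

    import std.algorithm.searching : countUntil;
    import std.conv : to;
    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        string s = "tête-à-tête";

        // Encode the code point once: 'ê' becomes a 2-code-unit UTF-8 string.
        string pat = 'ê'.to!string;

        // Then search the raw code units of s against those of the pattern.
        writeln(countUntil(s.representation, pat.representation)); // 1, or -1 if absent
    }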


> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars, but then you can explicitly give the type
> of the foreach variable as char, but normally what people care about is
> iterating over characters, not pieces of characters.

If you want to iterate over what people actually consider characters, you need to take into account the combining marks that form multi-code-point graphemes. (You'll probably want to deal with Unicode normalization too.) Treating code points as if they were characters is a misconception in the same way that treating UTF-16 code units as characters is: both work most of the time but fail in a number of cases.
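
For example, counting the three notions on the same string (a sketch
assuming a std.uni that offers a byGrapheme range over graphemes):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "noël" with 'ë' written as 'e' + U+0308 (combining diaeresis):
        // one character to the reader, two code points, three code units.
        string s = "noe\u0308l";

        writeln(s.length);                // 6 -- UTF-8 code units
        writeln(s.walkLength);            // 5 -- code points (dchars)
        writeln(s.byGrapheme.walkLength); // 4 -- user-perceived characters
    }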


> So, I would expect the
> case where people _want_ to iterate over chars to be rare. In most cases,
> iterating over a string as chars is a bug - one which in many cases won't be
> quickly caught, because the programmer is English speaking and uses almost
> exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only solution to it?


> Defaulting to the
> guaranteed correct handling of characters and special casing when it's
> possible to write code more efficiently than that is definitely the way to go
> about it, and it's how Phobos generally does it.

Iterating on dchar is not guaranteed to be correct; it merely has a significantly better chance of being correct.
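
A small example of where it breaks down (the two spellings of 'é' here are
just illustrative):

    import std.algorithm.comparison : equal;
    import std.stdio : writeln;
    import std.uni : normalize;

    void main()
    {
        string precomposed = "\u00E9";  // 'é' as one code point
        string decomposed  = "e\u0301"; // 'e' + combining acute accent

        // Identical to the reader, but different dchar sequences:
        writeln(equal(precomposed, decomposed));                  // false
        // Only a normalized comparison gets it right:
        writeln(normalize(precomposed) == normalize(decomposed)); // true
    }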


> The fact that foreach doesn't [iterate over dchar by default] is incongruous
> with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.


>> I like the type deduction feature of foreach, and don't think it should be
>> removed for strings. Currently, it's consistent - T[] gets an element type
>> of T.

> Sure, the type deduction of foreach is great, and it's completely consistent
> that iterating over an array of chars would iterate over chars rather than
> dchars when you don't give the type. However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters.

I note that you keep confusing characters with code points.

>> I want to reiterate that there's no way to program strings in D without
>> being cognizant of them being a multibyte representation. D is both a high
>> level and a low level language, and you can pick which to use, but you
>> still gotta pick.

> I fully agree that programmers need to properly understand unicode to use
> strings in D properly. However, the problem is that the default handling of
> strings with foreach is _not_ what programmers are going to normally want, so
> the default will cause bugs.

That said, I wouldn't expect most programmers to understand Unicode. Giving them dchars by default won't eliminate bugs related to multi-code-point characters, but it will likely eliminate bugs related to multi-code-unit sequences, and that's a good start. I'd say choosing dchar is a practical compromise between "characters by default" and "the element type of the array by default", but it is neither of those ideals. How will that pragmatic trade-off fare a few years from now? I'm a little skeptical that it's the ideal solution.
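
To make the trade-off concrete (a quick sketch):

    import std.stdio : writeln;

    void main()
    {
        string s = "été"; // 3 code points, 5 UTF-8 code units

        size_t units, points;
        foreach (char c; s)  ++units;  // today's default: code units
        foreach (dchar c; s) ++points; // explicit dchar: decoded code points

        writeln(units, " ", points); // 5 3
        // Note: neither loop counts characters in the grapheme sense.
    }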

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
