On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisp...@gmx.com> said:

> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting property of UTF encodings: to search for a substring in a well-formed UTF sequence, you do not need to decode the larger string; comparing the UTF-x code units of the substring with the UTF-x code units of the larger string is sufficient.
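
For instance, something along these lines (a quick sketch; the strings are
illustrative, and representation() is just a way to get at the raw code
units):

    import std.algorithm.searching : find;
    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        string haystack = "la crème brûlée";
        string needle   = "brûlée";

        // representation() exposes the raw UTF-8 code units as ubyte[].
        // UTF-8 is self-synchronizing: a well-formed needle can only match
        // at a real sequence boundary, so byte-wise search is correct.
        auto hit = find(haystack.representation, needle.representation);
        writeln(hit.length ? "found" : "not found"); // "found"
    }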

Similarly, if you're searching for the 'ê' code point in a UTF-8 string, the most efficient way is to search the string for the two-byte UTF-8 sequence that encodes 'ê' (in other words, to convert 'ê' to a string first). Decoding the whole string is wasteful.
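
A sketch of what I mean (illustrative string; nothing in s gets decoded
along the way):

    import std.algorithm.searching : countUntil;
    import std.conv : to;
    import std.stdio : writeln;
    import std.string : representation;

    void main()
    {
        string s = "tête-à-tête";

        // Encode the code point once: 'ê' becomes a 2-code-unit UTF-8 string.
        string pat = 'ê'.to!string;

        // Then search the raw code units of s against those of the pattern.
        writeln(countUntil(s.representation, pat.representation)); // 1, or -1 if absent
    }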


> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars, but then you can explicitly give the type
> of the foreach variable as char, but normally what people care about is
> iterating over characters, not pieces of characters.

If you want to iterate over what people actually consider characters, you need to take into account the combining marks that form multi-code-point graphemes. (You'll probably want to deal with Unicode normalization too.) Treating code points as if they were characters is a misconception in the same way that treating UTF-16 code units as characters is: both work most of the time but fail in a number of cases.
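
For example, counting the three notions on the same string (a sketch
assuming a std.uni that offers a byGrapheme range over graphemes):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "noël" with 'ë' written as 'e' + U+0308 (combining diaeresis):
        // one character to the reader, two code points, three code units.
        string s = "noe\u0308l";

        writeln(s.length);                // 6 -- UTF-8 code units
        writeln(s.walkLength);            // 5 -- code points (dchars)
        writeln(s.byGrapheme.walkLength); // 4 -- user-perceived characters
    }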


> So, I would expect the
> case where people _want_ to iterate over chars to be rare. In most cases,
> iterating over a string as chars is a bug - one which in many cases won't be
> quickly caught, because the programmer is English speaking and uses almost
> exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only solution to it?


> Defaulting to the
> guaranteed correct handling of characters and special casing when it's
> possible to write code more efficiently than that is definitely the way to go
> about it, and it's how Phobos generally does it.

Iterating on dchar is not guaranteed to be correct; it merely has a significantly better chance of being correct.
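
A small example of where it breaks down (the two spellings of 'é' here are
just illustrative):

    import std.algorithm.comparison : equal;
    import std.stdio : writeln;
    import std.uni : normalize;

    void main()
    {
        string precomposed = "\u00E9";  // 'é' as one code point
        string decomposed  = "e\u0301"; // 'e' + combining acute accent

        // Identical to the reader, but different dchar sequences:
        writeln(equal(precomposed, decomposed));                  // false
        // Only a normalized comparison gets it right:
        writeln(normalize(precomposed) == normalize(decomposed)); // true
    }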


> The fact that foreach doesn't [iterate over dchar by default] is incongruous
> with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.


>> I like the type deduction feature of foreach, and don't think it should be
>> removed for strings. Currently, it's consistent - T[] gets an element type
>> of T.

> Sure, the type deduction of foreach is great, and it's completely consistent
> that iterating over an array of chars would iterate over chars rather than
> dchars when you don't give the type. However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters.

I note that you keep confusing characters with code points.

>> I want to reiterate that there's no way to program strings in D without
>> being cognizant of them being a multibyte representation. D is both a high
>> level and a low level language, and you can pick which to use, but you
>> still gotta pick.

> I fully agree that programmers need to properly understand unicode to use
> strings in D properly. However, the problem is that the default handling of
> strings with foreach is _not_ what programmers are going to normally want, so
> the default will cause bugs.

That said, I wouldn't expect most programmers to understand Unicode. Giving them dchars by default won't eliminate bugs related to multi-code-point characters, but it will likely eliminate bugs related to multi-code-unit sequences, and that's a good start. I'd say choosing dchar is a practical compromise between "characters by default" and "the element type of the array by default", but it is neither of those ideals. How will that pragmatic trade-off fare a few years from now? I'm a little skeptical that it's the ideal solution.
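
To make the trade-off concrete (a quick sketch):

    import std.stdio : writeln;

    void main()
    {
        string s = "été"; // 3 code points, 5 UTF-8 code units

        size_t units, points;
        foreach (char c; s)  ++units;  // today's default: code units
        foreach (dchar c; s) ++points; // explicit dchar: decoded code points

        writeln(units, " ", points); // 5 3
        // Note: neither loop counts characters in the grapheme sense.
    }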

--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
