Re: Why the hell doesn't foreach decode strings

Walter Bright Thu, 20 Oct 2011 21:40:32 -0700

On 10/20/2011 8:58 PM, Jonathan M Davis wrote:

And why would you iterate over a string with foreach without decoding it
unless you specifically need to operate on code units (which I would expect to
be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching.
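
A minimal sketch of that substring approach, assuming Phobos' `std.utf.encode` is available. The helper name `indexOfCodePoint` is hypothetical, not a library function. The needle is encoded once, and the haystack is then compared code unit by code unit without ever being decoded:

```d
import std.utf : encode;

// Hypothetical helper: byte index of the first occurrence of `needle`
// in the UTF-8 string `haystack`, or -1 if it does not occur.
ptrdiff_t indexOfCodePoint(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // encode the needle once
    const(char)[] pat = buf[0 .. len];

    if (haystack.length < len)
        return -1;

    // Plain code-unit comparison; the haystack is never decoded.
    foreach (i; 0 .. haystack.length - len + 1)
        if (haystack[i .. i + len] == pat)
            return i;
    return -1;
}
```

Because UTF-8 is self-synchronizing, a match on the encoded bytes in a valid UTF-8 string always starts on a code point boundary, so the result is the same as a decoding search would give.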

Even more, 99.9999% of searches involve an ASCII search string. It is simply not necessary to decode the searched string, as the code units of an encoded (multibyte) character cannot be ASCII. For example:


   foreach (c; somestring)
       if (c == '+')
           break;      // found it!

gains absolutely nothing by decoding somestring.
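
As a runnable sketch of the same point (the function name `countAscii` is an assumption made for illustration), the loop below iterates over UTF-8 code units. Every code unit of a multibyte sequence is 0x80 or higher, so it can never compare equal to an ASCII target:

```d
// Hypothetical helper: counts occurrences of an ASCII character
// by looking only at code units.
size_t countAscii(const(char)[] s, char target)
{
    assert(target < 0x80, "target must be ASCII");
    size_t n;
    foreach (char c; s)        // code units: no decoding happens here
        if (c == target)
            ++n;
    return n;
}
```

Writing `foreach (dchar c; s)` instead would make the compiler decode each code point, and for an ASCII target it would return the same count.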

(unless you're specifically looking for a code unit rather than a code
point, which would not be normal). Most anything which needs to operate on the
characters of a string needs to decode them. And iterating over them to do
much of anything would require decoding, since otherwise you're operating on
code units, and how often does anyone do that unless they're specifically
messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.

However, in most cases, that is _not_
what the programmer actually wants. They want to iterate over characters, not
pieces of characters. So, the default at this point is _wrong_ in the common
case.

This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win.

Take a look at dmd's lexer: it handles Unicode correctly and avoids doing decoding as much as possible.
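
The general idea can be sketched as follows. This is not dmd's actual code; `identifierLength` is a hypothetical helper, and it assumes Phobos' `std.utf.decode`, `std.ascii.isAlphaNum` and `std.uni.isAlpha`. ASCII code units take a fast path, and decoding happens only when a code unit of 0x80 or higher shows up:

```d
import std.ascii : isAlphaNum;
import std.uni : isAlpha;
import std.utf : decode;

// Hypothetical helper: length in code units of the identifier-like run
// at the start of `src`. For brevity it does not reject a leading digit.
size_t identifierLength(const(char)[] src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // Fast path: ASCII is handled without decoding.
            if (!(isAlphaNum(c) || c == '_'))
                break;
            ++i;
        }
        else
        {
            // Slow path: decode only this one multibyte sequence.
            size_t next = i;
            immutable dchar d = decode(src, next);  // advances `next`
            if (!isAlpha(d))
                break;
            i = next;
        }
    }
    return i;
}
```

The result is an index into the original string, so the token can be taken as the slice `src[0 .. identifierLength(src)]` without re-encoding anything.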
