On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright <newshou...@digitalmars.com> wrote:

On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
And why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search for it as a substring than to decode while searching.
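As a rough sketch of that approach (a sketch only; it assumes the Phobos functions std.utf.encode and std.string.indexOf, which are not part of the original post): encode the needle once, then search for the resulting code units as a substring, never decoding the haystack.

    import std.string : indexOf;
    import std.utf : encode;

    // Encode the needle once, then do a plain code-unit substring
    // search over the haystack; the haystack itself is never decoded.
    ptrdiff_t findDchar(string haystack, dchar needle)
    {
        char[4] buf;
        immutable len = encode(buf, needle); // needle as UTF-8
        return haystack.indexOf(buf[0 .. len]);
    }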

Even more, 99.9999% of searches involve an ASCII search string. It is simply not necessary to decode the searched string, as the code units of a multi-byte sequence can never equal an ASCII character. For example:

    foreach (c; somestring)
        if (c == '+')
            break;  // found it!

gains absolutely nothing by decoding somestring.


(unless you're specifically looking for a code unit rather than a code point, which would not be normal). Almost anything that needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.


However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case.

This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Structuring it around substrings instead is a huge win.

Take a look at dmd's lexer: it handles Unicode correctly and avoids decoding as much as possible.
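This is not dmd's actual code, but the pattern is roughly the following sketch: every code unit of a multi-byte UTF-8 sequence is 0x80 or above, so comparing raw code units against an ASCII delimiter can never match in the middle of a sequence, and the scan needs no decoding.

    // Sketch only: scan for an ASCII delimiter on raw code units.
    // Multi-byte UTF-8 sequences contain no bytes below 0x80, so
    // comparing against an ASCII character is always safe.
    size_t skipToQuote(string s, size_t i)
    {
        while (i < s.length && s[i] != '"')
            ++i;                // no decoding needed
        return i;
    }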

You have a good point here. I would have immediately thrown out the loop AFTER profiling. What hits me is that I ended up with an incorrect program despite using the built-in Unicode-aware strings. That is counterintuitive when the rest of the std library handles Unicode correctly, and even more so given the complementary operation of appending any char type to a string, which does transcode.
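To make that last point concrete, here is a small sketch (not from the original mail): appending transcodes automatically, while foreach only decodes when the element type is explicitly dchar.

    import std.stdio : writeln;

    void main()
    {
        string s = "abc";
        dchar d = 'ü';
        s ~= d;                    // appending a dchar encodes it to UTF-8

        foreach (c; s)             // default element type is char (code units)
            writeln(cast(int) c);  // the 'ü' shows up as two code units

        foreach (dchar c; s)       // asking for dchar forces decoding
            writeln(cast(int) c);  // now 'ü' is a single code point
    }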

martin
