On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:

On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
<simen.kja...@gmail.com> wrote:

On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
<schvei...@yahoo.com> wrote:

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not
enough.
You seem to advocate decoding to graphemes, which is a whole different
matter.

I am advocating that. And it's a matter of perception. D can say "we
only support code-point decoding" and what that means to a user is, "we
don't support language as you know it." Sure it's a part of unicode, but
it takes that extra piece to make it actually usable to people who
require unicode.

Even in English, fiancé has an accent. To say D supports Unicode when it won't do a simple search on a file containing a certain *valid* encoding of that word is disingenuous, to say the least.

Why doesn't that simple search work?

foreach (line; stdin.byLine()) {
    if (line.canFind("fiancé")) {
        writeln("There it is.");
    }
}

I think Jonathan answered that quite well, nothing else to add...


D needs a fully Unicode-aware string type. I advocate that D should use it as the default string type, but it needs one, whether it's the default or not, in order to say it supports Unicode.

How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that at first I tried to figure out what sin(string) (as in sine) means :)

I think there are two problems with [] and .length. First, that they imply "get nth character" and "number of characters" respectively, and second, that many times they *actually are* those things.

So I agree with you the proposed string type needs to curb that interface, while giving us a fully character/grapheme aware interface (which is currently lacking).
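The gap between "number of code points" and "number of characters as a reader counts them" is easy to demonstrate. In this Python sketch (where `len` and indexing operate on code points, roughly analogous to a dstring's `.length` and `[]`), both give answers that are correct at the code-point level yet wrong at the character level for a decomposed string:

```python
word = "fiance\u0301"   # "fiancé" encoded as 'e' + combining accent (NFD)

print(len(word))        # 7 code points, but only 6 user-perceived characters
print(word[5])          # 'e' -- the bare base letter, not the accented "é"
print(word[6])          # '\u0301' -- the combining accent by itself
```

This is the second problem above in action: `len` and `[]` really are "number of characters" and "nth character" for most strings, which makes the cases where they aren't all the more surprising.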

I made an early attempt at doing this, and I will eventually get around to finishing it. I was in the middle of creating an algorithm to delineate a grapheme as efficiently as possible when I got sidetracked by other things :) There are still lingering issues with the language which make this a less-than-ideal replacement (arrays currently enjoy a lot of "extra features" that custom types do not).
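To give a rough idea of what delineating graphemes involves: the simplest rule is that a combining mark attaches to the preceding base character. The sketch below (in Python, using `unicodedata.combining`) implements only that one rule; a real implementation would have to follow the full grapheme-cluster boundary rules of UAX #29 (Hangul jamo sequences, ZWJ emoji sequences, regional indicators, and so on), which is where the efficiency work lies:

```python
import unicodedata

def simple_graphemes(s):
    """Split s into grapheme-like clusters using only the rule
    'a combining mark joins the preceding character'.
    This is a deliberate simplification of UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to its base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

print(simple_graphemes("fiance\u0301"))
# ['f', 'i', 'a', 'n', 'c', 'e\u0301'] -- 6 clusters for 7 code points
```

Even this crude version is enough to make the earlier "fiancé" search and length examples behave the way a user expects, which is the case for exposing a grapheme-level interface on the string type.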

-Steve
