On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:

On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
<simen.kja...@gmail.com> wrote:

On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
<schvei...@yahoo.com> wrote:

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not
enough.
You seem to advocate decoding to graphemes, which is a whole different
matter.

I am advocating that. And it's a matter of perception. D can say "we
only support code-point decoding" and what that means to a user is, "we
don't support language as you know it." Sure it's a part of unicode, but
it takes that extra piece to make it actually usable to people who
require unicode.

Even in English, fiancé has an accent. To say D supports Unicode when it won't do a simple search on a file containing a certain *valid* encoding of that word is disingenuous, to say the least.

Why doesn't that simple search work?

foreach (line; stdin.byLine()) {
    if (line.canFind("fiancé")) {
        writeln("There it is.");
    }
}

I think Jonathan answered that quite well, nothing else to add...


D needs a fully Unicode-aware string type. I advocate that D should use it as the default string type, but it needs one, whether it's the default or not, in order to say it supports Unicode.

How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that at first I tried to figure out what sin(string) (as in sine) means :)

I think there are two problems with [] and .length. First, that they imply "get nth character" and "number of characters" respectively, and second, that many times they *actually are* those things.

So I agree with you the proposed string type needs to curb that interface, while giving us a fully character/grapheme aware interface (which is currently lacking).
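The gap between "number of code points" and "number of characters as a reader counts them" is easy to demonstrate. In this Python sketch (where `len` and indexing operate on code points, roughly analogous to a dstring's `.length` and `[]`), both give answers that are correct at the code-point level yet wrong at the character level for a decomposed string:

```python
word = "fiance\u0301"   # "fiancé" encoded as 'e' + combining accent (NFD)

print(len(word))        # 7 code points, but only 6 user-perceived characters
print(word[5])          # 'e' -- the bare base letter, not the accented "é"
print(word[6])          # '\u0301' -- the combining accent by itself
```

This is the second problem above in action: `len` and `[]` really are "number of characters" and "nth character" for most strings, which makes the cases where they aren't all the more surprising.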

I made an early attempt at doing this, and I will eventually get around to finishing it. I was in the middle of creating an algorithm to delineate a grapheme as efficiently as possible when I got sidetracked by other things :) There are still lingering issues with the language which make this a less-than-ideal replacement (arrays currently enjoy a lot of "extra features" that custom types do not).
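To give a rough idea of what delineating graphemes involves: the simplest rule is that a combining mark attaches to the preceding base character. The sketch below (in Python, using `unicodedata.combining`) implements only that one rule; a real implementation would have to follow the full grapheme-cluster boundary rules of UAX #29 (Hangul jamo sequences, ZWJ emoji sequences, regional indicators, and so on), which is where the efficiency work lies:

```python
import unicodedata

def simple_graphemes(s):
    """Split s into grapheme-like clusters using only the rule
    'a combining mark joins the preceding character'.
    This is a deliberate simplification of UAX #29."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to its base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

print(simple_graphemes("fiance\u0301"))
# ['f', 'i', 'a', 'n', 'c', 'e\u0301'] -- 6 clusters for 7 code points
```

Even this crude version is enough to make the earlier "fiancé" search and length examples behave the way a user expects, which is the case for exposing a grapheme-level interface on the string type.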

-Steve
