On 1/13/11 7:09 PM, Michel Fortin wrote:
On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> said:

On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:
Let's take a look:

// Incorrect string code
void fun(string s) {
    foreach (i; 0 .. s.length) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

// Incorrect string_t code
void fun(string_t!char s) {
    foreach (i; 0 .. s.codeUnits) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

Both functions are incorrect, albeit in different ways. The only
improvement I'm seeing is that the user needs to write codeUnits
instead of length, which may make her think twice. Clearly, however,
copiously incorrect code can be written with the proposed interface
because it tries to hide the reality that underneath a variable-length
encoding is being used, but doesn't hide it completely (albeit for
good efficiency-related reasons).

You might be looking at my previous version. The new version (recently
posted) will throw an exception for that code if a multi-code-unit
code-point is found.

I was looking at your latest. It's code that compiles and runs, but
dynamically fails on some inputs. I agree that it's often better to
fail noisily instead of silently, but in a manner of speaking the
string-based code doesn't fail at all - it correctly iterates the code
units of a string. This may sometimes not be what the user expected;
most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that, according to him, does grapheme-related stuff nobody else does. So apparently graphemes are not what people care about (although they might be what people should care about).

It also supports this:

foreach(i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to
specify dchar.
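
For reference, a minimal sketch of what "specify dchar" looks like with a plain string. The helper name codePointOffsets is made up for illustration and is not part of either proposal; the point is only that the index handed to the loop is a code-unit offset, so it can skip values:

```d
import std.stdio : writeln;

// Hypothetical helper: returns the code-unit offset at which each
// code point of s starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // Typing the element as dchar makes foreach decode UTF-8 on the fly;
    // i is the offset of the first code unit of each code point.
    foreach (i, dchar d; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a', U+00E9 (two UTF-8 code units), 'b'
    foreach (i, dchar d; "a\u00E9b")
        writeln("The character in position ", i, " is ", d);
}
```

With this input the reported positions are 0, 1 and 3, since U+00E9 occupies two code units.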

Except it breaks with combining characters. For instance, take the
string "t̃", which is two code points -- 't' followed by combining tilde
(U+0303) -- and you'll get the following output:

The character in position 0 is t
The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the
notion of code points when combining characters enter the equation.
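
To make the three levels concrete, here is a small sketch that counts code units, code points and graphemes for the same string. It assumes std.uni.byGrapheme from present-day Phobos, which as far as I know was not available when this thread was written; the three count* function names are hypothetical and exist only for this illustration:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Number of UTF-8 code units (what .length reports for a string).
size_t countCodeUnits(string s) { return s.length; }

// Number of code points (a string is decoded to dchar when walked as a range).
size_t countCodePoints(string s) { return s.walkLength; }

// Number of grapheme clusters, i.e. user-perceived characters.
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE
    writeln("code units:  ", countCodeUnits(s));
    writeln("code points: ", countCodePoints(s));
    writeln("graphemes:   ", countGraphemes(s));
}
```

For this string the counts are 3, 2 and 1 respectively: only the grapheme count matches what a reader would call one character.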

This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?


Thanks,

Andrei
