On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:
Let's take a look:

// Incorrect string code
void fun(string s) {
    foreach (i; 0 .. s.length) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

// Incorrect string_t code
void fun(string_t!char s) {
    foreach (i; 0 .. s.codeUnits) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

Both functions are incorrect, albeit in different ways. The only
improvement I'm seeing is that the user needs to write codeUnits
instead of length, which may make her think twice. Clearly, however,
copiously incorrect code can be written with the proposed interface
because it tries to hide the reality that underneath a variable-length
encoding is being used, but doesn't hide it completely (albeit for
good efficiency-related reasons).

You might be looking at my previous version. The new version (recently
posted) will throw an exception for that code if a multi-code-unit
code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.
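
To make the code unit versus code point distinction concrete, here is a minimal sketch in D. The helper names codeUnitCount and codePointCount are hypothetical and exist only for this illustration; the sketch assumes a current Phobos, where walkLength decodes a string one code point at a time.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helper: number of UTF-8 code units in s.
size_t codeUnitCount(string s)
{
    return s.length;
}

// Hypothetical helper: number of code points, found by decoding s.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    // "héllo": U+00E9 takes two UTF-8 code units.
    string s = "h\u00E9llo";
    writeln("code units:  ", codeUnitCount(s));
    writeln("code points: ", codePointCount(s));
    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);
}
```

Indexing s[i] over 0 .. s.length visits all six code units, so the two halves of the encoded U+00E9 are reported as separate "characters".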

That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.


It also supports this:

foreach (i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

Well, string supports that too, albeit with the nit that you need to specify dchar.
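
As a sketch of what that looks like with a plain string: declaring the loop variable as dchar makes foreach decode code points, and the index is the offset of each code point's first code unit, so it can skip values. The helper name codePointStarts is hypothetical and only collects those offsets for illustration.

```d
import std.stdio : writeln;

// Hypothetical helper: offsets (in code units) at which each code point starts.
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    foreach (i, dchar d; s)
    {
        writeln("The character in position ", i, " is ", d);
        starts ~= i;
    }
    return starts;
}

void main()
{
    // 'a' is at offset 0, U+00E9 occupies offsets 1 and 2, 'b' is at offset 3.
    immutable size_t[] expected = [0, 1, 3];
    assert(codePointStarts("a\u00E9b") == expected);
}
```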

Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output:

        The character in position 0 is t
        The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the notion of code points when combining characters enter the equation.
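
The three levels can be compared side by side for the string above. This is a minimal sketch that assumes a current Phobos, where std.uni provides byGrapheme (it was not available when this message was written); the helper name graphemeCount is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 't' followed by U+0303 COMBINING TILDE.
    string s = "t\u0303";
    writeln("code units:  ", s.length);
    writeln("code points: ", s.walkLength);
    writeln("graphemes:   ", graphemeCount(s));
    assert(s.length == 3);          // 't' is 1 code unit, U+0303 is 2
    assert(s.walkLength == 2);      // two code points
    assert(graphemeCount(s) == 1);  // one user-perceived character
}
```

Only the grapheme count matches what a reader would call the number of characters.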


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
