On 09/19/2011 04:43 PM, Steven Schveighoffer wrote:
On Mon, 19 Sep 2011 10:24:33 -0400, Timon Gehr <timon.g...@gmx.ch> wrote:

On 09/19/2011 04:02 PM, Steven Schveighoffer wrote:

So I think it's not only limiting to require x.length to be $, it's very
wrong in some cases.

Also, think of a string. It has no length (well technically, it does,
but it's not the number of elements), but it has a distinct end point. A
properly written string type would fail to compile if $ was s.length.


But you'd have to compute the length anyways in the general case:

str[0..$/2];

Or am I misunderstanding something?


That's half the string in code units, not code points.

If string was properly implemented, this would fail to compile. $ is not
the length of the string range (meaning the number of code points). The
given slice operation might actually create an invalid string.
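To make the hazard concrete, here is a minimal D sketch (a hypothetical illustration, not code from the thread; it uses `validate` from `std.utf` to check the slice):

```d
import std.stdio;
import std.utf : validate, UTFException;

void main()
{
    string s = "日本語";      // 3 code points, but 9 UTF-8 code units
    auto half = s[0 .. $/2]; // $ is the code-unit length, so this is s[0 .. 4]
    try
    {
        validate(half);      // throws: the slice cuts '本' mid-sequence
        writeln("valid UTF-8");
    }
    catch (UTFException e)
        writeln("invalid UTF-8 slice");
}
```

Because `$` counts code units, `s[0 .. $/2]` here takes the first four bytes: all of '日' plus one byte of '本', which is not a well-formed UTF-8 string.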

Programmers have to be aware of that if they want efficient code that deals with Unicode. I think having random access to the code units while being able to iterate per code point is fine, because it gives you the best of both worlds. Manually decoding a string and slicing it at positions that were remembered to be safe has been good enough for me; at least it is efficient.


It's tricky, because you want fast slicing, but only certain slices are
valid. I once created a string type that used a char[] as its backing,
but actually implemented the limitations that std.range tries to enforce
(but cannot). It's somewhat of a compromise. If $ was mapped to
s.length, it would fail to compile, but I'm not sure what I *would* use
for $. It actually might be the code units, which would not make the
above line invalid.

-Steve

Well, it would have to be consistent for a string type that "does it right". Either the string is indexed with code units or it is indexed with code points, and the other option should be provided separately. Dollar should just be the length of whatever is used for indexing/slicing, and having that differ from length makes for a somewhat awkward interface imho.

Btw, D double-quoted string literals let you define invalid byte sequences with e.g. octal escapes:
string s = "\377";

What would be the use cases for that? Shouldn't \377 map to the extended ASCII charset instead and yield the same code point that would be given in C double-quoted strings?
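For contrast, a small sketch (assuming the octal-escape semantics described above: \377 inserts the raw byte 0xFF, which never occurs in well-formed UTF-8, whereas code point U+00FF, 'ÿ' in Latin-1, encodes as two code units):

```d
void main()
{
    string s = "\377";                     // one raw code unit, 0xFF
    assert(s.length == 1 && s[0] == 0xFF); // not valid UTF-8

    string t = "\u00FF";                   // the code point U+00FF ('ÿ')
    assert(t.length == 2 && t[0] == 0xC3 && t[1] == 0xBF); // proper UTF-8
}
```

So the octal escape writes bytes, not characters, which is what makes it possible to construct invalid sequences in the first place.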



