Apparently my previous post was lost. Apologies if this comes out twice.

On 12/28/2011 09:39 PM, Jonathan M Davis wrote:
On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
Why? char and wchar are Unicode code units, ubyte/ushort are unsigned
integrals. It is clear that char/wchar are a better match.

It's an issue of the correct usage being the easy path. As it stands, it's
incredibly easy to use narrow strings incorrectly. By forcing any array of
char or wchar to use .rep.length instead of .length, the relatively automatic
(and generally incorrect) usage of .length on a string wouldn't immediately
work. It would force you to work more at doing the wrong thing. Unfortunately,
walkLength isn't necessarily any easier than .rep.length, but it does force
people to look into why they can't do .length, which will generally better
educate them and will hopefully reduce the misuse of narrow strings.
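
For illustration, the gap between the two lengths can be sketched as below. This is a minimal example and not part of the original exchange: the helper names codeUnits and codePoints are hypothetical, and the proposed .rep property is left out because it is only a proposal in this thread.

```d
import std.range : walkLength;

// Hypothetical helpers, named only for this illustration.

// Number of UTF-8 code units: this is what .length reports on a string.
size_t codeUnits(string s)
{
    return s.length;
}

// Number of code points: walkLength iterates the string as a range of dchar.
size_t codePoints(string s)
{
    return walkLength(s);
}

void main()
{
    string s = "häll"; // 'ä' takes two UTF-8 code units
    assert(codeUnits(s) == 5);
    assert(codePoints(s) == 4);
}
```

The two results agree only for pure ASCII text, which is presumably why the misuse tends to go unnoticed.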


I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.

If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then
we reinforce the fact that you shouldn't operate on chars or wchars.

There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.
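
A small sketch of what code-unit-level slicing looks like in practice, assuming a hypothetical helper named before (not from any library). The offset comes from a search, so it is known to fall on a code point boundary and the slice needs no decoding.

```d
import std.string : indexOf;

// Hypothetical helper: the part of `s` before the first `sep`,
// or all of `s` when `sep` does not occur.
string before(string s, string sep)
{
    // indexOf reports a code unit offset (or -1), so it can be used
    // directly as a slice bound without walking the string again.
    immutable i = indexOf(s, sep);
    return i < 0 ? s : s[0 .. i];
}

void main()
{
    assert(before("größe=42", "=") == "größe");
    assert(before("no separator", "=") == "no separator");
}
```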

It also
makes it simple for the compiler to never allow you to use length on char[] or
wchar[], since it doesn't have to worry about whether you got that char[] or
wchar[] from a rep property or not.

Now, I don't know if this is really a good move at this point. If we were to
really do this right, we'd need to disallow indexing and slicing of the char[]
and wchar[] as well, which would break that much more code. It also pretty
quickly makes it look like string should be its own type rather than an array,
since it's acting less and less like an array.

Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.

Not to mention, even the
correct usage of .rep would become rather irritating (e.g. slicing it when you
know that the indices that you're dealing with aren't going to cut into any
code points), because you'd have to cast from ubyte[] to char[] whenever you
did that.
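
To make the irritation concrete, here is a rough sketch. It assumes std.string.representation as a stand-in for the proposed .rep property (for a string it yields immutable(ubyte)[]); prefixViaRep is a hypothetical helper name.

```d
import std.string : representation;

// Hypothetical helper: the first `n` code units of `s`, taken through the
// raw representation. The caller must already know that `n` falls on a
// code point boundary.
string prefixViaRep(string s, size_t n)
{
    // Stand-in for the proposed .rep: immutable(ubyte)[] for a string.
    immutable(ubyte)[] raw = representation(s);
    // The slice is still immutable(ubyte)[], so a cast is needed to get
    // a string back.
    return cast(string) raw[0 .. n];
}

void main()
{
    // "naïve" occupies 6 code units because 'ï' takes two.
    assert(prefixViaRep("naïve text", 6) == "naïve");
}
```

The cast is not checked, so this pattern moves the correctness burden onto the caller rather than removing it.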

So, I think that the general sentiment behind this is a good one, but I don't
know if the exact idea is ultimately a good one - particularly at this stage
in the game. If we're going to make a change like this which would break as
much code as this would, we'd need to be _very_ certain that it's what we want
to do.

- Jonathan M Davis

I agree.
