Don wrote:
Andrei Alexandrescu wrote:
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.

Yes, I presume he was mostly looking for a justification of ideas
he has already accepted and even partially implemented :-)

I am ready to throw away the implementation as soon as a better idea comes around. As at other times, I made the change to see how things feel with the new approach.

Has any thought been given to foreach? Currently all these work for strings:

    foreach (c; "abc") { } // typeof(c) is 'char'
    foreach (char c; "abc") { }
    foreach (wchar c; "abc") { }
    foreach (dchar c; "abc") { }

I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If ranges use code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should still be allowed to iterate using 'char' and 'wchar' too.
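To make the difference concrete, here is a minimal sketch of what the two iteration styles count for the same string. The helper names countImplicit and countAsDchar are made up for illustration, and the sketch assumes the behaviour described above, where the implicit element type is the code unit.

```d
import std.stdio : writeln;

// Hypothetical helper: iterates with the implicit element type,
// which for a string is the UTF-8 code unit.
size_t countImplicit(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asks foreach to decode to code points.
size_t countAsDchar(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+00E9 takes two code units in UTF-8, so the counts differ.
    string s = "h\u00E9llo";
    writeln(countImplicit(s)); // 6 code units
    writeln(countAsDchar(s));  // 5 code points
}
```

If the implicit type became 'dchar', both helpers would report 5 for this input.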

I think this would fit nicely. When I was first learning D, I was surprised to find that foreach didn't do this and that I had to ask for it explicitly.

This is a good point. I'm in favor of changing the language to make the implicit type dchar.

Andrei

We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression.

That is a good way to look at things.
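A rough illustration of the "different levels of compression" view: the same single code point stored in the three string types has three different .length values, while walking each as a range yields one element. This assumes a Phobos in which strings are ranges of dchar, as discussed in this thread; std.range.walkLength is used only to count range elements.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // One code point outside the Basic Multilingual Plane.
    string  s8  = "\U0001F600";
    wstring s16 = "\U0001F600"w;
    dstring s32 = "\U0001F600"d;

    // .length counts code units, so it depends on the encoding.
    writeln(s8.length);  // 4 (UTF-8)
    writeln(s16.length); // 2 (UTF-16 surrogate pair)
    writeln(s32.length); // 1 (UTF-32)

    // Walking the strings as ranges counts decoded elements instead.
    writeln(s8.walkLength);
    writeln(s16.walkLength);
    writeln(s32.walkLength);
}
```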

It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid Unicode string.

Well, as has been mentioned, sometimes you may assemble a string out of individual characters. That case is probably rare enough to warrant a cast. Note that today char is already convertible to dchar (there's no checking).
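The concern about char[] ~ char can be sketched as follows. The helpers appendRaw, appendEncoded and isValidUtf8 are hypothetical names for illustration; std.utf.validate and std.utf.encode are the only library calls relied on.

```d
import std.stdio : writeln;
import std.utf : encode, validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Appends a single code unit as-is, with no encoding step.
char[] appendRaw(const(char)[] s, char c)
{
    char[] r = s.dup;
    r ~= c;
    return r;
}

// Encodes the code point to UTF-8 first, then appends the result.
char[] appendEncoded(const(char)[] s, dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);
    char[] r = s.dup;
    r ~= buf[0 .. n];
    return r;
}

void main()
{
    // 0xE9 alone is a UTF-8 lead byte with no continuation bytes.
    writeln(isValidUtf8(appendRaw("abc", cast(char) 0xE9))); // false
    writeln(isValidUtf8(appendEncoded("abc", 0xE9)));        // true
}
```

Appending an ASCII char through appendRaw stays valid, which is why assembling strings from individual characters often appears to work.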

I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.

What we can do is to have to!(const ubyte[]) work for all UTF8 strings and to!(const ushort[]) work for all UTF16 strings. That view is correct and safe. Also, it's not difficult to add a .codepoints pseudo-property.
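As a sketch of what such a read-only code unit view looks like, the example below uses std.string.representation, which is not the to! conversion proposed above but exposes the same kind of view: immutable(ubyte)[] for a string and immutable(ushort)[] for a wstring.

```d
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string  s8  = "h\u00E9";
    wstring s16 = "h\u00E9"w;

    // Views of the underlying code units; no copy, no decoding.
    immutable(ubyte)[]  bytes = s8.representation;
    immutable(ushort)[] words = s16.representation;

    immutable ubyte[]  expected8  = [0x68, 0xC3, 0xA9];
    immutable ushort[] expected16 = [0x68, 0xE9];

    writeln(bytes == expected8);  // true
    writeln(words == expected16); // true
}
```

Because the views are immutable, code holding one cannot write arbitrary bytes into the string, which matches the "correct and safe" remark above.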


Andrei
