Don wrote:
Andrei Alexandrescu wrote:
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As at other times, I made the change to see how
things feel with the new approach.
Has any thought been given to foreach? Currently all these work for
strings:
    foreach (c; "abc") { } // typeof(c) is 'char'
    foreach (char c; "abc") { }
    foreach (wchar c; "abc") { }
    foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is
implicit. The implicit element type is (currently) the code unit. If
ranges use code points ('dchar') as the element type, then I think
foreach needs to be changed so that the default element type is
'dchar' too (in the first line of my example). Having ranges and
foreach disagree on this would be very inconsistent. Of course you
should be allowed to iterate using 'char' and 'wchar' too.
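
To make the difference concrete, here is a small sketch of how the loop
variable's type changes what foreach visits. The sample string and the
helper names (countImplicit, countAs) are made up for illustration and are
not from this thread. The counts assume the current behaviour, where the
implicit element type is still the code unit.

```d
import std.stdio : writeln;

// Hypothetical helper: counts iterations when the element type is implicit.
size_t countImplicit(string s)
{
    size_t n;
    foreach (c; s) // implicit type: one iteration per UTF-8 code unit
        ++n;
    return n;
}

// Hypothetical helper: counts iterations for an explicit element type C.
size_t countAs(C)(string s)
{
    size_t n;
    foreach (C c; s) // foreach transcodes the UTF-8 data to C on the fly
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 occupies two UTF-8 code units

    writeln(countImplicit(s));  // 6
    writeln(countAs!char(s));   // 6
    writeln(countAs!wchar(s));  // 5
    writeln(countAs!dchar(s));  // 5
}
```

Under the proposed change, the first call would presumably report 5 as
well, matching what a dchar-based range over the same string yields.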
I think this would fit nicely. I was surprised at first when learning
D and I noticed that foreach didn't do this, that I had to explicitly
ask for it.
This is a good point. I'm in favor of changing the language to make
the implicit type dchar.
Andrei
We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
That is a good way to look at things.
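
A rough illustration of the "different levels of compression" view: the
same two code points stored in the three encodings. The sample text is an
arbitrary choice, and the walkLength result assumes Phobos decodes narrow
strings to dchar when they are treated as ranges.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // 'a' followed by U+1D11E, a code point outside the BMP.
    string  s8  = "a\U0001D11E";
    wstring s16 = s8.to!wstring;
    dstring s32 = s8.to!dstring;

    // .length counts code units, so it differs per encoding.
    writeln(s8.length);  // 5 (1 + 4 UTF-8 code units)
    writeln(s16.length); // 3 (1 + a surrogate pair)
    writeln(s32.length); // 2

    // Walked as ranges of dchar, all three hold the same two elements.
    assert(s8.walkLength == 2);
    assert(s16.walkLength == 2);
    assert(s32.walkLength == 2);
}
```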
It makes me wonder if the char and wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid unicode string.
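
A minimal sketch of that failure mode, assuming std.utf.validate is used
as the check. The helper isValidUtf8 and the sample bytes are made up for
illustration.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char[] s = "abc".dup;
    s ~= 'd';                 // ASCII code unit: the string stays valid
    writeln(isValidUtf8(s));  // true

    s ~= cast(char) 0xC3;     // lone lead byte of a two-byte sequence
    writeln(isValidUtf8(s));  // false

    char[] t = "abc".dup;
    dchar d = '\u00E9';
    t ~= d;                   // appending a dchar encodes it as UTF-8
    writeln(t.length);        // 5
    writeln(isValidUtf8(t));  // true
}
```

Appending a raw char is a byte-level operation, while appending a dchar
goes through encoding, which is why only the latter is guaranteed to keep
the string valid.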
Well, as has been mentioned, sometimes you may assemble a string out of
individual characters. Probably that case is rare enough to warrant a
cast. Note that today char is already convertible to dchar (there's no
checking).
I suspect that string and wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] or ushort[]
reference (respectively) to the data. It's too late, of course. The extra
value you get by having a specific type for 'this is a code unit of a
UTF8 string' seems to be very minor, compared to just using a ubyte.
What we can do is to have to!(const ubyte[]) work for all UTF8 strings
and to!(const ushort[]) work for all UTF16 strings. That view is correct
and safe. Also, it's not difficult to add a .codepoints pseudo-property.
Andrei
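
For what it's worth, here is one way such a view could look. This is only
a sketch: the codepoints functions below are hypothetical, named after the
proposal in this thread, and whether to!(const ubyte[]) accepts strings is
not something this example relies on. It uses a plain cast plus
std.string.representation, which exists in current Phobos. Note that in
Unicode terms the elements of the view are code units, not code points.

```d
import std.stdio : writeln;
import std.string : representation;

// Hypothetical pseudo-properties: reinterpret the storage without copying.
const(ubyte)[] codepoints(const(char)[] s)
{
    return cast(const(ubyte)[]) s;
}

const(ushort)[] codepoints(const(wchar)[] s)
{
    return cast(const(ushort)[]) s;
}

void main()
{
    string  s = "h\u00E9";
    wstring w = "h\u00E9"w;

    writeln(s.codepoints); // [104, 195, 169]
    writeln(w.codepoints); // [104, 233]

    // Phobos offers the same kind of view under another name.
    assert(s.representation == s.codepoints);
    assert(w.representation == w.codepoints);
}
```

Returning a const view keeps it safe in the sense Andrei describes: the
caller can read the raw data but cannot write invalid UTF through it.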