Re: Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu Thu, 04 Feb 2010 15:20:30 -0800

Don wrote:

Andrei Alexandrescu wrote:
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.
Has any thought been given to foreach? Currently all these work for strings:
    foreach (c; "abc") { } // typeof(c) is 'char'
    foreach (char c; "abc") { }
    foreach (wchar c; "abc") { }
    foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range uses code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too.
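A minimal sketch of the difference, assuming the behavior described above in which the implicit foreach element type is the code unit. The helper names countImplicit and countAs are hypothetical and exist only for this illustration.

```d
import std.stdio : writeln;

// Number of loop iterations when foreach infers the element type itself.
size_t countImplicit(string s)
{
    size_t n = 0;
    foreach (c; s) // element type comes from the array: a UTF-8 code unit
        ++n;
    return n;
}

// Number of loop iterations when the element type T is requested explicitly.
size_t countAs(T)(string s)
{
    size_t n = 0;
    foreach (T c; s) // foreach decodes to wchar or dchar on request
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 occupies two UTF-8 code units

    writeln(countImplicit(s)); // 6: one iteration per code unit
    writeln(countAs!char(s));  // 6: same as the implicit case
    writeln(countAs!wchar(s)); // 5: one UTF-16 code unit per character here
    writeln(countAs!dchar(s)); // 5: one iteration per code point
}
```

If ranges over strings yield dchar, the first and last counts are the ones that would disagree between foreach and the range interface.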
I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly ask for it.
This is a good point. I'm in favor of changing the language to make the implicit type dchar.
Andrei
We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression.


That is a good way to look at things.

It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string.
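A small sketch of the hazard, under the assumption that std.utf.validate and std.utf.encode from Phobos are available. The helpers isValidUtf8, appendUnit and appendPoint are hypothetical names used only here.

```d
import std.utf : encode, validate, UTFException;

// True if s is well-formed UTF-8 according to std.utf.validate.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// Appends a raw code unit: the byte-level concatenation described above.
char[] appendUnit(const(char)[] s, char c)
{
    return s ~ c;
}

// Appends a code point by encoding it to UTF-8 first.
char[] appendPoint(const(char)[] s, dchar d)
{
    char[4] buf;
    immutable n = encode(buf, d); // number of code units written
    return s ~ buf[0 .. n];
}

void main()
{
    // 0xE9 alone is a UTF-8 lead byte with no continuation bytes after it.
    auto broken = appendUnit("abc", cast(char) 0xE9);
    assert(!isValidUtf8(broken));

    // The same value treated as the code point U+00E9 encodes to two units.
    auto fine = appendPoint("abc", '\u00E9');
    assert(isValidUtf8(fine));
    assert(fine.length == 5);
}
```

Appending an ASCII char is harmless either way; the problem only shows up for code units at or above 0x80.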

Well as it's been mentioned, sometimes you may assemble a string out of individual characters. Probably that case is rare enough to warrant a cast. Note that today char is already convertible to dchar (there's no checking).

I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.

What we can do is to have to!(const ubyte[]) work for all UTF8 strings and to!(const ushort[]) work for all UTF16 strings. That view is correct and safe. Also, it's not difficult to add a .codepoints pseudo-property.
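A rough sketch of such a read-only view, written with plain reinterpreting casts rather than the to! conversions proposed above (whether to! accepts these types is not assumed here). The helper names utf8Units and utf16Units are hypothetical.

```d
import std.stdio : writeln;

// Read-only view of the UTF-8 code units behind a string; no copy is made.
const(ubyte)[] utf8Units(const(char)[] s)
{
    return cast(const(ubyte)[]) s;
}

// Read-only view of the UTF-16 code units behind a wstring; no copy is made.
const(ushort)[] utf16Units(const(wchar)[] s)
{
    return cast(const(ushort)[]) s;
}

void main()
{
    const(ubyte)[] expected8 = [0x68, 0xC3, 0xA9];
    const(ushort)[] expected16 = [0x68, 0xE9];

    // "h" followed by U+00E9: three UTF-8 units, two UTF-16 units.
    assert(utf8Units("h\u00E9") == expected8);
    assert(utf16Units("h\u00E9"w) == expected16);

    writeln(utf8Units("h\u00E9").length);
    writeln(utf16Units("h\u00E9"w).length);
}
```

Because the views are const, code holding them cannot write a stray byte into the underlying string. Later Phobos releases offer std.string.representation for the same purpose.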



Andrei
