Re: Making all strings UTF ranges has some risk of WTF

Michel Fortin Thu, 04 Feb 2010 19:45:19 -0800

On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu <seewebsiteforem...@erdani.org> said:

Rainer Deyke wrote:

Don wrote:

I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.


If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely.  If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.


One idea I've had for a while was to have a universal string type:

struct UString {
     union {
         char[] utf8;
         wchar[] utf16;
         dchar[] utf32;
     }
     enum Discriminator { utf8, utf16, utf32 };
     Discriminator kind;
     IntervalTree!(size_t) skip;
     ...
}

That's a nice concept, but it seems to me that it adds much overhead to improve a rather niche area. It's not often that you need to access characters by index. Generally when you need to, it's because you've already parsed the string and want to return to a previous location, in which case you'd do better, when you first parse, to just save the range or the index in code units rather than the index in code points.
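A minimal sketch of saving a code-unit index during a first pass, so that returning to the location later is a plain slice with no rescanning. The helper name `findCodeUnitIndex` and the sample text are made up for illustration; `std.utf.decode` is the only library call used.

```d
import std.utf : decode;

// Hypothetical helper: walks the string one code point at a time and
// returns the code-unit index at which `target` starts, or -1 if absent.
ptrdiff_t findCodeUnitIndex(string s, dchar target)
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;         // code-unit offset of this code point
        immutable c = decode(s, i);  // decodes one code point and advances i
        if (c == target)
            return start;
    }
    return -1;
}

void main()
{
    // "naïve → done", written with escapes so the encoding is unambiguous
    string text = "na\u00EFve \u2192 done";

    // First pass: remember where the arrow starts, in code units.
    auto pos = findCodeUnitIndex(text, '\u2192');
    assert(pos == 7);

    // Later: jump straight back with an O(1) slice.
    assert(text[pos .. $] == "\u2192 done");
}
```

Had the position been stored as a code point index (6 here), getting back to it would mean decoding the string again from the start.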

But I have to say I'm quite satisfied with the way D handles strings in general. Easy access to code points and direct access to the data is quite handy. I think it fits very well with a low-level language.
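For illustration, a small sketch of the two views of the same string in current D: iterating with `char` visits raw code units, while iterating with `dchar` makes `foreach` decode code points on the fly. The function names are hypothetical.

```d
// Hypothetical helpers contrasting the two ways of walking a string.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)   // direct access: one iteration per UTF-8 code unit
        ++n;
    return n;
}

size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)  // foreach decodes UTF-8 into code points
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";          // "héllo": é takes two code units
    assert(s.length == 6);            // .length is in code units
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
}
```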

I'd say in general when manipulating strings I rarely need to bother about code points. Most of the time I'm just searching for ASCII-range markers when parsing so I can search for them directly as code units, not bothering at all about multi-byte characters. That's why I'm a little wary about your changes. If I'm looking for a substring then I can search by code units too. It's just for the more fancy stuff (case-insensitive searching, character transformation) that it becomes necessary to work with code points.
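A sketch of why this works for UTF-8: every code unit of a multi-byte sequence has its high bit set, so an ASCII marker can never match in the middle of another character, and a well-formed substring can only match at character boundaries. The helper `indexOfAscii` and the sample text are made up for illustration; `representation` and `countUntil` are existing Phobos functions.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;

// Hypothetical helper: finds an ASCII marker by scanning code units only.
ptrdiff_t indexOfAscii(const(char)[] s, char marker)
{
    assert(marker < 0x80, "marker must be in the ASCII range");
    foreach (i, char c; s)   // iterates code units, no decoding
        if (c == marker)
            return i;
    return -1;
}

void main()
{
    string text = "k\u00F6nig=v\u00E4rde";   // "könig=värde"

    // ASCII marker search over code units; ö occupies indices 1 and 2.
    auto eq = indexOfAscii(text, '=');
    assert(eq == 6);
    assert(text[0 .. eq] == "k\u00F6nig");
    assert(text[eq + 1 .. $] == "v\u00E4rde");

    // Substring search over the raw bytes, again without decoding.
    auto at = countUntil(text.representation, "v\u00E4r".representation);
    assert(at == 7);
}
```

The results are code-unit indices, so they can be used directly for slicing.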

That's why I'm a little wary about your changes in that area: I fear it'll make the common case of working with code units more difficult to deal with. But I won't judge before I see.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
