On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> said:
Rainer Deyke wrote:
Don wrote:
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely. If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:
struct UString {
    union {
        char[] utf8;
        wchar[] utf16;
        dchar[] utf32;
    }
    enum Discriminator { utf8, utf16, utf32 };
    Discriminator kind;
    IntervalTree!(size_t) skip;
    ...
}
That's a nice concept, but it seems to me that it adds much overhead to
improve a rather niche area. It's not often that you need to access
characters by index. Generally, when you need to, it's because you've
already parsed the string and want to return to a previous location, in
which case you'd do better, when you first parse, to just save the range
or the index in code units rather than the index in code points.
But I have to say I'm quite satisfied with the way D handles strings in
general. Easy access to code points and direct access to the data are
quite handy. I think it fits very well with a low-level language.
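A small sketch of both kinds of access, as I understand the current
behaviour (countCodePoints is just a name I made up for the example):
foreach with a dchar loop variable decodes code points, while .length
and a cast to ubyte[] expose the raw code units.

```d
/// Counts code points: a foreach with a dchar loop variable decodes the
/// UTF-8 data one code point at a time.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    string s = "na\u00EFve";                    // "naïve"
    assert(s.length == 6);                      // .length counts code units
    assert(countCodePoints(s) == 5);            // decoded iteration
    auto raw = cast(immutable(ubyte)[]) s;      // direct view of the bytes
    assert(raw[2] == 0xC3 && raw[3] == 0xAF);   // U+00EF is C3 AF in UTF-8
}
```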
I'd say that in general, when manipulating strings, I rarely need to
bother about code points. Most of the time I'm just searching for
ASCII-range markers when parsing, so I can search for them directly as
code units, not bothering at all about multi-byte characters. If I'm
looking for a substring then I can search by code units too. It's just
for the more fancy stuff (case-insensitive searching, character
transformation) that it becomes necessary to work with code points.
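Here is roughly what the common case looks like, as a sketch (findAscii
is a hypothetical helper, not a Phobos function). It works without
decoding because every code unit of a multi-byte UTF-8 sequence has its
high bit set, so it can never compare equal to an ASCII marker:

```d
/// Hypothetical helper: code-unit index of the first occurrence of an
/// ASCII marker in s, or -1 if it is absent. No decoding is performed.
ptrdiff_t findAscii(const(char)[] s, char marker)
{
    assert(marker < 0x80, "marker must be in the ASCII range");
    foreach (i, char c; s)
        if (c == marker)
            return cast(ptrdiff_t) i;
    return -1;
}

unittest
{
    string line = "cl\u00E9: valeur";       // "clé: valeur"
    auto i = findAscii(line, ':');
    assert(i == 4);                         // U+00E9 takes two code units
    assert(line[0 .. i] == "cl\u00E9");     // key, sliced by code units
    assert(line[i + 1 .. $] == " valeur");  // rest of the line
    assert(findAscii(line, ';') == -1);     // marker not present
}
```

The same reasoning covers substring search: UTF-8 is self-synchronizing,
so a match on a valid needle's code units always starts on a code point
boundary.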
That's why I'm a little wary about your changes in that area: I fear
it'll make the common case of working with code units more difficult to
deal with. But I won't judge before I see.
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/