Re: Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu Wed, 03 Feb 2010 22:30:18 -0800

Rainer Deyke wrote:

Andrei Alexandrescu wrote:

Arrays of char and wchar are not quite generic - they are definitely UTF
strings.


A 'char' is a single utf-8 code unit.  A 'char[]' is (or should be) a
generic array of utf-8 code units.  Sometimes these code units line up
to form valid unicode code points, sometimes they don't.

If you want a data type that always contains a valid utf-8 string, don't
call it 'char[]'.  It's misleading, it breaks generic code, and it
renders built-in arrays useless for when you actually want an array of
utf-8 code units.  It's the same mistake as std::vector<bool> in C++,
but much worse.

I agree up to the assessment of the size of the problem and a couple of other points. I've had a great time writing UTF code in D with string. Getting back to C++'s std::string really put things in perspective.

If your purpose is to store some disparate UTF-8 code units (a need that I've never had), I see no problem with storing them as ubyte[].
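
As an illustration of that split (a minimal sketch, not from the original message): arbitrary bytes live in a ubyte[], and only data checked as well-formed is treated as a string. It assumes present-day Phobos, where std.utf.validate throws UTFException on malformed input. The helper name isWellFormedUtf8 and the byte values are made up for the example.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: reports whether raw bytes form well-formed UTF-8.
bool isWellFormedUtf8(const(ubyte)[] units)
{
    try
    {
        // Reinterpret the bytes as code units and let Phobos check them.
        validate(cast(const(char)[]) units);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte,
    // so these code units do not line up to form a code point.
    ubyte[] units = [0xC3, 0x28];
    assert(!isWellFormedUtf8(units));

    // U+00E9 is encoded as the two code units 0xC3 0xA9.
    string s = "\u00E9";
    assert(s.length == 2);
    assert(isWellFormedUtf8(cast(const(ubyte)[]) s));
}
```

Keeping the unchecked data in ubyte[] means nothing that expects a string can receive it without an explicit cast or a validation step.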



Andrei
