On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
But you can get a standalone code unit that is part of a coded sequence
quite easily

void foo(string s)
{
    auto x = s[0];  // x is a char, i.e. a single UTF-8 code unit
    dchar d = x;    // implicit char -> dchar conversion
}

I don't think we're disagreeing on anything.

I'm calling UTF-8 code units below 0x80 "standalone" code units. They're never part of multibyte sequences. Your _dchar_convert returns them unscathed.

Higher code units are always part of multibyte sequences (or invalid already). Your function returns invalid code points for them.

_dchar_convert does exactly what I meant, except that I had in mind returning the replacement character for non-standalone code units. But I see that that may not be feasible, and it's probably not necessary.
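
To make concrete what I had in mind, here is a minimal sketch. The name standaloneOrReplacement is made up for illustration; it is not the _dchar_convert from the quoted message, and I'm not proposing it as the actual implementation.

```d
// Hypothetical helper for illustration only.
dchar standaloneOrReplacement(char c) pure nothrow @nogc @safe
{
    // Code units below 0x80 are complete code points on their own.
    if (c < 0x80)
        return c; // char converts implicitly to dchar

    // Anything else is part of a multibyte sequence (or invalid),
    // so map it to U+FFFD REPLACEMENT CHARACTER.
    return '\uFFFD';
}
```

With this, 'a' comes back unchanged, while a lead byte such as 0xC3 comes back as U+FFFD.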

[...]
So we need the most efficient logic that does this:

if(c & 0x80)
     return wchar(0xd800 + c);

Is this going to be faster than returning a constant invalid wchar?

else
     return wchar(c);

More expensive, but more correct!
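
For comparison, the two options could be sketched side by side like this. Both function names are hypothetical, and 0xD800 is just one possible choice for the constant (an unpaired high surrogate). Whether either version is measurably faster is something only a benchmark could tell.

```d
// Variant from the quoted logic: offset non-ASCII code units into
// the surrogate range, giving values in 0xD880 .. 0xD8FF.
wchar toWcharOffset(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar)(0xD800 + c);
    return c; // char converts implicitly to wchar
}

// Variant asked about: one constant value for every non-ASCII code unit.
// 0xD800 is an assumption here; any unpaired surrogate would serve.
wchar toWcharConstant(char c) pure nothrow @nogc @safe
{
    if (c & 0x80)
        return cast(wchar) 0xD800;
    return c;
}
```

The offset variant keeps the original byte recoverable from the result, while the constant variant discards it.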

wchar to dchar conversion is pretty sound, as the surrogate pairs are
invalid code points for dchar.

-Steve
