On 6/3/16 3:52 PM, ag0aep6g wrote:
> On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
>> Except many chars *do* properly convert. This should work:
>>
>> char c = 'a';
>> dchar d = c;
>> assert(d == 'a');
> Yeah, that's what I meant by "standalone code unit". Code units that on
> their own represent a code point would not be touched.
But you can quite easily get an isolated code unit that is really part
of a multibyte sequence:
void foo(string s)
{
    auto x = s[0]; // grabs a single code unit, possibly mid-sequence
    dchar d = x;   // compiles, but d may be the wrong character entirely
}
As I mentioned in my earlier reply, some kind of "bounds checking" for
the conversion could be a possibility.
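A minimal sketch of what that check could look like (checkedConvert is
a made-up name, and throwing a UTFException is just one possible choice):

import std.utf : UTFException;

dchar checkedConvert(char c)
{
    // only standalone (ASCII) code units convert cleanly
    if (c & 0x80)
        throw new UTFException("char is part of a multibyte sequence");
    return c;
}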
>> Hm... an interesting possibility:
>>
>> dchar _dchar_convert(char c)
>> {
>>     return cast(int)cast(byte)c; // get sign extension for non-ASCII
>> }
> So when the char's most significant bit is set, this fills the upper
> bits of the dchar with 1s, right? And a set most significant bit in a
> char means it's part of a multibyte sequence, while in a dchar it means
> that the dchar is invalid, because they only go up to U+10FFFF. Huh. Neat.
An interesting thing is that I think the CPU can do this for us.
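For instance (an untested sketch using the _dchar_convert above):

void main()
{
    assert(_dchar_convert('a') == 'a'); // ASCII passes through unchanged
    dchar d = _dchar_convert('\xc3');   // lead byte of a multibyte sequence
    assert(d > 0x10FFFF);               // sign extension makes it invalid
}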
> Does it work for char -> wchar, too?
It does not. 0xffff is a valid code point, and I think so are all the
other values that would result. In fact, I think there are no invalid
code units for wchar. Of course, a surrogate code unit requires its
partner to be valid, so we can at least promote a non-ASCII char into
the surrogate range (and always into only one half of it, so a naive
transcoding of a char range to wchar will produce an invalid sequence
whenever any non-ASCII characters are present).
So we need the most efficient logic that does this:

wchar _wchar_convert(char c)
{
    if (c & 0x80)
        return cast(wchar)(0xd800 + c); // lands in the high surrogate range
    return c;
}
More expensive, but more correct!
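To check that the result really is detectably invalid (assuming
std.utf.isValidDchar as the validity test):

import std.utf : isValidDchar;

void main()
{
    wchar w = _wchar_convert('\xc3');   // any non-ASCII code unit
    assert(w >= 0xD880 && w <= 0xD8FF); // always a high surrogate
    assert(!isValidDchar(w));           // a lone surrogate is never valid
}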
wchar to dchar conversion is pretty sound, as surrogates are invalid
code points for dchar.
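That is, a lone surrogate survives the plain widening conversion and
stays detectable (again leaning on std.utf.isValidDchar):

import std.utf : isValidDchar;

void main()
{
    wchar w = 0xD800; // lone high surrogate
    dchar d = w;      // plain widening, no remapping needed
    assert(!isValidDchar(d));
}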
-Steve