RE: What does it mean to "not be a valid string in Unicode"?

Whistler, Ken Mon, 07 Jan 2013 13:41:27 -0800

Philippe Verdy said:

> Well then I don't know why you need a definition of an "Unicode 16-bit
> string". For me it just means exactly the same as "16-bit string", and
> the encoding in it is not relevant given you can put anything in it
> without even needing to be conformant to Unicode. So a Java string is
> exactly the same, a 16-bit string. The same also as Windows API 16-bit
> strings, or "wide strings" in a C compiler where "wide" is mapped by a
> compiler option to 16-bit code units for wchar_t ...


And elaborating on Mark's response a little:

[0x0061,0x0062,0x4E00,0xFFFF,0x0410]

Is a "Unicode 16-bit string". It contains "a", "b", a Han character, a 
noncharacter, and a Cyrillic character.

Because it is also well-formed as UTF-16, it is also a "UTF-16 string", by the 
definitions in the standard. 

[0x0061,0xD800,0x4E00,0xFFFF,0x0410]

Is a "Unicode 16-bit string". It contains "a", a high-surrogate code unit, a 
Han character, a noncharacter, and a Cyrillic character.

Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is 
*NOT* a "UTF-16 string".
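
As a rough illustration (a sketch, not text from the standard), the test that separates the two examples above can be written in a few lines of Python. The function name is_well_formed_utf16 is made up for this example. It assumes the input is a list of integers in the range 0x0000..0xFFFF and only checks surrogate pairing, which is all that well-formedness requires for 16-bit code units; noncharacters such as 0xFFFF pass.

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16.

    Hypothetical helper for illustration: every high surrogate
    (0xD800..0xDBFF) must be followed immediately by a low surrogate
    (0xDC00..0xDFFF), and no low surrogate may appear on its own.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


first = [0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]
second = [0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]
```

Under these assumptions, is_well_formed_utf16(first) is true and is_well_formed_utf16(second) is false, although both lists are Unicode 16-bit strings.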

On the other hand, consider:

[0x0061,0x0062,0x88EA,0x8440]

That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, 
and a Cyrillic character. How do I know? Because I know the character set 
context. It is a wchar_t implementation of the Shift-JIS code page 932.

The difference is the declaration of the standard one uses to interpret what 
the 16-bit units mean. In a "Unicode 16-bit string" I go to the Unicode 
Standard to figure out how to interpret the numbers. In a "wide Code Page 932 
string" I go to the specification of Code Page 932 to figure out how to 
interpret the numbers.
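
To make that concrete, here is a small Python sketch that reads the same four 16-bit units under each declaration. The helper names as_unicode and as_wide_cp932 are invented for this example. The code also assumes a particular wide layout for Code Page 932 (lead byte in the high 8 bits, trail byte in the low 8 bits, single-byte codes stored as-is), which may not match every wchar_t implementation. It relies on Python's built-in "cp932" codec.

```python
def as_unicode(units):
    """Interpret each 16-bit unit as a Unicode code point.

    Hypothetical helper: assumes the units contain no surrogates.
    """
    return "".join(chr(u) for u in units)


def as_wide_cp932(units):
    """Interpret each 16-bit unit as one Code Page 932 character.

    Assumption for this sketch: a double-byte code is stored with its
    lead byte in the high 8 bits; a single-byte code fits in the low 8.
    """
    chars = []
    for u in units:
        raw = bytes([u]) if u <= 0xFF else u.to_bytes(2, "big")
        chars.append(raw.decode("cp932"))
    return "".join(chars)


units = [0x0061, 0x0062, 0x88EA, 0x8440]

unicode_reading = as_unicode(units)    # "a", "b", then U+88EA and U+8440
cp932_reading = as_wide_cp932(units)   # "a", "b", then U+4E00 and U+0410
```

The numbers are identical in both calls; only the standard consulted changes. Read as Code Page 932, the last two units are the Han character U+4E00 and the Cyrillic letter U+0410. Read as Unicode, they are the unrelated code points U+88EA and U+8440.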

This is no different, really, than talking about a "Latin-1 string" versus a 
"KOI-8 string".

--Ken
