Philippe Verdy said:

> Well then I don't know why you need a definition of an "Unicode 16-bit
> string". For me it just means exactly the same as "16-bit string", and
> the encoding in it is not relevant given you can put anything in it
> without even needing to be conformant to Unicode. So a Java string is
> exactly the same, a 16-bit string. The same also as Windows API 16-bit
> strings, or "wide strings" in a C compiler where "wide" is mapped by a
> compiler option to 16-bit code units for wchar_t ...
And elaborating on Mark's response a little:

  [0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]

is a "Unicode 16-bit string". It contains "a", "b", a Han character, a noncharacter, and a Cyrillic character. Because it is also well-formed as UTF-16, it is also a "UTF-16 string", by the definitions in the standard.

  [0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]

is a "Unicode 16-bit string". It contains "a", a high-surrogate code unit, a Han character, a noncharacter, and a Cyrillic character. Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is *NOT* a "UTF-16 string".

On the other hand, consider:

  [0x0061, 0x0062, 0x88EA, 0x8440]

That is *NOT* a Unicode 16-bit string. It contains "a", "b", a Han character, and a Cyrillic character. How do I know? Because I know the character set context: it is a wchar_t implementation of the Shift-JIS Code Page 932.

The difference is the declaration of the standard one uses to interpret what the 16-bit units mean. In a "Unicode 16-bit string" I go to the Unicode Standard to figure out how to interpret the numbers. In a "wide Code Page 932 string" I go to the specification of Code Page 932 to figure out how to interpret the numbers.

This is no different, really, than talking about a "Latin-1 string" versus a "KOI-8 string".

--Ken
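[A minimal sketch of the distinction above, not from the original post: modeling a "Unicode 16-bit string" as a plain list of 16-bit code units and checking whether it is *also* well-formed UTF-16. The rule applied is the one Ken describes: every high surrogate (0xD800-0xDBFF) must be immediately followed by a low surrogate (0xDC00-0xDFFF), and no low surrogate may appear on its own; noncharacters like 0xFFFF do not affect well-formedness. The function name is my own.]

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is well-formed UTF-16."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be immediately followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2  # skip the valid surrogate pair
        elif 0xDC00 <= u <= 0xDFFF:
            # Unpaired low surrogate is never allowed.
            return False
        else:
            i += 1  # ordinary BMP code unit (noncharacters are still well-formed)
    return True

# The two "Unicode 16-bit string" examples from the post:
print(is_well_formed_utf16([0x0061, 0x0062, 0x4E00, 0xFFFF, 0x0410]))  # True
print(is_well_formed_utf16([0x0061, 0xD800, 0x4E00, 0xFFFF, 0x0410]))  # False
```

Note that the CP932 example would also pass this check numerically; nothing in the numbers themselves tells you which standard to consult, which is exactly the point about declared interpretation.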