Well then I don't know why you need a definition of a "Unicode 16-bit string". For me it means exactly the same thing as "16-bit string", and the encoding in it is not relevant, given that you can put anything in it without even being conformant to Unicode. So a Java string is exactly that, a 16-bit string. The same goes for Windows API 16-bit strings, or for "wide strings" in a C compiler where "wide" is mapped by a compiler option to 16-bit code units for wchar_t (or "short", but more safely uint16_t if you don't want to depend on compiler options or OS environments when you need to manage the exact memory allocation), or for a U-string in Perl.
Only UTF-16 is relevant to Unicode here (not UTF-16BE and UTF-16LE, which are encoding schemes with concrete byte orders and no leading BOM), because a 16-bit string does not by itself specify any encoding scheme or byte order. One confusion comes from the name "UTF-16" also being used for an encoding scheme with a possible leading BOM, where a missing BOM implies a byte order guessed from the first few characters: that encoding scheme (with BOM support and an implicit byte-order guess when it is absent) should have been given a distinct name like "UTF-16XE", reserving "UTF-16" for what the standard discusses as a "16-bit string", except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) plus **no** BOM at this level. Such a string is still not materialized by a concrete byte order or by an implicit size in storage bits, as long as it can store distinctly the whole range of code units 0x0000..0xFFFF minus the few non-characters, enforces all surrogates to be paired, and does not require any character to be allocated. Note that such a relaxed version of UTF-16 would still allow an internal alternate representation of 0x0000 for interoperating with various APIs, without changing the storage requirement: 0xFFFF could perfectly well be used to replace 0x0000 if that last code unit plays a special role as a string terminator. But even if this is done, a storage unit like 0xFFFF would still be perceived as if it were really the code unit 0x0000.
In other words, the concept of a completely relaxed "Unicode 16-bit string" is unneeded, given that its single requirement is that it defines a length in terms of 16-bit code units, with code units large enough to store any unsigned 16-bit value. (Internally a code unit could still be 18-bit on systems with 6-bit or 9-bit addressable memory cells; the sizeof() property of this code unit type could be 2, or 3, or something else, as long as it is large enough to store the value.) On some devices (not so exotic...) there are memory areas that are 4-bit addressable or even 1-bit addressable (in that latter case the sizeof() property for the code unit type would return 16, not 2). Some devices only have 16-bit or 32-bit addressable memory, and sizeof() would return 1 (the C types char and wchar_t would most likely be the same).

2013/1/7 Doug Ewell <d...@ewellic.org>:
> You're right, and I stand corrected. I read Markus's post too quickly.
>
> Mark Davis ☕ <mark at macchiato dot com> wrote:
>
>>> But still non-conformant.
>>
>> That's incorrect.
>>
>> The point I was making above is that in order to say that something is
>> "non-conformant", you have to be very clear what it is "non-conformant" TO.
>>
>>> Also, we commonly read code points from 16-bit Unicode strings, and
>>> unpaired surrogates are returned as themselves and treated as such
>>> (e.g., in collation).
>>
>> + That is conformant for Unicode 16-bit strings.
>>
>> + That is not conformant for UTF-16.
>>
>> There is an important difference.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell