On Wednesday 2002-04-10, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> There, feel better?

Not really.  I'm getting the sense, on one hand, that UTF-16, sans BOM,
can be big-endian or little-endian depending on the platform; on the
other hand, that little-endian UTF-16 isn't "legal" unless it has a
BOM; and on the third hand (!), that all this still hasn't been fully
thought out.

(In the following text, I will deliberately spell out "big-endian" and
"little-endian" instead of using the handy abbreviations "BE" and "LE,"
because those refer to the specifically defined encoding schemes
UTF-16BE and UTF-16LE and I don't always mean to do that.)

> * In UTF-16, <004D 0061 0072 006B> is serialized as
> <FF FE 4D 00 61 00 72 00 6B 00>, <FE FF 00 4D 00 61 00 72 00 6B>, or
> <00 4D 00 61 00 72 00 6B>.
>
> The third instance cited above is the *unmarked* case -- what
> you get if you have no explicit marking of byte order with the BOM
> signature. The contrasting byte sequence <4D 00 61 72 00 6B 00>
> would be illegal in the UTF-16 encoding scheme.

You mean because of the missing 00 byte?  (Rim shot.)

> [It is, of course,
> perfectly legal UTF-16LE.]

I don't know, looks to me like a perfectly good sequence of four CJK
ideographs.  (Rim shot.)

No, but seriously, folks.  Can we interpret the UTF-16 encoding
*scheme* -- we're not talking about *form* here, since that has nothing
to do with byte order -- as being platform-endian, or does it absolutely
have to be big-endian?  Because if it has to be big-endian, even on a
little-endian platform, then there's an awful lot of non-conformant
"UTF-16" lurking around in Windows NT (e.g. NTFS filenames).

> The intent of all this is that if you run into serialized UTF-16 data,
> in the absence of any other information, you should assume and
> interpret it as big-endian order. The "other information" (or
> "higher-level protocol") could consist of text labelling (as
> in MIME labels) or other out-of-band information. It could even
> consist of just knowing what the CPU endianness of the platform
> you are running on is (e.g., knowing whether you are compiled
> with BYTESWAP on or off :-) ). And, of course, it is always
> possible for the interpreting process to perform a data heuristic
> on the byte stream, and use *that* as the other information to
> determine that the byte stream is little-endian UTF-16 (i.e.
> UTF-16LE), rather than big-endian.

That's quite different from Yves' original statement that "UTF-16 is
big-endian unless a BOM is present."
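
Still, the detection order Ken describes is easy enough to state.
Here's a rough sketch; the zero-byte census is my own strawman
heuristic, not anything blessed by the standard:

    def guess_utf16_byte_order(data):
        # 1. A BOM, if present, settles it.
        if data[:2] == b"\xfe\xff":
            return "big"
        if data[:2] == b"\xff\xfe":
            return "little"
        # 2. Strawman heuristic: in mostly-Latin text, the zero (high)
        #    byte of each code unit comes first if big-endian, second
        #    if little-endian.
        zeros_even = sum(1 for b in data[0::2] if b == 0)
        zeros_odd  = sum(1 for b in data[1::2] if b == 0)
        if zeros_even > zeros_odd:
            return "big"
        if zeros_odd > zeros_even:
            return "little"
        # 3. Otherwise, the standard's default: assume big-endian.
        return "big"

A real implementation would consult MIME labels, knowledge of the
originating platform, and so on before falling through to the default.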

> And a lot of the text in the standard about being neutral between
> byte orders is the result of the political intent of the standard,
> way back when, to deliberately not favor either big-endian or
> little-endian CPU architectures, and to allow use of native
> integer formats to store characters on either platform type.

This is a bit troubling.  It seems to imply that the decision "way back
when" to be neutral about byte order was merely a political gesture to
get the little-endian guys on board, and that the rules are changing
somewhat to favor the big-endian guys.

> Again, as for many of these kinds of issues being discovered by
> the corps of Unicode exegetes out there, part of the problem is
> the distortion that has set in for the normative definitions in
> the standard as Unicode has evolved from a 16-bit encoding to
> a 21-bit encoding with 3 encoding forms and 7 encoding schemes.

No argument there.  There are still plenty of common-man
interpretations, and plenty of text in TUS 3.0, that treat UTF-16 as the
"one true" encoding form of Unicode.  I know this is being cleaned up
for 4.0; I just hope public perceptions will follow.

> For the UTF-16 character encoding *form*:
>
> "D32 <ital>UTF-16 character encoding form:</ital> the Unicode
> CEF which assigns each Unicode scalar value in the ranges U+0000..
> U+D7FF and U+E000..U+FFFF to a single 16-bit code unit with the
> same numeric value as the Unicode scalar value, and which assigns
> each Unicode scalar value in the range U+10000..U+10FFFF to a
> surrogate pair, according to Table 3-X.
>
>   * In UTF-16, <004D, 0430, 4E8C, 10302> is represented as
>     <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds
>     to U+10302."

Fine.  I don't think there are any questions concerning UTF-16 as a CEF.
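
For the record, the D32 mapping is mechanical enough to state in a few
lines.  This is my transcription; it assumes the scalar value is valid
and not itself a surrogate:

    def utf16_code_units(scalar):
        # BMP scalar values map to a single code unit of the same value.
        if scalar <= 0xFFFF:
            return [scalar]
        # Supplementary scalar values (U+10000..U+10FFFF) map to a
        # surrogate pair.
        v = scalar - 0x10000
        return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

    utf16_code_units(0x10302)   # [0xD800, 0xDF02], matching the example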

> For the UTF-16 character encoding *scheme*:
>
> "D43 <ital>UTF-16 character encoding scheme:</ital> the Unicode
> CES that serializes a UTF-16 code unit sequence as a byte sequence
> in either big-endian or little-endian format.
>
>   * In UTF-16 (the CES), the UTF-16 code unit sequence
>     <004D 0430 4E8C D800 DF02> is serialized as
>     <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or
>     <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or
>     <00 4D 04 30 4E 8C D8 00 DF 02>."

Here the draft text is saying in the description that UTF-16 can be
either big-endian or little-endian, and can include a BOM or omit it.
Four possibilities.  Good.  But then the examples leave out the non-BOM
little-endian serialization, which implies that it is not conformant,
unlike the other three.  Not so good, because (a) the description and examples
don't really match and (b) the examples rule out the possibility of
UTF-16 text that we might know darn well to be little-endian, not
because of a BOM but perhaps because of the other indicators Ken
mentioned: MIME labeling, knowledge of the originating platform,
heuristics, etc.
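
Generating all four combinations -- my sketch of the CES as the
*description* reads, not as the examples do -- makes the omission
plain:

    import struct

    def serialize_utf16(units, byte_order="big", bom=False):
        # Either byte order, BOM optional: four combinations in all.
        fmt = ">H" if byte_order == "big" else "<H"
        prefix = [0xFEFF] if bom else []
        return b"".join(struct.pack(fmt, u) for u in prefix + list(units))

    units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
    for order in ("big", "little"):
        for bom in (True, False):
            print(order, bom, serialize_utf16(units, order, bom).hex(" "))

The (little-endian, no BOM) line is the fourth serialization the
draft's examples leave out.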

The exegesis continues....

-Doug Ewell
 Fullerton, California


