> This is incorrect. Here is a summary of the meaning of those bytes at
> the start of text files with different Unicode encoding forms.
> 
> beginning with bytes FE FF:
> - UTF-16 => big endian, omitted from contents
> 
> beginning with bytes FF FE:
> - UTF-16 => little endian, omitted from contents

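To make that table concrete, here is a minimal sketch in plain C of how a
decoder can sniff those leading bytes to pick a byte order and skip the BOM.
The function and enum names are mine, purely for illustration:

#include <stddef.h>

/* Byte order deduced from the first bytes of a UTF-16 stream,
 * per the table quoted above. */
enum utf16_order { UTF16_BE, UTF16_LE, UTF16_NO_BOM };

/* FE FF => big endian, FF FE => little endian; in both cases the
 * two BOM bytes are not part of the text, so *skip is set to 2.
 * With no BOM present, the standard's default is big endian. */
static enum utf16_order sniff_utf16_bom(const unsigned char *buf,
                                        size_t len, size_t *skip)
{
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) { *skip = 2; return UTF16_BE; }
        if (buf[0] == 0xFF && buf[1] == 0xFE) { *skip = 2; return UTF16_LE; }
    }
    return UTF16_NO_BOM; /* caller should interpret as big endian */
}
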
Unfortunately this breaks with popular Unicode libraries like ICU (I am
Cc'ing them here, since I have the opportunity to raise this again), where
UTF-16 is mapped to the platform-endian form:

(From ICU's convrtrs.txt file:)

# The ICU UTF-16 converter uses the current platform's endianness.
# It does not autodetect endianness from a BOM.
UTF-16 { MIME }          UTF16_PlatformEndian ISO-10646-UCS-2 { IANA }
csUnicode ibm-17584 ibm-13488 ibm-1200 cp1200 ucs-2

(End of excerpt.)

This is typically *very* confusing to new users of Unicode. I wish such
libraries reserved a name like "UTF-16PE" for this platform-endian
converter, and made the plain "UTF-16" converter behave per the
expectations that Mark described so well in his explanation of how to
interpret an FF FE / FE FF sequence of bytes. Otherwise you end up with
people correctly labeling BOM-prefixed UTF-16 data as "UTF-16", and naive
code using the library's UTF-16 converter (sounds appropriate, right?)
failing to decode that data properly.
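To illustrate the trap with the ICU C API (this sketch assumes the
platform-endian behavior documented in the convrtrs.txt excerpt above;
later ICU versions may handle the BOM differently), decoding the same
BOM-labeled bytes through the "UTF-16" converter gives platform-dependent
results, while "UTF-16LE" names the byte order explicitly:

#include <stdio.h>
#include <unicode/ucnv.h>

int main(void)
{
    /* "AB" in little-endian UTF-16, preceded by an FF FE BOM. */
    static const char bytes[] = "\xFF\xFE\x41\x00\x42\x00";
    UChar out[16];
    UErrorCode status = U_ZERO_ERROR;

    /* Per the excerpt above, this converter uses the platform's
     * endianness and ignores the BOM, so on a big-endian machine
     * this well-labeled input decodes as garbage. */
    UConverter *cnv = ucnv_open("UTF-16", &status);
    int32_t n = ucnv_toUChars(cnv, out, 16, bytes, sizeof bytes - 1, &status);
    if (U_SUCCESS(status))
        printf("UTF-16:   %d code units\n", (int)n);
    ucnv_close(cnv);

    /* "UTF-16LE" is unambiguous about the byte order. */
    status = U_ZERO_ERROR;
    cnv = ucnv_open("UTF-16LE", &status);
    n = ucnv_toUChars(cnv, out, 16, bytes, sizeof bytes - 1, &status);
    if (U_SUCCESS(status))
        printf("UTF-16LE: %d code units\n", (int)n);
    ucnv_close(cnv);
    return 0;
}

(Compile against ICU with -licuuc; again, the exact behavior depends on
the ICU version.)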

In the context of ICU, this is one of my favorite pet peeves, especially
since ICU is otherwise so strict about the interpretation of a given
charset name. The last time I read the Unicode standard, UTF-16 was big
endian unless a BOM was present, and that's what I expected from a UTF-16
converter.

YA

