> This is incorrect. Here is a summary of the meaning of those bytes at
> the start of text files with different Unicode encoding forms.
>
> beginning with bytes FE FF:
> - UTF-16 => big endian, omitted from contents
>
> beginning with bytes FF FE:
> - UTF-16 => little endian, omitted from contents
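
In code, the rule Mark summarizes comes down to something like this (a
sketch of my own, not any particular library's API; the function name is
made up):

    #include <stddef.h>
    #include <stdint.h>

    enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

    /* Look at the first bytes of data labelled "UTF-16".  If they are a
     * BOM, report the byte order it signals and how many bytes to skip
     * so that the BOM is omitted from the decoded contents. */
    static enum utf16_order sniff_utf16_bom(const uint8_t *p, size_t len,
                                            size_t *skip)
    {
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
            *skip = 2;
            return UTF16_BIG_ENDIAN;
        }
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
            *skip = 2;
            return UTF16_LITTLE_ENDIAN;
        }
        *skip = 0;                   /* no BOM: nothing to omit           */
        return UTF16_BIG_ENDIAN;     /* and, per the standard, big endian */
    }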
Unfortunately this breaks with popular Unicode libraries like ICU (I am
Cc:ing them here, since I have the opportunity to raise this again),
where UTF-16 is mapped to the platform-endian form. From ICU's
convrtrs.txt file:

    # The ICU UTF-16 converter uses the current platform's endianness.
    # It does not autodetect endianness from a BOM.
    UTF-16 { MIME } UTF16_PlatformEndian ISO-10646-UCS-2 { IANA }
        csUnicode ibm-17584 ibm-13488 ibm-1200 cp1200 ucs-2

(End of excerpt.)

This is typically *very* confusing to new users of Unicode. I wish such
libraries used only a "UTF-16PE" denomination for such a converter, and
handled "UTF-16" per the expectations that Mark described well in his
explanation of how to interpret an FF FE / FE FF sequence of bytes.
Otherwise you end up with people properly labelling as "UTF-16" some
UTF-16 data that carries a BOM, and naive code using the library's
UTF-16 converter (sounds appropriate, right?) failing to decode that
data properly.

In the context of ICU, this is one of my favorite pet peeves, especially
since ICU is usually so a*al about being strict in its interpretation of
a given charset name. The last time I read the Unicode standard, UTF-16
was big-endian unless a BOM was present, and that is what I expected
from a UTF-16 converter.

YA
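
P.S. To make the failure mode concrete, here is what a platform-endian
interpretation does to correctly labelled data on a little-endian
machine (a standalone sketch, not ICU code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "H" in UTF-16, big endian, preceded by a BOM: FE FF 00 48.
         * This is properly labelled "UTF-16" data. */
        const uint8_t bytes[] = { 0xFE, 0xFF, 0x00, 0x48 };

        /* A BOM-honouring converter sees U+FEFF, omits it, and decodes
         * the remaining unit as U+0048 'H'.  A platform-endian converter
         * just reinterprets the bytes as native 16-bit units: */
        uint16_t units[2];
        memcpy(units, bytes, sizeof bytes);
        printf("U+%04X U+%04X\n", (unsigned)units[0], (unsigned)units[1]);

        /* On a little-endian machine this prints "U+FFFE U+4800":
         * not a BOM, and not an 'H'. */
        return 0;
    }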