Yves wrote, in response to Doug:

> > > The last time I read the Unicode standard UTF-16 was big endian
> > > unless a BOM was present, and that's what I expected from a UTF-16
> > > converter.
> >
> > Conformance requirement C2 (TUS 3.0, p. 37) says:
> >
> > "The Unicode Standard does not specify any order of bytes inside a
> > Unicode value."
>
> (I posted the previous email hastily it seems.)
>
> But wait. Same page, 3 lines below, conformance requirement C3 says:
>
> "A process shall interpret a Unicode value that has been serialized into a
> sequence of bytes by most significant byte first, in the absence of
> higher-level protocols."
>
> I read this as saying that by default the byte ordering is big endian. Don't
> you?
There is a problem here in that "by default" can be interpreted in different ways, leading to potential confusion. The key point is in D35, p. 47 of TUS 3.0:

  * In UTF-16, <004D 0061 0072 006B> is serialized as
    <FF FE 4D 00 61 00 72 00 6B 00>, <FE FF 00 4D 00 61 00 72 00 6B>,
    or <00 4D 00 61 00 72 00 6B>.

The third instance cited above is the *unmarked* case -- what you get if you have no explicit marking of byte order with the BOM signature. The contrasting byte sequence <4D 00 61 00 72 00 6B 00> would be illegal in the UTF-16 encoding scheme. [It is, of course, perfectly legal UTF-16LE.]

The intent of all this is that if you run into serialized UTF-16 data, in the absence of any other information, you should assume big-endian order and interpret it accordingly. The "other information" (or "higher-level protocol") could consist of text labelling (as in MIME labels) or other out-of-band information. It could even consist of just knowing the CPU endianness of the platform you are running on (e.g., knowing whether you are compiled with BYTESWAP on or off :-) ). And, of course, it is always possible for the interpreting process to run a heuristic over the byte stream, and use *that* as the other information to determine that the stream is little-endian UTF-16 (i.e. UTF-16LE) rather than big-endian.

And a lot of the text in the standard about being neutral between byte orders is the result of the political intent of the standard, way back when, to deliberately not favor either big-endian or little-endian CPU architectures, and to allow use of native integer formats to store characters on either platform type.

Again, as with many of the issues being discovered by the corps of Unicode exegetes out there, part of the problem is the distortion that has crept into the normative definitions as Unicode has evolved from a 16-bit encoding to a 21-bit encoding with 3 encoding forms and 7 encoding schemes.

To lift the veil again a little on the Unicode 4.0 editorial work -- here, for example, is some suggested text that the editorial committee is working on to clarify the UTF-16 encoding form and the UTF-16 encoding scheme. [This text is suggested draft only, so don't go running off claiming conformance to it yet!]

For the UTF-16 character encoding *form*:

"D32 <ital>UTF-16 character encoding form:</ital> the Unicode CEF which assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-X.

  * In UTF-16, <004D, 0430, 4E8C, 10302> is represented as
    <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."

For the UTF-16 character encoding *scheme*:

"D43 <ital>UTF-16 character encoding scheme:</ital> the Unicode CES that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

  * In UTF-16 (the CES), the UTF-16 code unit sequence <004D 0430 4E8C D800 DF02>
    is serialized as <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> or
    <FF FE 4D 00 30 04 8C 4E 00 D8 02 DF> or <00 4D 04 30 4E 8C D8 00 DF 02>."

etc., etc.

There, feel better?

--Ken
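
P.S. For anyone who thinks better in code than in conformance clauses, here is a rough C sketch of the interpretation rule above: honor a BOM signature if one is present, and otherwise fall back to big-endian unless some higher-level protocol tells you differently. This is an illustration only, not text from the standard or from any particular library -- the names (utf16_detect_order, utf16_read_unit) are made up for the occasion, and there is no error checking to speak of.

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { UTF16_BE, UTF16_LE } Utf16Order;

    /*
     * Decide the byte order of a serialized UTF-16 byte stream.  A BOM
     * signature (FE FF or FF FE) wins if present, and *offset is advanced
     * past it.  Otherwise the stream is the unmarked case and, absent a
     * higher-level protocol, is interpreted as big-endian.
     */
    static Utf16Order utf16_detect_order(const uint8_t *bytes, size_t len,
                                         size_t *offset)
    {
        *offset = 0;
        if (len >= 2) {
            if (bytes[0] == 0xFE && bytes[1] == 0xFF) { *offset = 2; return UTF16_BE; }
            if (bytes[0] == 0xFF && bytes[1] == 0xFE) { *offset = 2; return UTF16_LE; }
        }
        return UTF16_BE;   /* unmarked case: assume big-endian */
    }

    /* Read one 16-bit code unit at byte position i, honoring the byte order. */
    static uint16_t utf16_read_unit(const uint8_t *bytes, size_t i,
                                    Utf16Order order)
    {
        return (order == UTF16_BE)
            ? (uint16_t)((bytes[i] << 8) | bytes[i + 1])
            : (uint16_t)((bytes[i + 1] << 8) | bytes[i]);
    }

Fed the three serializations of "Mark" above, this reads all of them correctly; fed the illegal unmarked sequence <4D 00 61 00 72 00 6B 00>, it dutifully follows the rule and comes back with the code units <4D00 6100 7200 6B00> -- which is exactly why a byte-order heuristic can be worth running when you suspect mislabeled data.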
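
And since the draft D32/D43 wording is itself still just words, here is the mirror-image sketch for the producing side: the encoding *form* step that turns a scalar value into one code unit or a surrogate pair, and the encoding *scheme* step that serializes code units as bytes, big- or little-endian, with or without a BOM. Same caveats as before: invented names (utf16_encode_form, utf16_serialize), no conformance claims, and it assumes the caller only hands it valid Unicode scalar values.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * UTF-16 encoding form (cf. draft D32): map one Unicode scalar value to
     * one code unit, or to a surrogate pair for U+10000..U+10FFFF.
     * Assumes a valid scalar value (so never D800..DFFF).
     * Returns the number of code units written (1 or 2).
     */
    static size_t utf16_encode_form(uint32_t scalar, uint16_t units[2])
    {
        if (scalar <= 0xFFFF) {              /* BMP: U+0000..U+D7FF, U+E000..U+FFFF */
            units[0] = (uint16_t)scalar;
            return 1;
        }
        scalar -= 0x10000;                   /* 20 bits remain */
        units[0] = (uint16_t)(0xD800 + (scalar >> 10));    /* high surrogate */
        units[1] = (uint16_t)(0xDC00 + (scalar & 0x3FF));  /* low surrogate  */
        return 2;
    }

    /*
     * UTF-16 encoding scheme (cf. draft D43): serialize code units as bytes,
     * big-endian or little-endian, optionally preceded by a BOM signature.
     * Returns the number of bytes written; out must be large enough.
     */
    static size_t utf16_serialize(const uint16_t *units, size_t n,
                                  int little_endian, int with_bom, uint8_t *out)
    {
        size_t k = 0;
        if (with_bom) {
            /* U+FEFF serialized in the chosen byte order */
            out[k++] = little_endian ? 0xFF : 0xFE;
            out[k++] = little_endian ? 0xFE : 0xFF;
        }
        for (size_t i = 0; i < n; i++) {
            uint8_t hi = (uint8_t)(units[i] >> 8), lo = (uint8_t)(units[i] & 0xFF);
            out[k++] = little_endian ? lo : hi;
            out[k++] = little_endian ? hi : lo;
        }
        return k;
    }

Running utf16_encode_form on U+10302 yields <D800 DF02>, and serializing <004D 0430 4E8C D800 DF02> big-endian with a BOM reproduces the <FE FF 00 4D 04 30 4E 8C D8 00 DF 02> byte sequence cited in the D43 example.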