Mark Davis <[EMAIL PROTECTED]> wrote:

> Part of the problem is that the term "UTF-16" means two different
> things. Let me see if I can make it clearer.
>
> Let "UTF-16M" refer to the in-memory form, which is sequence of
> 16-bit code units. The byte ordering is logically immaterial, since
> it is not a sequence of bytes. Such a sequence does not use a BOM.
> The code point sequence <U+1234 U+0061 U+10000> is represented as
> the UTF-16M sequence <0x1234 0x0061 0xD800 0xDC00>.
>
> Let "UTF-16", on the other hand, refer to only the byte-serialized
> form.

I think I understand the difference between the CEF called "UTF-16" and
the CES called "UTF-16." That isn't where I'm having a problem.

> The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
> as one of:
> <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOB

*This* is where I'm having a problem. Mark states here, again, that
BOM-less UTF-16 (the CES) must be big-endian. That is:

<0x34 0x12 0x61 0x00 0x00 0xD8 0x00 0xDC> // MOBless

is not an instance of any valid CES. That, to me, is a change from what
Unicode has stated before, and from what Ken just said about using
"other information" (which could include external tagging, knowledge of
the originating platform, or heuristics) to determine the intended byte
order.

Remember, I like the BOM. I happen to think it's a useful indicator of
both file type and byte order (not really two different topics). But I
do think the official deprecation, or omission from mention, of BOM-less
little-endian UTF-16 is a change from past definitions that renders
nonconformant a potentially large amount of existing UTF-16 data.

-Doug Ewell
 Fullerton, California
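
For anyone who wants to reproduce the byte sequences discussed above, here is a minimal Python sketch. The helper names to_code_units and serialize are hypothetical, made up for this illustration; the sketch only shows how the same 16-bit code units yield the four serializations in question (BOMless, BOM, MOB, MOBless) and takes no position on which of them is a conformant CES.

```python
def to_code_units(code_points):
    """Turn code points into 16-bit code units (the in-memory form)."""
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))
            units.append(0xDC00 | (offset & 0x3FF))
        else:
            units.append(cp)
    return units


def serialize(units, big_endian=True, bom=False):
    """Serialize 16-bit code units to bytes in the chosen byte order."""
    order = "big" if big_endian else "little"
    if bom:
        # U+FEFF written in the chosen order gives FE FF or FF FE.
        units = [0xFEFF] + list(units)
    return b"".join(u.to_bytes(2, order) for u in units)


def show(label, data):
    print(label.ljust(8), " ".join(f"0x{b:02X}" for b in data))


# The code point sequence <U+1234 U+0061 U+10000> from the message.
units = to_code_units([0x1234, 0x0061, 0x10000])

show("BOMless", serialize(units, big_endian=True, bom=False))
show("BOM", serialize(units, big_endian=True, bom=True))
show("MOB", serialize(units, big_endian=False, bom=True))
show("MOBless", serialize(units, big_endian=False, bom=False))
```

The last line printed is the BOM-less little-endian sequence whose status is in dispute. The bytes are trivially producible either way; the question is only whether they may be labeled "UTF-16" without a BOM.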