Mark Davis <[EMAIL PROTECTED]> wrote:

>> I must not *call* the sequence "UTF-16," since that term is officially
>> reserved for BOM-marked text which can be either little- or big-endian,
>> or BOMless text which must be big-endian.
>
> Yes, assuming the "BUT" clause applies to (b). That is, the untagged
> byte sequence
>
> 0x4B 0x00 0x65 0x00 0x6E 0x00
>
> could be
> (a) U+4B00 U+6500 U+6E00 ("䬀攀渀"): "UTF-16BE" or "UTF-16"
> (b) U+004B U+0065 U+006E ("Ken"): "UTF-16LE"
> (c) U+004B U+0000 U+0065 U+0000 U+006E U+0000
> ("K<null>e<null>n<null>"): ASCII, UTF-8, CP-1252, etc.
> (d) ...: EBCDIC
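
For anyone who wants to check readings (a) through (c) by hand, here is a minimal Python sketch. The helper name code_points is made up for this illustration, and the codec names are Python's, not anything from the Unicode Standard. Each decode call names its byte order explicitly, so nothing depends on how a bare "utf-16" codec treats BOMless input.

```python
# The untagged six-byte sequence from the example above.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])


def code_points(text):
    """Format a string as a run of U+XXXX code points (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Three readings of the same bytes; only the assumed encoding changes.
readings = {
    "UTF-16BE": data.decode("utf-16-be"),  # reading (a)
    "UTF-16LE": data.decode("utf-16-le"),  # reading (b)
    "CP-1252": data.decode("cp1252"),      # reading (c), one byte per character
}

for name, text in readings.items():
    print(f"{name:9} {code_points(text)}")
```

All three decodes succeed without error, which is the point: nothing in the bytes themselves says which reading the sender intended.
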

Yes, that's what I meant to say.

> Not really arguing, just exploring the issues. But one key is that if
> you are in an environment where untagged data is being exchanged (a
> bad idea, anyway),

But not all mechanisms for exchanging data allow tagging. (Bumper sticker: "UNTAGGED TEXT HAPPENS")

Here's what caused me to exhume this discussion. Ken made a joke:

> -- K '\0' e '\0' n '\0'

(which I enjoyed) in response to the "UNICODE BOMBER STRIKES AGAIN" satire about "blank squares" infiltrating otherwise good text. This representation of "Ken" in untagged, little-endian UTF-16, misinterpreted as a sequence of 8-bit characters, corresponds to Mark's example (c) above.

It *is* a misinterpretation, right? You're not really supposed to read this sequence of six bytes as K '\0' e '\0' n '\0'. That was the whole joke. And in fact, there is only one "correct" interpretation in this example (that is, only one interpretation that matches the sender's intent), and that is U+004B U+0065 U+006E. I contend that U+4B00 U+6500 U+6E00, whether it makes sense semantically in Chinese or not, is just as incorrect in this context as an ASCII, EBCDIC, FIELDATA, or BOCU-1 reading.

Note that everything I said before about this example is true:

- there is no BOM
- there is no external tagging as "UTF-16LE" (or anything else)
- we don't know the native byte orientation of the sender's machine

There's a lot of text like this out there, not all of which is intended as jokes or even illustrations. The Unix and Linux world is very opposed to the use of BOM in plain-text files, and if they feel that way about UTF-8 they probably feel the same about UTF-16.

Note also that heuristics in an example like this can be deceiving. A famous heuristic that applies to this example is to notice that every other byte is 0, and therefore treat the text as UTF-16LE.
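
The every-other-byte-is-0 heuristic can be sketched in a few lines of Python. This is only one way to write it, not a standard algorithm: the function name guess_utf16_order and the simple majority rule are assumptions made for this illustration.

```python
def guess_utf16_order(data: bytes) -> str:
    """Guess UTF-16 byte order from where the zero bytes fall.

    Hypothetical helper: it compares the number of zero bytes at even
    offsets with the number at odd offsets and takes the majority.
    """
    even_zeros = data[0::2].count(0)  # first byte of each 16-bit unit
    odd_zeros = data[1::2].count(0)   # second byte of each 16-bit unit
    if odd_zeros > even_zeros:
        return "utf-16-le"  # zero high bytes come second: little-endian
    if even_zeros > odd_zeros:
        return "utf-16-be"  # zero high bytes come first: big-endian
    return "unknown"


ken_le = b"K\x00e\x00n\x00"               # Ken's joke as sent
ken_be = b"\x00K\x00e\x00n"               # the same text, big-endian
cjk_be = "䬀攀渀".encode("utf-16-be")       # genuine CJK text, big-endian

print(guess_utf16_order(ken_le))
print(guess_utf16_order(ken_be))
print(guess_utf16_order(cjk_be))
```

The first two guesses come out right. The third does not: the big-endian CJK text produces exactly the same six bytes as the little-endian "Ken", so the function has to give the same answer for both, and it is wrong whenever the sender really meant the ideographs.
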
A different heuristic can point the other way. For example, one could take the big-endian interpretation (U+4B00 U+6500 U+6E00), notice that all of these characters are CJK ideographs, and use that to deduce (incorrectly) that the text should be UTF-16BE. What if the text were reversed? ('\0' K '\0' e '\0' n) The latter heuristic would suggest that the text should be UTF-16LE. Heuristics are not perfect, but sometimes they're all we've got.

So Ken's joke is encoded in BOMless, little-endian, non-externally-tagged UTF-16. It's a perfectly legal Unicode representation, but we can't call it "UTF-16" because that term, absent a BOM, implies big-endian.

This sounds legalistic, sort of like the warnings on the Unicode Web site about the correct use of the word "Unicode." But at least I think I understand the issues a little better, and so the exploration effort paid off.

-Doug Ewell
 Fullerton, California