On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
> UTF-8 with BOM is the Microsoft preferred format.

I believe this is a gloss. Microsoft uses UTF-16. Because the basic character unit is larger than one byte, it is crucial for interoperability to prefix a string of UTF-16 text with an indication of the order of bytes in each two-byte unit. This is the role of the BOM. The BOM is not part of the text. It is a wrapper or envelope.

It is a mistake on Microsoft's part to fail to strip the BOM during conversion to UTF-8. There is no MEANINGFUL definition of BOM in a UTF-8 string. But instead of stripping the wrapper and converting only the text payload, Microsoft lazily treats both the wrapper and its payload as text.

You can see the logical fallacy if you imagine emitting UTF-16 text in an environment of one byte sex, reducing that text to UTF-8, carrying it to an environment of the other byte sex, and raising it back to UTF-16. The Unicode.org assumption is that on generation one organizes the bytes of UTF-16 or UTF-32 units according to what is most convenient for a given environment. One prefixes a BOM to text objects that are to be persisted or passed to environments of differing byte sex. Such an object is not a string but a means of inter-operation. If the BOMs are not stripped during reduction to UTF-8 and are reconstituted during raising to UTF-16 or UTF-32, then raising must honor the BOM and the Unicode.org efficiency objective is subverted.

You can take this further and imagine concatenating two UTF-8 strings, one originally UTF-16 generated in a little-endian environment, the other originally UTF-16 generated in a big-endian environment. If the BOMs are not pre-stripped, then raising the concatenated result to UTF-16 yields an object with embedded BOMs. This is not meaningful. What does it mean within a UTF-16 string to encounter a BOM that contradicts the wrapper/envelope? Does this mean that any correct UTF-16 utility must cope with hybrid objects whose byte order potentially changes mid-stride?
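
The concatenation case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any Microsoft code: the helper names lazy_to_utf8, strict_to_utf8 and bom_offsets are hypothetical, and the "lazy" path simply decodes with an explicit byte order so that U+FEFF is carried along as if it were text.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark


def lazy_to_utf8(utf16_bytes, codec):
    """Hypothetical lazy reduction: wrapper and payload are both treated
    as text, so the BOM survives in UTF-8 as the bytes EF BB BF."""
    return utf16_bytes.decode(codec).encode("utf-8")


def strict_to_utf8(utf16_bytes):
    """Hypothetical strict reduction: the generic "utf-16" codec lets the
    BOM select the byte order and then discards it."""
    return utf16_bytes.decode("utf-16").encode("utf-8")


def bom_offsets(utf8_bytes):
    """Character offsets of every U+FEFF in a UTF-8 string (for diagnosis)."""
    return [i for i, ch in enumerate(utf8_bytes.decode("utf-8")) if ch == BOM]


# The same kind of object, generated in an environment of each byte sex.
little = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "xyz".encode("utf-16-be")

# Reduce each to UTF-8, then concatenate the results.
lazy = lazy_to_utf8(little, "utf-16-le") + lazy_to_utf8(big, "utf-16-be")
strict = strict_to_utf8(little) + strict_to_utf8(big)

print(lazy)               # b'\xef\xbb\xbfabc\xef\xbb\xbfxyz'
print(bom_offsets(lazy))  # [0, 4]
print(strict)             # b'abcxyz'
```

With the lazy path, a U+FEFF ends up in the middle of the concatenated string. Python's "utf-8-sig" codec would remove only the leading one on decode, so the second one would be carried into any UTF-16 object raised from it.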

/john, who has written a database loader that has to contend with (and clearly diagnoses) BOM in UTF-8 strings.
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
