Tom Lane wrote:
Note that the reference to byte order betrays the implicit context
assumption: that we're talking about UTF16 or UTF32 representation.
Note that there is no implicit context assumption in the Unicode FAQ. It covers UTF-8, UTF-16 and UTF-32 equally.
Another quote:
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes /no/ difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is /only/ used as a signature --- an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used /transparently/ in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.
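
To make the FAQ answer concrete, here is a minimal Python sketch (the function name strip_utf8_bom is made up for this example and is not taken from any existing tool). It treats a leading EF BB BF purely as a signature and drops it, leaving the remaining bytes untouched, and it shows how the signature hides a "#!" line from a consumer that expects ASCII at the start:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A shell script saved by an editor that writes a UTF-8 signature.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# With the BOM in place, the file no longer starts with "#!".
print(script.startswith(b"#!"))

# After stripping the signature, it does again.
print(strip_utf8_bom(script).startswith(b"#!"))
```

When the input is decoded to text anyway, Python's standard "utf-8-sig" codec should give the same effect: it skips one leading BOM if present and otherwise behaves like plain UTF-8.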

BOM is useless in UTF8, no matter what Microsoft thinks.  Any tool that
relies on it to detect UTF8 data has to have a workaround for overriding
that detection, or it's broken to the point of uselessness.
This kind of brokenness currently exists the other way around (see my reference to the Perl script I'm using to work around it).

Note also that I'm not citing a Microsoft FAQ but the Unicode FAQ.
I'm also not trying to convert Postgres into a Microsoft tool (I'm pretty happy it isn't), but I'm pointing to existing compatibility issues on a platform that others have decided to support. As I belong to the huge group of users who have little or no choice in what OS they use, and come from a country where plain ASCII isn't enough to cover all existing characters, this is probably fair.

It's a pity that the Unicode standard actually allows something that can cause problems, but blaming the non-platform again doesn't solve the existing issues.

Regards,

Brar
