Re: [HACKERS] Support UTF-8 files with BOM in COPY FROM

Brar Piening Mon, 26 Sep 2011 22:50:33 -0700

Tom Lane wrote:

Note that the reference to byte order betrays the implicit context
assumption: that we're talking about UTF16 or UTF32 representation.

Note that there is no implicit context assumption in the Unicode FAQ. It covers UTF-8, UTF-16 and UTF-32 equally.

Another quote:

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes /no/ difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is /only/ used as a signature --- an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used /transparently/ in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts.
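
To make the FAQ answer concrete, here is a small sketch in Python (chosen only for illustration, nothing in this thread depends on it). It shows that U+FEFF has a single fixed three-byte form in UTF-8, while its UTF-16 form depends on byte order, and that a leading BOM hides an ASCII prefix such as "#!" from a byte-level check.

```python
# The BOM is the code point U+FEFF.
BOM = "\ufeff"

# In UTF-8 there is exactly one encoding of U+FEFF, so it carries
# no byte-order information and acts only as a signature.
utf8_bom = BOM.encode("utf-8")

# In UTF-16 the same code point is serialized differently depending
# on endianness, which is where the name "byte order mark" comes from.
utf16_le_bom = BOM.encode("utf-16-le")
utf16_be_bom = BOM.encode("utf-16-be")

# A file that starts with a BOM no longer starts with its expected
# ASCII characters when inspected as raw bytes.
script = utf8_bom + b"#!/bin/sh\n"
starts_with_shebang = script.startswith(b"#!")

print(utf8_bom.hex(), utf16_le_bom.hex(), utf16_be_bom.hex())
print(starts_with_shebang)
```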


BOM is useless in UTF8, no matter what Microsoft thinks.  Any tool that
relies on it to detect UTF8 data has to have a workaround for overriding
that detection, or it's broken to the point of uselessness.

This kind of brokenness currently exists the other way around (see my reference to the Perl script I'm using to work around it).
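
For readers who want the general shape of such a workaround: the Perl script itself is not reproduced in this message, so the following is only a minimal sketch of the same idea in Python, not that script. The function name copy_without_bom is made up for this example. It drops a leading UTF-8 BOM, if there is one, and passes everything else through unchanged.

```python
import codecs
import io
import shutil


def copy_without_bom(src, dst):
    """Copy a binary stream to dst, dropping a leading UTF-8 BOM if present.

    copy_without_bom is a hypothetical helper name used for illustration.
    Both arguments are expected to be binary file-like objects.
    """
    # Look at the first three bytes only; the BOM is EF BB BF.
    head = src.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        # No BOM: these bytes are real data and must be kept.
        dst.write(head)
    # Stream the rest without loading the whole file into memory.
    shutil.copyfileobj(src, dst)


# Example with in-memory streams standing in for files.
with_bom = io.BytesIO(codecs.BOM_UTF8 + "id,name\n1,Müller\n".encode("utf-8"))
cleaned = io.BytesIO()
copy_without_bom(with_bom, cleaned)
```

Called with sys.stdin.buffer and sys.stdout.buffer, something like this could sit as a filter in front of a COPY ... FROM STDIN. When decoding to text anyway, Python's "utf-8-sig" codec gives the same effect, since it skips an initial BOM on decode.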


Note also that I'm not citing a Microsoft FAQ but the Unicode FAQ.

I'm also not trying to convert Postgres into a Microsoft tool (I'm pretty happy it isn't), but I'm pointing to existing compatibility issues on a platform that others have decided to support. Belonging to the huge group of users who have little or no choice in what OS they are using, and being from a country where plain ASCII isn't enough to cover all existing characters, this is probably fair.

It's a pity that the Unicode standard actually allows something that can cause problems, but blaming the non-platform again doesn't solve the existing issues.


Regards,

Brar
