Re: Guessing the encoding from a BOM

Mark Lawrence Fri, 17 Jan 2014 01:13:31 -0800

On 17/01/2014 01:40, Tim Chase wrote:

On 2014-01-17 11:14, Chris Angelico wrote:

UTF-8 specifies the byte order
as part of the protocol, so you don't need to mark it.


You don't need to mark it when writing, but some idiots use it
anyway.  If you're sniffing a file for purposes of reading, you need
to look for it and remove it from the actual data that gets returned
from the file--otherwise, it shows up in your data as corruption.  I
end up with lots of CSV files from customers who have polluted them
with Notepad or had Excel insert a UTF-8 BOM when exporting.  This
means my first column-name gets the BOM prefixed onto it when the
file is passed to csv.DictReader, grr.
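
For illustration, a minimal Python 3 sketch of the symptom described above and one possible way around it, using the standard "utf-8-sig" codec, which drops a leading BOM if one is present and otherwise behaves like plain UTF-8. The sample bytes and the read_rows helper are made up for this example and are not from the original post.

```python
import csv
import io

# Hypothetical sample: a UTF-8 CSV export that starts with a BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,amount\r\nalice,10\r\n"


def read_rows(data, encoding):
    """Decode CSV bytes with the given codec and return a list of row dicts."""
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.DictReader(text))


# Plain "utf-8": the BOM survives as U+FEFF glued onto the first column name.
plain = read_rows(raw, "utf-8")

# "utf-8-sig": the codec consumes the BOM, so the header comes out clean.
clean = read_rows(raw, "utf-8-sig")

print(list(plain[0]))
print(list(clean[0]))
```

Since "utf-8-sig" also reads BOM-less UTF-8 unchanged, it should be safe to use for files where you don't know in advance whether the exporter added a BOM.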

-tkc

My code that used to handle CSV files from M$ Money had to allow for a single NUL byte right at the end of the file. Thankfully I've now moved on to gnucash.
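
A rough sketch of that kind of workaround in Python 3, assuming the only stray byte is one NUL at the very end of the file. The sample data and the strip_trailing_nul helper are invented for illustration; this is not the code referred to above.

```python
import csv
import io

# Hypothetical sample: CSV bytes with one stray NUL byte at the very end.
raw = b"date,payee\r\n2014-01-17,Grocer\r\n\x00"


def strip_trailing_nul(data):
    """Drop a single NUL byte from the end of the data, if there is one."""
    return data[:-1] if data.endswith(b"\x00") else data


# Clean the bytes first, then decode and hand the text to the csv module.
text = strip_trailing_nul(raw).decode("utf-8")
rows = list(csv.DictReader(io.StringIO(text, newline="")))
print(rows)
```

Checking only the final byte keeps the fix narrow: a NUL anywhere else in the file is left alone and would still surface as a problem.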

Slight aside, any chance of changing the subject of this thread, or even ending the thread completely? Why? Every time I see it I picture Inspector Clouseau, "A BOM!!!" :)

--

My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.


Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list
