On 04/23/2015 06:37 AM, Jim Mooney wrote:
..

Ï»¿

is the UTF-8 BOM (byte order mark) interpreted as Latin 1.

If the input is UTF-8 you can get rid of the BOM with

with open("data.txt", encoding="utf-8-sig") as csvfile:


Peter Otten

I caught the bad arithmetic on name length, but where is the byte order
mark coming from? My first line is plain English so far as I can see - no
umlauts or foreign characters.
first_name|last_name|email|city|state or region|address|zip

Is this an artifact of csv module output, or is it the data from
generatedata.com, which looks global? More likely it means I have to figure
out unicode ;'(

A file is always stored as bytes, so if it's a text file, it is always an encoded file (although if it's pure ASCII, you tend not to think about that much).

So whatever program writes that file has picked an encoding, and when you read it you have to use the same encoding to safely read it into text.

By relying on the default when you read it, you're making an unspoken assumption about the encoding of the file.
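To see which assumption you are making, you can ask Python what it would typically use when the encoding argument is left out. This is only a small sketch; the answer depends on your platform and settings, so no particular output is shown, and the name default_encoding is just for illustration.

```python
import locale

# When open() is called in text mode without an encoding argument,
# Python 3 has traditionally fallen back to the locale's preferred
# encoding (unless UTF-8 mode is enabled).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If the name printed is not the encoding the file was written in, pass the right one to open() explicitly.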

There are dozens of common encodings out there, and anytime you get the wrong one, you're likely to mess up somewhere, unless it happens to be pure ASCII.
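As a rough illustration of what "the wrong one" looks like, the sketch below encodes a made-up name as UTF-8 and then decodes the bytes two ways. The name and the variable names are invented for the example; they are not from the data file.

```python
# A name with an umlaut ("Mueller" spelled with u-umlaut), encoded the
# way a program writing UTF-8 would store it.
raw = "M\u00fcller".encode("utf-8")

right = raw.decode("utf-8")     # same codec as the writer: round-trips
wrong = raw.decode("latin-1")   # wrong guess: no error, just wrong text

# Pure ASCII bytes mean the same thing in both codecs.
ascii_raw = "first_name".encode("utf-8")
same = ascii_raw.decode("utf-8") == ascii_raw.decode("latin-1")

print(repr(right), repr(wrong), same)
```

Note that the wrong guess does not raise an exception here; Latin-1 accepts every byte value, so the damage is silent.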

The BOM is not supposed to be used in a byte-oriented encoding such as UTF-8, but Notepad, among other programs, writes one anyway. So it happens to be a good clue that the rest of the file is encoded in UTF-8. If that's the case, and if you want to strip the BOM, open the file with the utf-8-sig codec.
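Here is a minimal sketch of the difference between the two codecs, assuming a pipe-delimited file like the one in the question. It writes a throwaway stand-in file with a BOM in front of the header line, so the path and the helper first_field() are made up for the example.

```python
import codecs
import csv
import os
import tempfile

header = "first_name|last_name|email|city|state or region|address|zip"

# Stand-in for the data file: a UTF-8 BOM followed by the header line.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + header.encode("utf-8") + b"\n")

def first_field(encoding):
    """Return the first field of the first row, read with the given codec."""
    # newline="" is what the csv module documentation recommends.
    with open(path, encoding=encoding, newline="") as csvfile:
        return next(csv.reader(csvfile, delimiter="|"))[0]

with_bom = first_field("utf-8")         # BOM survives as U+FEFF
without_bom = first_field("utf-8-sig")  # BOM is removed by the codec

os.remove(path)
print(repr(with_bom), repr(without_bom))
```

With plain utf-8 the first field carries an invisible U+FEFF in front of it, which is enough to throw off things like a name-length calculation. utf-8-sig also reads files that have no BOM, so it is a reasonably safe choice when you are not sure.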

Note: the BOM may be legal in UTF-8 now, but it was originally intended to distinguish UTF-32-BE from UTF-32-LE, as well as UTF-16-BE from UTF-16-LE.

https://docs.python.org/2/library/codecs.html#encodings-and-unicode
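To make the byte order point in the note above concrete, this small sketch prints the BOM constants that the codecs module provides. The dictionary boms is only there for the example.

```python
import codecs

# The same code point, U+FEFF, serialised in each encoding form.
boms = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}

for name, bom in boms.items():
    print(name, bom.hex())
```

In the 16-bit and 32-bit forms the bytes come out in opposite orders for LE and BE, which is what lets a reader tell the two apart. In UTF-8 there is only one possible order, so the BOM there acts only as a signature.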

--
DaveA
