Re: [Tutor] name shortening in a csv module output

Dave Angel Thu, 23 Apr 2015 10:05:34 -0700

On 04/23/2015 06:37 AM, Jim Mooney wrote:

..

ï»¿

is the UTF-8 BOM (byte order mark) interpreted as Latin 1.

If the input is UTF-8 you can get rid of the BOM with

with open("data.txt", encoding="utf-8-sig") as csvfile:


Peter Otten

I caught the bad arithmetic on name length, but where is the byte order
mark coming from? My first line is plain English so far as I can see - no
umlauts or foreign characters.
first_name|last_name|email|city|state or region|address|zip

Is this an artifact of csv module output, or is it the data from
generatedata.com, which looks global? More likely it means I have to figure
out unicode ;'(

A file is always stored as bytes, so if it's a text file, it is always an encoded file (although if it's ASCII, you tend not to think of that much).

So whatever program writes that file has picked an encoding, and when you read it you have to use the same encoding to safely read it into text.

By relying on the default when you read it, you're making an unspoken assumption about the encoding of the file.
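To see which assumption you are making, you can ask Python what it picks when you leave out the encoding argument. This is only a sketch for Python 3: the temporary file and its name are made up for illustration, and the value reported depends on your platform and locale settings.

```python
import locale
import os
import tempfile

# What Python 3 falls back on when open() gets no encoding argument.
default = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, just for the demonstration.
    path = os.path.join(tmpdir, "data.txt")
    with open(path, "wb") as f:
        f.write(b"first_name|last_name\n")

    # No encoding given, so the platform default is silently used.
    with open(path) as f:
        used = f.encoding

print("locale default:", default)
print("open() used:   ", used)
```

On many Windows setups this reports a cp125x code page rather than UTF-8, which is how a UTF-8 BOM ends up showing as three odd characters.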

There are dozens of common encodings out there, and anytime you get the wrong one, you're likely to mess up somewhere, unless it happens to be pure ASCII.

The BOM is not supposed to be used in a byte-encoded file, but Notepad, among other programs, writes one anyway. So it happens to be a good clue that the rest of the file is encoded in utf-8. If that's the case, and if you want to strip the BOM, use utf-8-sig.
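Here is a small sketch of the difference, using bytes in memory instead of a real file. The sample row is invented (only the header fields come from your file), and I'm assuming Python 3.

```python
import codecs
import csv
import io

# Hypothetical sample: a pipe-delimited file as an editor that writes
# a UTF-8 BOM would save it.
raw = codecs.BOM_UTF8 + b"first_name|last_name|email\nJim|Mooney|jim@example.com\n"

# Wrong encoding: the three BOM bytes become three visible characters
# glued to the front of the first field.
wrong = raw.decode("latin-1")
print(repr(wrong[:13]))

# Plain utf-8 decodes correctly but keeps the BOM as U+FEFF.
plain = raw.decode("utf-8")
print(repr(plain[:11]))

# utf-8-sig drops a leading BOM if there is one.
clean = raw.decode("utf-8-sig")
header = next(csv.reader(io.StringIO(clean), delimiter="|"))
print(header)
```

utf-8-sig also reads UTF-8 files that have no BOM, so it is a reasonable choice when you aren't sure which kind you were handed.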

Note: the BOM may be legal in utf-8 now, but it was originally intended to distinguish the UTF-32-BE from UTF-32-LE, as well as UTF-16-BE from UTF-16-LE.
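If you want to check the first bytes of a file yourself, the codecs module has constants for these marks. The helper below is a rough sketch, not a real encoding detector: sniff_bom is a name I made up, and a file without a BOM simply gets None.

```python
import codecs

# UTF-32 marks go first: the UTF-32-LE mark begins with the same two
# bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec
    return None


# The "utf-16" codec writes a BOM when encoding and consumes it when decoding.
sample = "first_name|last_name".encode("utf-16")
codec = sniff_bom(sample)
print(codec, sample.decode(codec))
```

The codec names returned here all strip the mark while decoding, so the text you get back starts with the real first character.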


https://docs.python.org/2/library/codecs.html#encodings-and-unicode

--
DaveA
