On 23/04/15 19:14, Jim Mooney wrote:

By relying on the default when you read it, you're making an unspoken
assumption about the encoding of the file.


So is there any way to sniff the encoding, including the BOM (which appears
to be used or not used randomly for utf-8), so you can then use the proper
encoding, or do you wander in the wilderness?

Pretty much guesswork.

The move from plain old ASCII to Unicode (and others) has made the handling of text much more like binary. You have to know the binary format/encoding to know how to decode binary data. It's the same with text: if you don't know what produced it, and in what format, then you have to guess.
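
As a rough sketch of what that guessing can look like: the BOM, when present, is the one part you can check directly, and after that you are reduced to trying candidates in order. The function names (sniff_bom, guess_decode) and the fallback list below are my own assumptions for illustration, not a standard recipe.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so testing UTF-16 first would misfire.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def guess_decode(data, fallbacks=("utf-8", "cp1252", "latin-1")):
    """Decode bytes using the BOM if there is one, else try each fallback.

    The fallback order is an assumption; pick one that suits your files.
    latin-1 accepts any byte sequence, so it only proves the bytes were
    readable, not that the guess was right.
    """
    name = sniff_bom(data)
    if name is not None:
        return data.decode(name), name
    for name in fallbacks:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

You would read the file in binary mode, open(filename, "rb").read(), and pass the bytes to guess_decode(). A missing BOM tells you nothing, since plenty of utf-8 files are written without one.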

There are some things you can do to check your results (such as spell checking the decoded text), and you can check the characters against the Unicode mappings to see if the sequences look sane (for example, a lot of mixed alphabets, like Arabic, Greek and Latin, suggests you guessed wrong!). But none of it is really reliable.
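
The mixed-alphabet check can be approximated with the standard unicodedata module. This is only a heuristic sketch: the function name scripts_used is made up for the example, and taking the first word of each character's Unicode name is a crude stand-in for a proper script lookup.

```python
import unicodedata


def scripts_used(text):
    """Return a rough set of scripts found among the letters in text.

    Uses the first word of each letter's Unicode name (LATIN, GREEK,
    ARABIC, ...), which is an approximation rather than the real
    Unicode Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


# utf-8 bytes decoded with the wrong (Greek) code page mix two alphabets.
wrong = "caf\u00e9".encode("utf-8").decode("cp1253")
print(scripts_used(wrong))
```

More than one script in a short piece of text is a hint that the decode went wrong, but genuinely multilingual text will trip it too, so treat it as a warning sign and nothing more.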

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


