On 23/04/15 19:14, Jim Mooney wrote:
>> By relying on the default when you read it, you're making an unspoken
>> assumption about the encoding of the file.
>
> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness?

Pretty much guesswork.
The move from plain old ASCII to Unicode (and others) has made the
handling of text much more like binary. You have to know the binary
format/encoding to know how to decode binary data. It's the same with
text: if you don't know what produced it, and in what format, then you
have to guess.
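
The one case where you are not purely guessing is when the file starts
with a BOM. Here is a rough sketch of how that check might look, assuming
Python 3 and bytes read with open(path, "rb"). The names sniff_bom and
guess_decode, and the list of fallback encodings, are made up for
illustration; the fallback order is itself just a guess.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def guess_decode(data, fallbacks=("utf-8", "cp1252", "latin-1")):
    """Decode bytes using the BOM if there is one, else try fallbacks in order.

    Returns (text, encoding_used). The fallbacks are an assumption, not
    a fact about the file: the first codec that decodes without error
    wins, which does not prove it is the right one.
    """
    encoding = sniff_bom(data)
    if encoding:
        return data.decode(encoding), encoding
    for encoding in fallbacks:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the fallback encodings could decode the data")
```

Note that latin-1 can decode any byte sequence at all, so a successful
decode with it tells you nothing about whether the guess was right.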
There are some things you can do to check your results (such as
spell-checking them), and you can check the characters against the
Unicode mappings to see if the sequences look sane. (For example, a
lot of mixed alphabets - like Arabic, Greek and Latin - suggests you
guessed wrong!) But none of it is really reliable.
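
The mixed-alphabet check could be sketched roughly like this, again
assuming Python 3. It takes the first word of each letter's Unicode
name (LATIN, GREEK, ARABIC...) as a crude stand-in for its script. The
function names and the 0.2 threshold are invented for illustration, and
plenty of wrong decodings stay within a single alphabet, so a clean
result proves nothing.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode character name.

    The first word (e.g. LATIN, GREEK, ARABIC) is only an approximation
    of the script, used here to keep the sketch short.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def looks_mixed(text, threshold=0.2):
    """Return True if too many letters fall outside the dominant group.

    threshold is an arbitrary cut-off chosen for this example.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    dominant = counts.most_common(1)[0][1]
    return (total - dominant) / total > threshold
```

You would call looks_mixed() on the decoded text and treat a True result
as a hint to try a different encoding, nothing more.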
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos