On 04/23/2015 02:14 PM, Jim Mooney wrote:
>> By relying on the default when you read it, you're making an unspoken
>> assumption about the encoding of the file.
>> --
>> DaveA
>
> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness? I was going to use encoding =
> utf-8 as a suggested default. I noticed it got rid of the bom symbols but
> left an extra blank space at the beginning of the stream. Most books leave
> unicode to the very end, if they mention the BOM at all (mine is at page
> 977, which is still a bit off ;')

That's not a regular blank. See the link I mentioned before, and the
following sentence:
""" Unfortunately the character U+FEFF had a second purpose as a ZERO
WIDTH NO-BREAK SPACE: a character that has no width and doesn’t allow a
word to be split. It can e.g. be used to give hints to a ligature
algorithm. """
To automatically get rid of that BOM character when reading a file, you
use utf-8-sig, rather than utf-8. And on writing, since you probably
don't want it, use utf-8.
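
A minimal sketch of the difference, assuming Python 3 and a throwaway temp
file (the sample text and variable names are only for illustration):

```python
import os
import tempfile

text = "hello"

# Create a file that starts with a UTF-8 BOM, as some editors write it.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

# Plain utf-8 keeps the BOM as the character U+FEFF at the front.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig drops the BOM if it is there, and does nothing if it is not.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom))     # '\ufeffhello'
print(repr(without_bom))  # 'hello'
```

That leading '\ufeff' is the "extra blank" you saw when reading with utf-8.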
As for guessing what encoding was used, the best approach is to ask the
person/program that wrote the file. Or read the specs. And once you
figure it out, fix the specs.
With a short sample, you're unlikely to guess right. That's because
ASCII text looks the same in all the byte-oriented encodings. (Not in the
various *16* and *32* formats, which use at least 2 or 4 bytes per
character.) If you encounter one of those, you'll probably see lots of
null bytes mixed in a consistent pattern.
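
You can see both effects at the interactive prompt. This is only a rough
illustration; the particular encodings I picked are just a few common ones:

```python
sample = "hello"

# The byte-oriented encodings produce identical bytes for pure ASCII text.
byte_encodings = ["ascii", "utf-8", "latin-1", "cp1252"]
encoded = {name: sample.encode(name) for name in byte_encodings}
all_same = len(set(encoded.values())) == 1

# UTF-16 and UTF-32 pad each ASCII character with null bytes.
utf16 = sample.encode("utf-16-le")
utf32 = sample.encode("utf-32-le")

print(all_same)                        # True
print(utf16)                           # b'h\x00e\x00l\x00l\x00o\x00'
print(utf16.count(0), utf32.count(0))  # 5 15
```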
Consider the 'file' command in Linux. I don't know of any Windows
equivalent.
If you want to write your own utility, perhaps to scan hundreds of
files, consider:
http://pypi.python.org/pypi/chardet
http://linux.die.net/man/3/libmagic
https://github.com/ahupp/python-magic
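
If all you want is to check for a BOM before reaching for one of those,
the standard codecs module has the byte sequences as constants. Here is a
rough sketch; sniff_bom and _BOMS are names I made up, and it tells you
nothing about files that have no BOM:

```python
import codecs

# Longer BOMs go first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom(b"hi"))                     # None
```

The names it returns are codecs that consume the BOM when decoding, so you
can pass the result straight to open() or bytes.decode(). A return of None
only means "no BOM", so you're back to asking whoever wrote the file.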
--
DaveA
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor