On 04/23/2015 02:14 PM, Jim Mooney wrote:

By relying on the default when you read it, you're making an unspoken
assumption about the encoding of the file.

--
DaveA


So is there any way to sniff the encoding, including the BOM (which appears
to be used or not used randomly for utf-8), so you can then use the proper
encoding, or do you wander in the wilderness? I was going to use encoding =
utf-8 as a suggested default. I noticed it got rid of the BOM symbols but
left an extra blank space at the beginning of the stream. Most books leave
Unicode to the very end, if they mention the BOM at all (mine is at page
977, which is still a bit off ;')


That's not a regular blank. See the link I mentioned before, and the following sentence:

""" Unfortunately the character U+FEFF had a second purpose as a ZERO WIDTH NO-BREAK SPACE: a character that has no width and doesn’t allow a word to be split. It can e.g. be used to give hints to a ligature algorithm. """

To automatically get rid of that BOM character when reading a file, you use utf-8-sig, rather than utf-8. And on writing, since you probably don't want it, use utf-8.
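As a minimal sketch of that difference (the file name and contents here are made up for illustration), reading the same bytes with each codec shows what happens to the BOM:

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 bytes with a leading BOM (EF BB BF).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfhello\n")

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig strips a leading BOM if present, and is harmless if it is absent.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom))     # '\ufeffhello\n'
print(repr(without_bom))  # 'hello\n'

# Writing with plain utf-8 does not add a BOM.
with open(path, "w", encoding="utf-8") as f:
    f.write(without_bom)
```

The "extra blank" seen with plain utf-8 is that U+FEFF character, which is why repr() is more useful than print() when checking for it.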

As for guessing what encoding was used, the best approach is to ask the person/program that wrote the file. Or read the specs. And once you figure it out, fix the specs.

With a short sample, you're unlikely to guess right. That's because ASCII text looks the same in all the byte-encoded formats. (Not in the various *16* and *32* formats, as they use 2 bytes or 4 bytes each.) If you encounter one of those, you'll probably see lots of null bytes mixed in a consistent pattern.
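A rough sketch of the two cheap checks that are possible on raw bytes: look for a BOM, and measure how many null bytes there are. The function names sniff_bom and null_byte_ratio are invented for this example, and neither check proves anything about a file that has no BOM and no nulls.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def null_byte_ratio(raw):
    """Fraction of bytes that are zero; high values hint at UTF-16/UTF-32."""
    return raw.count(b"\x00") / len(raw) if raw else 0.0


print(sniff_bom(b"\xef\xbb\xbfhello"))             # utf-8-sig
print(sniff_bom(b"hello"))                         # None
print(null_byte_ratio("hello".encode("utf-16-le")))  # 0.5
```

The codec names returned here ("utf-16", "utf-32", "utf-8-sig") are the ones that consume the BOM when decoding, so the result can be passed straight to open() or bytes.decode().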


Consider the 'file' command in Linux. I don't know of any Windows equivalent.

If you want to write your own utility, perhaps to scan hundreds of files, consider:

  http://pypi.python.org/pypi/chardet
  http://linux.die.net/man/3/libmagic
  https://github.com/ahupp/python-magic
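For the chardet route, a scan might look roughly like the sketch below. It assumes chardet has been installed from PyPI; guess_encoding and the sample_size default are names made up for this example, and the result is only a statistical guess with a confidence score, not a fact about the file.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(path, sample_size=65536):
    """Return (encoding, confidence) guessed from the start of the file.

    Returns (None, 0.0) if chardet is not installed.
    """
    with open(path, "rb") as f:
        raw = f.read(sample_size)
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    import sys

    # Usage: python guess.py file1 file2 ...
    for name in sys.argv[1:]:
        encoding, confidence = guess_encoding(name)
        print(name, encoding, confidence)
```

Reading only a sample keeps the scan fast over hundreds of files, at the cost of missing non-ASCII bytes that appear later in a file.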

--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
