The quoting seems to be all mangled here, so please excuse me if I misattribute quotes to the wrong person:
On Thu, Apr 23, 2015 at 04:15:39PM -0700, Jim Mooney wrote:

> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness?
>
> Pretty much guesswork.

There is no foolproof way to guess encodings, since you might happen to have text which *looks* like it starts with a BOM but actually doesn't. E.g. if you have Latin-1 text which happens to start with the two characters þÿ, the underlying bytes are 0xFE 0xFF, which is exactly the UTF-16 big-endian BOM, even though the author really did mean þÿ.

This is no different from any other file format. All files are made from bytes, and there is no such thing as "JPEG bytes" and "TXT bytes" and "MP3 bytes"; they're all the same bytes. In principle, you could have a file which was a valid JPEG, ZIP and WAV file *at the same time* (although not likely by chance). And if not those specific three, then pick some other combination.

*But*, while it is true that in principle you cannot guess the encoding of files, in practice you often can guess quite successfully. Checking for a BOM is easy. For example, you could use these two functions:

def guess_encoding_from_bom(filename, default='undefined'):
    # Four bytes is enough to hold the longest BOM (UTF-32's).
    with open(filename, 'rb') as f:
        sig = f.read(4)
    return bom2enc(sig, default)

# Untested.
def bom2enc(bom, default=None):
    # Check UTF-32 before UTF-16, since the UTF-32-LE BOM
    # b'\xFF\xFE\x00\x00' starts with the UTF-16-LE BOM b'\xFF\xFE'.
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif (bom.startswith(b'\x2B\x2F\x76') and len(bom) == 4
          and bom[3] in b'\x2B\x2F\x38\x39'):
        # A UTF-7 BOM is b'+/v' followed by one of + / 8 9.
        return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    elif default is None:
        raise ValueError('no BOM found')
    else:
        return default

If there is a BOM, chances are very good that it correctly identifies the encoding. If there isn't, you can try decoding, and if that fails, try the next candidate (wrapped in a function here, since the loop uses return):

def read_text(filename):
    # Try the likely encodings in order, and return the first
    # successful decoding.
    for encoding in ('utf-8', 'utf-16le', 'utf-16be'):
        try:
            with open(filename, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    # UnicodeDecodeError itself needs five constructor arguments,
    # so raise the more general UnicodeError instead.
    raise UnicodeError("I give up!")

You can use chardet to guess the encoding. That's a Python port of the algorithm used by Firefox to guess the encoding of webpages when the declaration is missing:

https://pypi.python.org/pypi/chardet

chardet works by statistically analysing the characters in the text and tries to pick an encoding that minimizes the number of gremlins. (There's a short sketch of what that looks like further below, after the character set discussion.)

Or you can try fixing errors after they occur:

http://blog.luminoso.com/2012/08/20/fix-unicode-mistakes-with-python/

> This all sounds suspiciously like the old browser wars I suffered while
> webmastering. I'd almost think Microsoft had a hand in it ;')

Ha! In a way they did, because the Microsoft Windows code pages are legacy encodings. But given that there were no alternatives back in the 1980s, I can't blame Microsoft for the mess of the past. (Besides, MS are one of the major sponsors of Unicode, so they are helping to fix the problem too.)

> If utf-8 can
> handle a million characters, why isn't it just the standard? I doubt we'd
> need more unless the Martians land.

UTF-8 is an encoding, not a character set. The character set tells us what characters we can use: D ñ Ƕ λ Ъ ᛓ ᾩ ‰ ℕ ℜ ↣ ⊂ are all legal in Unicode, but no single legacy character set (ASCII, Latin-1, ISO-8859-7, BIG-5, or any of the dozens of others) can represent them all. The encoding tells us how to turn a character like A or λ into one or more bytes to be stored on disk, or transmitted over a network.
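For instance, here is the one character λ as bytes under a few different encodings (the byte values are standard, you can check them at any Python prompt):

>>> ch = '\u03bb'   # GREEK SMALL LETTER LAMDA: λ
>>> ch.encode('utf-8')
b'\xce\xbb'
>>> ch.encode('utf-16-le')
b'\xbb\x03'
>>> ch.encode('utf-32-le')
b'\xbb\x03\x00\x00'
>>> ch.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u03bb' in position 0: ordinal not in range(128)

Same character, four different answers at the byte level: that's the encoding's job, not the character set's.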
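And here, roughly, is the chardet sketch promised above. Treat it as untested; the file name is made up, but detect() is the documented entry point:

import chardet  # pip install chardet

# chardet works on raw bytes, not decoded text.
with open('mystery.txt', 'rb') as f:   # hypothetical file name
    raw = f.read()

guess = chardet.detect(raw)
# guess is a dict along the lines of
# {'encoding': 'utf-8', 'confidence': 0.99, ...}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])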
In the non-Unicode legacy encodings, we often equate the character set with the encoding (or codec), e.g. we say that ASCII A *is* byte 65 (decimal) or 41 (hex). But that's sloppy language. The ASCII character set includes the character A (but not λ), while the encoding tells us that the character A gets stored as byte 41 (hex). To the best of my knowledge, Unicode is the first character set which allows for more than one encoding.

Anyway, Unicode is *the* single most common character set these days, more common than even ASCII and Latin-1. About 85% of webpages use UTF-8.

> Since I am reading opinions that the BOM doesn't even belong in utf-8, can
> I assume just using utf-8-sig as my default encoding, even on a non-BOM
> file, would do no harm?

Apparently so. It looks like utf-8-sig just strips the signature if it is present, and decodes as ordinary UTF-8 whether the signature is there or not: both b'\xef\xbb\xbfA' and b'A' decode to 'A' under it. That surprises me.

-- 
Steve

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor