On Fri, 13 Aug 2010 18:25:46 -0400, Terry Reedy wrote:

> A short background to MRAB's answer which I will try to get right.
>
> The byte-order-mark was invented for UTF-16 encodings so the reader
> could determine whether the pairs of bytes are in little or big endian
> order, depending on whether the first two bytes are fe and ff or ff
> and fe (or maybe vice versa, does not matter here). The concept is
> meaningless for utf-8, which consists only of bytes in a defined
> order. This is part of the Unicode standard.
>
> However, Microsoft (or whoever) re-purposed (hijacked) that pair of
> bytes to serve as a non-standard indicator of utf-8 versus any
> non-unicode encoding. The result is a corrupted utf-8 stream that
> Python accommodates with the utf-8-sig(nature) codec (versus the
> standard utf-8 codec).
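As a quick illustration of that last point, here is what the two codecs
do with a BOM-prefixed byte string (a Python 3 session; Python 2
behaves the same apart from the u'' prefix):

    >>> import codecs
    >>> data = codecs.BOM_UTF8 + b'hello'
    >>> data.decode('utf-8')      # standard codec keeps the BOM as U+FEFF
    '\ufeffhello'
    >>> data.decode('utf-8-sig')  # signature codec strips it
    'hello'
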
Is there a standard way to autodetect the encoding of a text file? I do
this:

Open the file in binary mode; if the first three bytes are
codecs.BOM_UTF8, then it's a Microsoft UTF-8 text file; otherwise, if
the first two bytes are codecs.BOM_BE or codecs.BOM_LE, the encoding is
utf-16-be or utf-16-le respectively.

(I don't bother to check for other BOMs, such as the ones for utf-32.
There are *lots* of them, but in my experience those encodings are
rarely used, so I don't bother to support them.)

If there's no BOM, then re-open the file and read the first two lines.
If either of them matches the regex 'coding[=:]\s*([-\w.]+)', I take
the encoding name from that. This matches Python's behaviour, and
supports Emacs and Vim encoding declarations.

Otherwise, there is no declared encoding, and I use whatever encoding I
like (whatever was specified by the user or the application default).

-- 
Steven
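Putting that recipe into code, a minimal sketch might look like the
following (guess_encoding and its default parameter are my own names,
nothing standard; Python 3 syntax, and it only handles the BOMs
discussed above):

    import codecs
    import re

    # PEP 263 / Emacs / Vim style declaration, e.g. "# -*- coding: utf-8 -*-"
    CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')

    def guess_encoding(path, default='utf-8'):
        with open(path, 'rb') as f:
            start = f.read(3)
        if start.startswith(codecs.BOM_UTF8):
            # utf-8-sig transparently strips the Microsoft BOM on reading
            return 'utf-8-sig'
        if start.startswith(codecs.BOM_UTF16_BE):  # codecs.BOM_BE is an alias
            return 'utf-16-be'
        if start.startswith(codecs.BOM_UTF16_LE):  # codecs.BOM_LE is an alias
            return 'utf-16-le'
        # No BOM: look for a coding declaration in the first two lines.
        # latin-1 maps every byte to a character, so decoding cannot fail.
        with open(path, 'rb') as f:
            for line in (f.readline(), f.readline()):
                match = CODING_RE.search(line.decode('latin-1'))
                if match:
                    return match.group(1)
        return default

A caller can then simply do open(path, encoding=guess_encoding(path)).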