On Sat, 17 Nov 2012 08:56:46 -0800, buck wrote:

>> Given that the only differences between the two are for code points
>> which are in the C1 range (0x80-0x9F), which should never occur in HTML,
>> parsing ISO-8859-1 as Windows-1252 should be harmless.
>
> "should" is a wish. The reality is that documents (and especially URLs)
> exist that can be decoded with latin1, but will backtrace with cp1252.

In which case, they're probably neither ISO-8859-1 nor Windows-1252, but some other (unknown) encoding which has acquired the ISO-8859-1 label "by default". In that situation, if you still need to know the encoding, you need to resort to heuristics such as those employed by the chardet library.
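
To make the disagreement concrete, here is a minimal sketch using only the standard library. It relies on the fact that Python's cp1252 codec leaves five bytes in the C1 range unmapped (0x81, 0x8D, 0x8F, 0x90, 0x9D), whereas latin-1 accepts every byte. The helper names `undefined_in_cp1252` and `decode_lenient` are made up for illustration, and the "try cp1252, fall back to latin-1" policy is only one possible choice, not something the thread prescribes.

```python
# Bytes with no mapping in Python's cp1252 codec; decoding any of them
# raises UnicodeDecodeError, while latin-1 maps every byte 0x00-0xFF.
CP1252_UNDEFINED = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def undefined_in_cp1252(data):
    """Return the offsets of bytes that cp1252 cannot decode (hypothetical helper)."""
    return [i for i, b in enumerate(data) if b in CP1252_UNDEFINED]


def decode_lenient(data):
    """Try cp1252 first and fall back to latin-1 (hypothetical helper).

    Returns a (text, codec_name) pair. A fallback to latin-1 is a hint
    that the data may not really be in either encoding.
    """
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for sample in (b"caf\xe9", b"\x93quoted\x94", b"a\x81b"):
        text, codec = decode_lenient(sample)
        print(repr(sample), codec, repr(text), undefined_in_cp1252(sample))
```

When the fallback triggers, the latin-1 result contains C1 control characters, which is usually a sign that the label is wrong rather than that the text is correct. At that point a detector such as chardet (a third-party package, not used in the sketch above) can supply a guess, with the usual caveat that it is a heuristic.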