On Sat, 17 Nov 2012 08:56:46 -0800, buck wrote:

>> Given that the only differences between the two are for code points
>> which are in the C1 range (0x80-0x9F), which should never occur in HTML,
>> parsing ISO-8859-1 as Windows-1252 should be harmless.
> 
> "should" is a wish. The reality is that documents (and especially URLs)
> exist that can be decoded with latin1, but will backtrace with cp1252.

In which case, they're probably neither ISO-8859-1 nor Windows-1252, but
some other (unknown) encoding which has acquired the ISO-8859-1 label
"by default".
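
For what it's worth, the decoding failures are easy to reproduce. The
sketch below assumes Python 3 and the codecs bundled with CPython; the
variable names are only illustrative. It lists the byte values that the
cp1252 codec rejects (in CPython these are 0x81, 0x8D, 0x8F, 0x90 and
0x9D) and checks that latin-1 accepts all 256 byte values.

```python
# Collect the byte values that Python's cp1252 codec refuses to decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

# latin-1 maps every byte to a code point, so decoding never fails.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")

print([hex(v) for v in undefined_in_cp1252])
print(len(latin1_text))
```

A document containing any of those bytes decodes under latin-1 (to C1
control characters) but raises UnicodeDecodeError under cp1252.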

In that situation, if you still need to know the encoding, you need to
resort to heuristics such as those employed by the chardet library.
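
A minimal sketch of that approach, assuming Python 3 and that the
third-party chardet package may or may not be installed. The function
name guess_decode and the order of the steps are my own choices for
illustration, not anything chardet prescribes; the only chardet call
used is chardet.detect(), which returns a dict containing an 'encoding'
key (possibly None).

```python
def guess_decode(data, fallback="latin-1"):
    """Decode bytes of unknown encoding (hypothetical helper)."""
    # Strict UTF-8 rarely succeeds by accident, so try it first.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # chardet is third-party; skip the heuristic step if it is missing.
    try:
        import chardet
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(data)  # e.g. {'encoding': ..., 'confidence': ...}
        encoding = guess.get("encoding")
        if encoding:
            try:
                return data.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                # The guess was wrong or names a codec Python lacks.
                pass

    # latin-1 accepts every byte value, so this last step cannot fail.
    return data.decode(fallback)
```

The final latin-1 step is a last resort only: it always yields a string,
but that says nothing about whether the guess was right.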

-- 
http://mail.python.org/mailman/listinfo/python-list