On Friday, November 16, 2012 4:33:14 PM UTC-8, Nobody wrote:
> On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:
> IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
> successful and now we have to deal with it. If HTML content is tagged as
> using ISO-8859-1, it's more likely that it's actually Windows-1252 content
> generated by someone who doesn't know the difference.

Yes, that's exactly what it says.

> Given that the only differences between the two are for code points which
> are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
> ISO-8859-1 as Windows-1252 should be harmless.

"should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will raise a traceback with cp1252. I see this as a sign that a small refactoring of cp1252 is in order.

The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This is in line with the Unicode standard, which says:

http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There are 65 code points set aside in the Unicode Standard for compatibility
> with the C0 and C1 control codes defined in the ISO/IEC 2022 framework. The
> ranges of these code points are U+0000..U+001F, U+007F, and U+0080..U+009F,
> which correspond to the 8-bit controls 0x00 to 0x1F (C0 controls), 0x7F
> (delete), and 0x80 to 0x9F (C1 controls), respectively ... There is a simple,
> one-to-one mapping between 7-bit (and 8-bit) control codes and the Unicode
> control codes: every 7-bit (or 8-bit) control code is numerically equal to
> its corresponding Unicode code point.

IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (see 6.2 of ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf).
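
For anyone who wants to reproduce the latin1/cp1252 difference, here is a minimal sketch. It first lists the bytes that the stock cp1252 codec refuses to decode, then shows one possible workaround: an error handler that maps each undecodable byte to the code point of equal value, which is the behaviour the proposal asks for. The names `c1_passthrough` and `decode_cp1252_lenient` are my own and hypothetical; this is not how the codec itself would be patched, only an illustration of the intended result.

```python
import codecs


def undecodable_bytes(encoding):
    """Return the single-byte values that `encoding` cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


def c1_passthrough(exc):
    """Hypothetical error handler: decode each offending byte to the
    code point of equal value (0x81 -> U+0081), as latin1 does."""
    if isinstance(exc, UnicodeDecodeError):
        offending = exc.object[exc.start:exc.end]
        return "".join(chr(b) for b in offending), exc.end
    raise exc


# Register under an assumed name; any unused name would do.
codecs.register_error("c1_passthrough", c1_passthrough)


def decode_cp1252_lenient(data):
    """Decode as cp1252, falling back to pass-through for undefined bytes."""
    return data.decode("cp1252", errors="c1_passthrough")


if __name__ == "__main__":
    # latin1 accepts every byte; cp1252 rejects a handful in the C1 range.
    print([hex(b) for b in undecodable_bytes("latin-1")])
    print([hex(b) for b in undecodable_bytes("cp1252")])
    # Defined cp1252 bytes keep their cp1252 meaning; undefined ones pass through.
    print(repr(decode_cp1252_lenient(b"\x80 \x81")))
```

Note that the handler only fires for bytes the codec rejects, so bytes such as 0x80 still decode to their cp1252 characters rather than to C1 controls.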