On Sat, Nov 17, 2012 at 9:56 AM, <b...@yelp.com> wrote:
> "should" is a wish. The reality is that documents (and especially URLs) exist
> that can be decoded with latin1, but will backtrace with cp1252. I see this
> as a sign that a small refactorization of cp1252 is in order. The proposal is
> to change those "UNDEFINED" entries to "<control>" entries, as is done here:
>
> http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
>
> and here:
>
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the other
files. Examples of best fit are converting fullwidth letters to their
counterparts when converting to single byte code pages, and mapping the
Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized conversion
scheme. It is also noted at the following address that the "BestFit" document
is not authoritative:

http://www.iana.org/assignments/charset-reg/windows-1252

> This is in line with the unicode standard, which says:
> http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
>> There are 65 code points set aside in the Unicode Standard for compatibility
>> with the C0 and C1 control codes defined in the ISO/IEC 2022 framework. The
>> ranges of these code points are U+0000..U+001F, U+007F, and U+0080..U+009F,
>> which correspond to the 8-bit controls 0x00 to 0x1F (C0 controls), 0x7F
>> (delete), and 0x80 to 0x9F (C1 controls), respectively ... There is a
>> simple, one-to-one mapping between 7-bit (and 8-bit) control codes and the
>> Unicode control codes: every 7-bit (or 8-bit) control code is numerically
>> equal to its corresponding Unicode code point.
>
> IOW: Bytes with undefined semantics in the C0/C1 range are "control codes",
> which decode to the unicode-point of equal value.
>
> This is exactly the section which allows latin1 to decode 0x81 to U+81, even
> though ISO-8859-1 explicitly does not define semantics for that byte (6.2
> ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to the control codes for those characters.
CP-1252 does not; the reason those characters are left undefined is to allow
for future expansion, such as when Microsoft added the Euro sign at 0x80.
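
To make the difference concrete, here is a minimal sketch (Python 3 syntax, standard codecs only) of how the two codecs treat bytes in the 0x80..0x9F range. The helper names try_decode and undefined_bytes are made up for this illustration and are not part of any library.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def undefined_bytes(encoding):
    """List every single-byte value that the codec refuses to decode."""
    return [value for value in range(256)
            if try_decode(bytes([value]), encoding) is None]


# Latin-1 maps every byte to the code point of equal value, controls included.
latin1_result = try_decode(b"\x81", "latin-1")

# CP-1252 assigns 0x80 to the Euro sign but leaves 0x81 unassigned.
cp1252_euro = try_decode(b"\x80", "cp1252")
cp1252_undefined = try_decode(b"\x81", "cp1252")

print(repr(latin1_result), repr(cp1252_euro), repr(cp1252_undefined))
print([hex(value) for value in undefined_bytes("cp1252")])
```

If the goal is simply to avoid the exception on documents that contain such bytes, passing errors="replace" to decode() is one option that does not require changing the codec itself.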

Since we're talking about conversion from bytes to Unicode, I think the most
authoritative source we could possibly reference would be the official ISO
10646 conversion tables for the character sets in question. I understand
those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas the
cp1252 mapping leaves those five codes undefined. This would seem to indicate
that Python is correctly decoding CP-1252 according to the Unicode standard.

--
http://mail.python.org/mailman/listinfo/python-list