latin1 and cp1252 inconsistent?

buck Fri, 16 Nov 2012 13:48:12 -0800

Latin1 has a block of 32 undefined characters (0x80 through 0x9F).
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five 
undefined: 0x81, 0x8D, 0x8F, 0x90, and 0x9D.
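
A quick way to check which bytes are affected is to try decoding each byte of that block in turn. This is only a sketch and assumes Python 3 (`bytes([b])` behaves differently on Python 2); the name `undefined` is just for illustration.

```python
# Collect every byte in 0x80-0x9F that the cp1252 codec refuses to decode.
undefined = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)

print([hex(b) for b in undefined])
```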


The byte 0x81 decoded with latin1 gives the Unicode code point U+0081.
Decoding the same byte with windows-1252 raises 
`UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: 
character maps to <undefined>`
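
A minimal reproduction of the two behaviours, assuming Python 3. The variable names (`raw`, `latin1_result`, `cp1252_error`) are made up for this example.

```python
raw = b"\x81"

# latin1 maps every byte to the code point with the same value.
latin1_result = raw.decode("latin1")
print(repr(latin1_result))

# cp1252 has no mapping for this byte, so decoding raises.
cp1252_error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)
print(cp1252_error)
```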

This seems inconsistent to me, given that this byte is equally undefined in the 
two standards.

Also, the HTML5 standard says:

When a user agent [browser] would otherwise use a character encoding given in 
the first column [ISO-8859-1, aka latin1] of the following table to either 
convert content to Unicode characters or convert Unicode characters to bytes, 
it must instead use the encoding given in the cell in the second column of the 
same row [windows-1252, aka cp1252].

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0


The current implementation of windows-1252 isn't usable for this purpose (as a 
replacement for latin1), since it raises an error in cases where latin1 would 
succeed.
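
One possible workaround, sketched here under the assumption of Python 3, is to register a custom error handler that decodes any byte cp1252 rejects as latin1 instead. The handler name `latin1_fallback` and the helper `decode_web_latin1` are hypothetical names invented for this example, not part of the standard library; only `codecs.register_error` is.

```python
import codecs

def latin1_fallback(exc):
    # Hypothetical handler: decode the offending bytes as latin1 and
    # resume decoding right after them.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return bad.decode("latin1"), exc.end
    raise exc

codecs.register_error("latin1_fallback", latin1_fallback)

def decode_web_latin1(data):
    # cp1252 where it is defined, latin1 for the five undefined bytes.
    return data.decode("cp1252", errors="latin1_fallback")

print(repr(decode_web_latin1(b"\x80\x81abc\x9d")))
```

With this, 0x80 still decodes to the euro sign as cp1252 specifies, while 0x81 and 0x9D pass through as U+0081 and U+009D rather than raising.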
