I don't understand the behavior of the interpreter in Python 3.0. I am
working at a command prompt in Windows (US English), which has a terminal
encoding of cp437.
In Python 2.5:
Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x=u'\u5000'
>>> x
u'\u5000'
In Python 3.0:
Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x='\u5000'
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\dev\python30\lib\io.py", line 1486, in write
    b = encoder.encode(s)
  File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u5000' in position 1: character maps to <undefined>
I would have expected:
>>> x
'\u5000'
Shouldn't a repr() of x work regardless of output encoding? Another test:
Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes(range(256)).decode('cp437')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14
\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABC
DEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7fÇüéâäàåçêëèïîìÄÅ
ÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\xa0'
>>> bytes(range(256)).decode('cp437')[255]
'\xa0'
Characters that cannot be displayed in cp437, such as 0x00-0x1F, 0x7F, and
0xA0, are escaped. Even when I decode a value with the wrong codec, the
character is displayed as long as it exists in cp437:
>>> bytes(range(256)).decode('latin-1')[255]
'ÿ'
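However, when the wrong codec produces a character that cp437 does not
support, I get an exception instead of an escape. Byte 254 decoded as cp437
displays fine, but the same byte decoded as latin-1 fails: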
>>> bytes(range(256)).decode('cp437')[254]
'■'
>>> bytes(range(256)).decode('latin-1')[254]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\dev\python30\lib\io.py", line 1486, in write
    b = encoder.encode(s)
  File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>
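To rule out my terminal itself, the failure can be reproduced on any machine by encoding the repr text by hand. This is only a sketch: console_bytes and try_console are hypothetical helpers of mine, and I am assuming the console stream does roughly the equivalent of .encode('cp437') with strict error handling, as the traceback suggests.

```python
def console_bytes(value, encoding="cp437", errors="strict"):
    """Encode repr(value) roughly the way a console stream would (assumption)."""
    return repr(value).encode(encoding, errors)


def try_console(value, **kwargs):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return console_bytes(value, **kwargs)
    except UnicodeEncodeError as exc:
        return exc


# U+00FF exists in cp437, so its repr encodes cleanly.
print(try_console("\xff"))
# U+00FE and U+5000 have no cp437 mapping, so strict encoding fails.
print(try_console("\xfe"))
print(try_console("\u5000"))
# An escaping error handler keeps the text printable instead of raising.
print(try_console("\xfe", errors="backslashreplace"))
```

With strict handling the encode step raises the same kind of UnicodeEncodeError as the interactive prompt; with backslashreplace the same text comes out escaped.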
Why not display '\xfe' here? This inconsistency seems to make it difficult
to write things like doctests that don't depend on the tester's terminal.
It also makes it difficult to inspect variables without calling
hex(ord(n)) on each character. Maybe repr() should always display the ASCII
representation, with escapes for all other characters, especially
considering the "repr() should produce output suitable for eval() when
possible" rule.
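For comparison, here is a small sketch of the output I have in mind. I believe the ascii() builtin in Python 3.0 already produces this form, so I use it here as a stand-in for the repr() behavior I am asking about; the sample values are just the ones from my tests.

```python
samples = ["\u5000", "\xfe", "\xa0", "abc"]

for s in samples:
    text = ascii(s)           # ASCII-only repr, everything else escaped
    text.encode("ascii")      # cannot fail: the text is pure ASCII
    assert eval(text) == s    # and it still round-trips through eval()
    print(text)
```

Output in this form does not depend on the terminal encoding, so it would be safe to paste into a doctest.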
What are others' opinions? Any insight into this design decision?
-Mark