I don't understand the behavior of the interpreter in Python 3.0. I am working at a command prompt in Windows (US English), which has a terminal encoding of cp437.

In Python 2.5:

   Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> x=u'\u5000'
   >>> x
   u'\u5000'

In Python 3.0:

   Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> x='\u5000'
   >>> x
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "c:\dev\python30\lib\io.py", line 1486, in write
       b = encoder.encode(s)
     File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
       return codecs.charmap_encode(input,self.errors,encoding_map)[0]
   UnicodeEncodeError: 'charmap' codec can't encode character '\u5000' in position 1: character maps to <undefined>

Where I would have expected

   >>> x
   '\u5000'
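
The failure does not need a cp437 console to reproduce. Below is a minimal sketch, assuming the interactive prompt does roughly the equivalent of sys.stdout.write(repr(x)) on a text stream that encodes to the terminal encoding. The helper name write_repr is made up for illustration; only io.BytesIO and io.TextIOWrapper are real library calls.

```python
import io

def write_repr(value, encoding="cp437", errors="strict"):
    """Write repr(value) to an in-memory stream that encodes like a console.

    Hypothetical helper: it stands in for what the interactive prompt does
    when it echoes an expression on a terminal that uses `encoding`.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(repr(value))
    stream.flush()
    return raw.getvalue()

x = '\u5000'

try:
    write_repr(x)
except UnicodeEncodeError as exc:
    # Same kind of failure as at the prompt: U+5000 has no cp437 mapping.
    print("strict:", exc.reason)

# With a lenient error handler the write succeeds, and the character
# comes out as a backslash escape instead of raising.
print(write_repr(x, errors="backslashreplace"))
```

With errors="backslashreplace" the bytes written are the ASCII text '\u5000', which is the output I expected from the prompt.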

Shouldn't a repr() of x work regardless of output encoding?  Another test:

   Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   >>> bytes(range(256)).decode('cp437')
   '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7fÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\xa0'
   >>> bytes(range(256)).decode('cp437')[255]
   '\xa0'

Characters that cannot be displayed in cp437 are being escaped, such as 0x00-0x1F, 0x7F, and 0xA0. Even if I incorrectly decode a value, if the character exists in cp437, it is displayed:

   >>> bytes(range(256)).decode('latin-1')[255]
   'ÿ'

However, when an incorrectly decoded value yields a character that cp437 does not support, displaying it fails:

   >>> bytes(range(256)).decode('cp437')[254]
   '■'
   >>> bytes(range(256)).decode('latin-1')[254]
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "c:\dev\python30\lib\io.py", line 1486, in write
       b = encoder.encode(s)
     File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
       return codecs.charmap_encode(input,self.errors,encoding_map)[0]
   UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>

Why not display '\xfe' here? This inconsistency seems to make it difficult to write things like doctests that don't depend on the tester's terminal. It also makes it difficult to inspect variables without calling hex(ord(c)) on each character. Maybe repr() should always produce an ASCII representation, with escapes for all other characters, especially considering the "repr() should produce output suitable for eval() when possible" rule.
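
To make the suggestion concrete: Python 3 has a builtin ascii() that produces this escaped form, and sys.displayhook can be replaced so the prompt uses it. The following is a minimal sketch of that idea, not a claim about how the interpreter should be changed. The name ascii_displayhook is made up here, and the handling of `_` only approximates the default hook.

```python
import builtins
import sys

def ascii_displayhook(value):
    """Echo results at the prompt using ascii() instead of repr().

    Hypothetical replacement for sys.displayhook: non-ASCII characters are
    written as backslash escapes, so the output does not depend on the
    terminal encoding.
    """
    if value is None:
        return
    # Approximate the default hook: remember the last result in `_`.
    builtins._ = None
    sys.stdout.write(ascii(value) + "\n")
    builtins._ = value

sys.displayhook = ascii_displayhook

# ascii() output is pure ASCII and round-trips through eval().
x = '\u5000'
print(ascii(x))
assert eval(ascii(x)) == x
assert ascii('\xfe') == "'\\xfe'"
```

With this hook installed, evaluating x at the prompt prints '\u5000' on any terminal, at the cost of also escaping characters the terminal could have displayed.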

What are others' opinions?  Any insight to this design decision?

-Mark

