On 10/2/2010 7:00 PM, R. David Murray wrote:
> The clever hack (thanks ultimately to Martin) is to accept 8bit data
> by encoding it using the ASCII codec and the surrogateescape error
> handler.
I've seen this idea pop up in a number of threads. I worry that you are
all inventing a new kind of dual type that directly parallels Python 2.x
strings. That is to say:

3.x>>> b = b'\xc2\xa1'
3.x>>> s = b.decode('utf8')
3.x>>> v = b.decode('ascii', 'surrogateescape')

Here s and v should be the same "thing" in 3.x, but they are not, due to
an encoding trick. I believe this trick creates more or less the same
issues that strings did in 2.x:

2.x>>> b = '\xc2\xa1'
2.x>>> s = b.decode('utf8')
2.x>>> v = b

Any reasonable 2.x code has to guard on str vs. unicode, and it would
seem that in 3.x, if this idiom spreads, reasonable code will have to
guard on surrogate escapes instead (which actually seems like a more
expensive test; see the sketch below). As in:

3.x>>> print(v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 0: surrogates not allowed

It seems like this hack is about making the 3.x unicode type more like
the 2.x string type, and I thought we decided that was a bad idea. How
will developers not have to ask themselves whether a given string is a
"real" string or a byte sequence masquerading as a string?

Am I missing something here?

-- 
Scott Dial
sc...@scottdial.com
scod...@cs.indiana.edu
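
P.S. To make concrete what I mean by "guard on surrogate escapes", here
is a rough sketch; the helper name is my own invention, but the check
itself (scanning for the low-surrogate range U+DC80..U+DCFF that the
surrogateescape handler uses for smuggled bytes) is the kind of
per-character work that makes this a more expensive test than a 2.x
isinstance() check:

    def has_escaped_bytes(s):
        """Return True if s contains bytes smuggled in via surrogateescape."""
        # surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF
        return any('\udc80' <= ch <= '\udcff' for ch in s)

    b = b'\xc2\xa1'
    v = b.decode('ascii', 'surrogateescape')  # byte sequence masquerading as str
    s = b.decode('utf8')                      # a "real" string

    print(has_escaped_bytes(v))  # True
    print(has_escaped_bytes(s))  # False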