Martin v. Löwis:
> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.

Tried both -U and sys.setdefaultencoding("undefined") on a couple of my most used programs and saw a few library problems.

One program reads job advertisements from a mailing list, ranks them according to keywords, and displays them using unicode to ensure that HTML entities like • are displayed correctly. That program worked without changes.

The second program reads my spam-filled mailbox, removing messages that match a set of header criteria. It uses decode_header and make_header from the email.Header library module to convert each header from a set of encoded strings into a single unicode string. As email.Header is strongly concerned with unicode, I expected it would handle the two modifications well. With -U, there was one bug in my own code, which assumed that a string would be 8-bit; that was easily fixed.

In email.Charset, __init__ expects a non-unicode argument, as it immediately calls unicode(input_charset, 'ascii'), which fails when the argument is already unicode. This can be fixed explicitly in __init__, but I would argue for a more lenient approach, with unicode(u, enc, err) always ignoring the enc and err arguments when the input is already unicode.

Next, sre breaks when building a mapping array because array.array cannot take a unicode type code. This should probably be fixed in array rather than sre, as

    mapping = array.array('b'.encode('ascii'), mapping).tostring()

is too ugly.

The final issue was in encodings.idna, where there is

    ace_prefix = "xn--"
    uace_prefix = unicode(ace_prefix, "ascii")

which again could avoid breakage if unicode was more lenient.

With sys.setdefaultencoding("undefined"), there were more problems and they were harder to work around. One addition that could help would be a function similar to str, but with an optional encoding that would be used when the input failed to convert to a string because of a UnicodeError.

Something like:

    def stri(x, enc='us-ascii'):
        # Like str(), but fall back to encoding with enc on UnicodeError.
        try:
            return str(x)
        except UnicodeError:
            return unicode(x).encode(enc)

Neil
--
http://mail.python.org/mailman/listinfo/python-list
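
For illustration, the more lenient unicode() behaviour argued for above could be sketched roughly as a wrapper. This is only a sketch under assumptions: lenient_unicode and text_type are hypothetical names, not anything in the standard library, and the version check is there just so the same code runs on both Python 2 and Python 3.

```python
import sys

# Assumption: alias the text type so the sketch runs on Python 2 and 3.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # only evaluated on Python 2


def lenient_unicode(value, encoding='ascii', errors='strict'):
    """Hypothetical wrapper: decode byte strings, pass unicode through."""
    if isinstance(value, text_type):
        # Already unicode: ignore encoding and errors instead of raising.
        return value
    return value.decode(encoding, errors)


# Mirrors the encodings.idna case: works for a byte string ...
ace_prefix = b"xn--"
uace_prefix = lenient_unicode(ace_prefix, "ascii")
# ... and calling it again on the unicode result is harmless.
assert lenient_unicode(uace_prefix, "ascii") == uace_prefix
```

With something like this, code such as email.Charset.__init__ or encodings.idna would not need to care whether a literal arrived as a byte string or as unicode.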