Antoine Pitrou <pit...@free.fr> added the comment:

> Now one of the major goals of Python 2.6/2.7 is to allow the writing
> of code which ports smoothly to Python 3. Unicode support is a major
> issue here.

I understand the argument. But 2.7 is a bugfix branch and shouldn't
receive new features, even backports. If we wanted 2.x to converge
further into 3.x, we would do a 2.8, which we have decided not to do.

> I don't consider use of Unicode strings in Python 2.7 to be
> "accidental". In my experience with Python 2, pretty much everything
> already works with Unicode strings, and it's best practice to use
> them.

Not true. From the urllib module itself:

$ touch /tmp/hé
$ python -c 'import urllib; urllib.urlretrieve("file:///tmp/hé")'
$ python -c 'import urllib; urllib.urlretrieve(u"file:///tmp/hé")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.6/urllib.py", line 225, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/lib64/python2.6/urllib.py", line 1027, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'file:///tmp/h\xc3\xa9' contains non-ASCII characters

> Having functions in Python 2.7 which don't accept Unicode (or worse,
> raise random exceptions) runs against best practices for moving to
> Python 3.

There are lots of them, and urllib.quote() isn't an exception:

'x\x9c\xcbH\x04\x00\x013\x00\xca'
>>> zlib.compress(u"hà")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)

pwd.struct_passwd(pw_name='root', pw_passwd='x', pw_uid=0, pw_gid=0, pw_gecos='root', pw_dir='/root', pw_shell='/bin/bash')
>>> pwd.getpwnam(u"rooté")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)

> In fact, most code written to work with strings naturally works with
> Unicode because unicode strings support the same basic operations.

What should zlib compression of a unicode string result in?

> > The original issue is against robotparser, and clearly states a bug
> > (robotparser doesn't work in some cases).
>
> I don't know why this keeps coming back to robotparser. The original
> bug was not against robotparser; it is called "quote throws exception
> on Unicode URL" and that is the bug. Robotparser was just one
> demonstrative piece of code which failed because of it.

Well, there are two different concerns:
- robotparser fails on certain Web pages, which is a bug (unless the
  Web pages are clearly malformed)
- urllib.quote() should accept any kind of unicode strings, and perform
  appropriate encoding, with an ability to override default encoding
  parameters: this is a feature request

The OP himself (John Nagle) said:

“The problem is down inside a library module. "robotparser" is calling
"urllib.quote". One of those two library modules needs to be fixed.”

It seems to imply that the primary concern was robotparser not working.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1712522>
_______________________________________
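
The zlib question raised in the comment can be made concrete with a
minimal sketch. It is written for Python 3 rather than the Python 2
interpreter discussed in the thread, and it is only an illustration of
the point that byte-oriented functions need an explicit encoding choice,
not a proposed patch. The helper names compress_text and quote_text are
hypothetical and do not exist in any library; urllib.parse.quote is the
Python 3 counterpart of urllib.quote and accepts an encoding parameter.

```python
import zlib
from urllib.parse import quote


def compress_text(text, encoding="utf-8"):
    """Compress text after explicitly choosing a byte encoding.

    Hypothetical helper: zlib itself only works on bytes, so the
    caller has to decide which codec turns the text into bytes.
    """
    return zlib.compress(text.encode(encoding))


def quote_text(text, encoding="utf-8", safe="/"):
    """Percent-encode text using the given codec (hypothetical helper)."""
    return quote(text, safe=safe, encoding=encoding)


if __name__ == "__main__":
    # The same text compresses to different payloads under different codecs,
    # which is why there is no single obvious answer for unicode input.
    print(zlib.decompress(compress_text("hà", "utf-8")))
    print(zlib.decompress(compress_text("hà", "latin-1")))

    # Percent-encoding has the same ambiguity, hence the request for an
    # overridable default encoding.
    print(quote_text("/tmp/hé"))
    print(quote_text("/tmp/hé", encoding="latin-1"))
```

Defaulting both helpers to UTF-8 is an assumption made for the sketch;
the thread itself does not settle which default would be appropriate.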