[issue22231] httplib: unicode url will cause an ascii codec error when combined with a utf-8 string header

Demian Brecht Fri, 02 Jan 2015 15:23:31 -0800

Demian Brecht added the comment:

A few notes:


1. Unicode hosts are not automatically IDNA-encoded (which they /could/ be, 
rather than relying on the programmer to be aware of this), but this really has 
no bearing on this specific issue.
2. Unicode paths are not automatically IRI-encoded (see 
https://tools.ietf.org/html/rfc3987#section-3), which should also likely be 
handled automatically when unicode objects are encountered as the path.
3. When even a single unicode element is contained within a list, string_join 
will defer to PyUnicode_Join.
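
To illustrate note 1, here is a minimal sketch of what encoding a unicode host 
by hand looks like. It is written for Python 3 and uses the built-in "idna" 
codec (which implements IDNA 2003); the helper name encode_host is hypothetical 
and is not part of httplib.

```python
def encode_host(host):
    """Return the ASCII (IDNA) form of a host name as bytes.

    Hypothetical helper for illustration; not part of httplib.
    """
    # The built-in "idna" codec converts each non-ASCII label to its
    # "xn--" (Punycode) form and leaves plain ASCII labels unchanged.
    return host.encode("idna")


print(encode_host(u"bücher.example"))   # b'xn--bcher-kva.example'
print(encode_host(u"bugs.python.org"))  # b'bugs.python.org'
```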

The problem here is that your pre-joined request elements look like this: 
[u'POST http://bugs.python.org/any_url HTTP/1.1', 'Host: bugs.python.org', 
'Accept-Encoding: identity', 'Content-Length: 44', 'notes: 
\xe5\x91\xb5\xe5\x91\xb5', 'Content-type: application/x-www-form-urlencoded', 
'Accept: text/plain', '', '']

Because there's a unicode object contained in the list at index 0, the join 
promotes every str element to unicode by decoding it with the default ascii 
codec, which results in the error as soon as \xe5 is encountered.
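
A rough way to see this promotion rule in action: Python 3 refuses to mix bytes 
and str in a join, so the sketch below emulates the Python 2 behaviour by hand. 
py2_style_join is a hypothetical helper written for this illustration (bytes 
stands in for Python 2 str, str for Python 2 unicode); it is not the actual 
string_join implementation.

```python
def py2_style_join(sep, parts):
    """Emulate how Python 2's str.join treats mixed str/unicode items.

    Hypothetical helper: bytes stands in for Python 2 str, and
    str stands in for Python 2 unicode.
    """
    if not any(isinstance(p, str) for p in parts):
        # No unicode anywhere: a plain byte-string join.
        return sep.join(parts)
    # At least one unicode item: every byte string is decoded with the
    # ascii codec (Python 2's default encoding) before joining.
    text_parts = [p if isinstance(p, str) else p.decode("ascii") for p in parts]
    return sep.decode("ascii").join(text_parts)


request = [
    u"POST http://bugs.python.org/any_url HTTP/1.1",  # the unicode URL
    b"Host: bugs.python.org",
    b"notes: \xe5\x91\xb5\xe5\x91\xb5",  # utf-8 encoded header value
]

try:
    py2_style_join(b"\r\n", request)
except UnicodeDecodeError as exc:
    print("join failed:", exc)
```

With an all-bytes list the same call succeeds, which matches the report: the 
utf-8 header is only a problem once a unicode URL is in the list.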

The proposed solution won't work: unicode characters are legal in an IRI (see 
RFC 3987), and the proposed fix will fail as soon as the URL contains anything 
outside of the ascii character set.

I think that the correct way to solve this issue is to automatically encode 
unicode paths (or IRIs) using urllib.quote, passing the reserved characters 
defined in RFC 3987 as the safe parameter:

>>> urllib.quote(u'/foo/呵/bar'.encode('utf-8'),':/?#[]@!$&\'()*+,;=')
'/foo/%E5%91%B5/bar'
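
For comparison, a sketch of the same idea on Python 3, where the function lives 
in urllib.parse and encodes str input as UTF-8 by default. The helper name 
iri_path_to_uri_path and the RESERVED constant are hypothetical names chosen 
for this example, not existing httplib API.

```python
from urllib.parse import quote

# Reserved characters (gen-delims and sub-delims), same string as above.
RESERVED = ":/?#[]@!$&'()*+,;="


def iri_path_to_uri_path(path):
    """Percent-encode the non-ASCII parts of a path (hypothetical helper)."""
    # quote() encodes str input as UTF-8 by default and leaves every
    # character listed in `safe` untouched.
    return quote(path, safe=RESERVED)


print(iri_path_to_uri_path(u"/foo/呵/bar"))  # /foo/%E5%91%B5/bar
```

One caveat with this approach: '%' is not in the safe set, so a path that is 
already percent-encoded would be encoded a second time ('%20' becomes '%2520').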

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue22231>
_______________________________________