[issue3300] urllib.quote and unquote - Unicode issues

Bill Janssen Thu, 07 Aug 2008 18:34:20 -0700

Bill Janssen <[EMAIL PROTECTED]> added the comment:

Now I'm looking at the failing test_http_cookiejar test, which fails
because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI.
The question is, does the "http" URI scheme allow non-ASCII (say,
Latin-1) octets in path segments?  IANA says that the "http" scheme
is defined in RFC 2616, and that says:


   This specification adopts the
   definitions of "URI-reference", "absoluteURI", "relativeURI", "port",
   "host","abs_path", "rel_path", and "authority" from [RFC 2396].

But RFC 2396 says:

    An individual URI scheme may require a single charset, define a
    default charset, or provide a way to indicate the charset used.

RFC 2396 doesn't say anything about the "http" scheme, though, nor does
it indicate any default encoding or character set for URIs.  The update,
RFC 3986, doesn't say anything new about this, though it does implore
URI scheme designers to represent characters in a textual segment with
ASCII codes where they exist, and to use UTF-8 when designing *new* URI
schemes.

Barring any other information, I think that the "segments" in the path
of an "http" URL must also be assumed to be binary; that is, any octet
is allowed, and no character set can be presumed.
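
To make the "binary" reading concrete, here is a minimal sketch using
urllib.parse as it exists in current Python 3 (not the code under
discussion in this issue).  The segment "%E5" is just an illustrative
value matching the octet from the failing test.  It shows that the
octet round-trips cleanly as bytes, while decoding it to text requires
guessing a charset:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Illustrative path segment carrying the percent-encoded octet 0xE5.
segment = "%E5"

# Binary reading: turn the escape into a raw octet, assuming no charset.
raw = unquote_to_bytes(segment)

# Quoting the raw bytes again reproduces the original escape.
requoted = quote(raw)

# Text reading with the default guess (UTF-8, errors="replace"):
# a lone 0xE5 is not valid UTF-8, so it becomes U+FFFD.
as_utf8 = unquote(segment)

# Text reading with an explicit Latin-1 guess: this decodes, but nothing
# in the URI itself says Latin-1 is the right charset.
as_latin1 = unquote(segment, encoding="latin-1")

print(raw, requoted, repr(as_utf8), repr(as_latin1))
```

Only the bytes path preserves the octet without assuming anything the
"http" scheme doesn't actually specify.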

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________