Bill Janssen <[EMAIL PROTECTED]> added the comment:

Now I'm looking at the failing test_http_cookiejar test, which fails because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI. The question is: does the "http" URI scheme allow non-ASCII (say, Latin-1) octets in path segments?

IANA says that the "http" scheme is defined in RFC 2616, and that says:

    This specification adopts the definitions of "URI-reference", "absoluteURI", "relativeURI", "port", "host", "abs_path", "rel_path", and "authority" from [RFC 2396].

But RFC 2396 says:

    An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

And it doesn't say anything about the "http" scheme. Nor does it indicate any default encoding or character set for URIs. The update, RFC 3986, doesn't say anything new about this, though it does implore URI scheme designers to represent characters in a textual segment with ASCII codes where they exist, and to use UTF-8 when designing *new* URI schemes.

Barring any other information, I think that the "segments" in the path of an "http" URL must also be assumed to be binary; that is, any octet is allowed, and no character set can be presumed.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
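
To make the "any octet, no presumed character set" reading concrete, here is a minimal sketch using the urllib.parse functions as they exist in current Python 3. It is an illustration only, not the code under discussion in this issue, and the segment value b"caf\xe5" is a made-up example rather than the one used in test_http_cookiejar.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Hypothetical path segment containing the Latin-1 octet 0xE5.
# On its own this byte is not valid UTF-8.
raw_segment = b"caf\xe5"

# Percent-encoding works on octets, so no character set is needed here.
encoded = quote(raw_segment)

# Decoding back to bytes is lossless and assumes nothing about the charset.
as_bytes = unquote_to_bytes(encoded)

# Decoding to text forces a charset choice. Latin-1 happens to fit this
# example; the default (UTF-8 with errors="replace") yields U+FFFD instead.
as_latin1 = unquote(encoded, encoding="latin-1")
as_default = unquote(encoded)

# With strict error handling, the UTF-8 assumption fails outright.
try:
    unquote(encoded, errors="strict")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print(encoded, as_bytes, as_latin1, repr(as_default), strict_utf8_ok)
```

Only the bytes round trip is independent of any encoding assumption, which is the behaviour a "binary segments" interpretation would require.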