> Basically, urllib.quote and unquote seem not to have been updated since
> Python 2.5, and because of this they implicitly perform Latin-1 encoding and
> decoding (with respect to percent-encoded characters). I think they should
> default to UTF-8 for a number of reasons, including that's what other
> software such as web browsers use.

The standard here is RFC 3986, from Jan 2005, which says,

``When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.''

The "unreserved set" consists of the following ASCII characters:

``Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"''

There are a few other wrinkles; it's worth reading section 2.5 carefully.

I'd say, treat the incoming data as Unicode if it's a Unicode string. If it's a byte-string, and thus in some unknown encoding, treat it as some unknown superset of ASCII (which covers both Latin-1 and UTF-8). Then apply the appropriate transformation in each case.

Bill
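
P.S. To make that concrete, here is a minimal sketch of the transformation, written for Python 3 and assuming the rule quoted above from RFC 3986. The name percent_encode is made up for illustration and is not the urllib API. Unlike urllib's quote, it has no "safe" parameter, so reserved characters such as "/" are escaped too.

```python
import string

# Unreserved set per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~",
# stored as integer octet values for direct comparison with bytes.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(data):
    """Hypothetical helper, not part of urllib.

    A Unicode string is first encoded to UTF-8 octets. A byte-string is
    assumed to be in some unknown superset of ASCII and is used as is.
    Every octet outside the unreserved set is then percent-encoded.
    """
    if isinstance(data, str):
        octets = data.encode("utf-8")
    else:
        octets = bytes(data)
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b)
        for b in octets
    )


if __name__ == "__main__":
    print(percent_encode("caf\u00e9"))    # Unicode input, encoded as UTF-8 first
    print(percent_encode(b"caf\xe9"))     # Latin-1 bytes, passed through octet by octet
```

The two calls give different results for what looks like the same word, which is the point: with a byte-string the function cannot know the encoding, so it escapes the octets it was given instead of guessing.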