Duncan Booth wrote:
> The way that uri encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8 and then each byte which is not in
> the permitted range for characters is encoded as % followed by two hex
> characters.

Can you back up this claim ("is supposed to work") by reference to a
specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding of non-ASCII
characters is, and whether % escapes denote characters in the first
place.

> Unfortunately RFC3986 isn't entirely clear-cut on this issue:
>
>> When a new URI scheme defines a component that represents textual
>> data consisting of characters from the Universal Character Set [UCS],
>> the data should first be encoded as octets according to the UTF-8
>> character encoding [STD63]; then only those octets that do not
>> correspond to characters in the unreserved set should be percent-
>> encoded. For example, the character A would be represented as "A",
>> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> as "%C3%80", and the character KATAKANA LETTER A would be represented
>> as "%E3%82%A2".

This is irrelevant; it talks about new URI schemes only.

> I think it leaves open the possibility that existing URI schemes which do
> not support unicode characters can use other encodings, but given that the
> original posting started by decoding a unicode string I think that utf-8
> should definitely be assumed in this case.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters, not
sequences of octets, yet RFC 2616 fails to recognize that issue and
refuses to specify a character set for its scheme (which RFC 2396 says
that it could).

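To illustrate why the unspecified character set matters, here is a
minimal sketch, assuming Python 3's urllib.parse (my choice for the
illustration, not something any of the RFCs prescribes). It reproduces
the three examples from the RFC 3986 text quoted above, and then shows
that the same character gives a different escape under Latin-1, and that
the same escape decodes to different text depending on the encoding the
receiver assumes.

```python
from urllib.parse import quote, unquote_to_bytes

# The three characters from the RFC 3986 example:
# A, LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A.
samples = ["A", "\u00c0", "\u30a2"]

# Encode to UTF-8 octets first, then percent-encode everything
# outside the unreserved set (safe="" so that "/" is not exempted).
utf8_escaped = [quote(ch, safe="", encoding="utf-8") for ch in samples]
print(utf8_escaped)    # ['A', '%C3%80', '%E3%82%A2']

# The same character under a different (assumed) encoding yields a
# different escape sequence.
latin1_escaped = quote("\u00c0", safe="", encoding="latin-1")
print(latin1_escaped)  # %C0

# Going the other way, the escapes only give octets; which characters
# they denote depends on the encoding the receiver assumes.
raw = unquote_to_bytes("%C3%80")
as_utf8 = raw.decode("utf-8")      # one character: U+00C0
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+0080
print(len(as_utf8), len(as_latin1))
```

An octet-by-octet comparison as in RFC 2616 3.2.3 would treat "%C3%80"
and "%C0" as different URIs, even though both can stand for the same
character.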
The conventional wisdom is that the choice of URI encoding for HTTP is a
server-side decision; for that reason, IRIs were introduced.

Regards,
Martin