Matt Giuca writes:

> OK, for all the people who say URI encoding does not encode characters: yes
> it does. This is not an encoding for binary data, it's an encoding for
> character data, but it's unspecified how the strings map to octets before
> being percent-encoded.

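A minimal sketch of what "unspecified" means in practice. It assumes the `urllib.parse.quote` function of current Python 3 (not the API under discussion in this thread) and an arbitrary sample string; the point is only that the same characters yield different percent-encoded output depending on which character encoding maps them to octets.

```python
from urllib.parse import quote, unquote_to_bytes

# Arbitrary non-ASCII sample text, chosen only for illustration.
text = "日本語"

# The same characters become different octets under different character
# encodings, so the percent-encoded forms differ as well.
as_utf8 = quote(text, safe="", encoding="utf-8")
as_eucjp = quote(text, safe="", encoding="euc-jp")

print(as_utf8)
print(as_eucjp)

# Unquoting recovers octets, not characters: the caller still has to know
# which encoding was used before percent-encoding.
assert unquote_to_bytes(as_utf8) == text.encode("utf-8")
assert unquote_to_bytes(as_eucjp) == text.encode("euc-jp")
```

The percent-encoding step itself never sees characters, only whichever octets the chosen encoding produced.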
In other words, it's an encoding for binary data, since the octet
sequences that might be encountered are completely unrestricted.  I
have to side with Bill on this.  URIs are sequences of characters, but
the character set used must contain the ASCII repertoire as a subset,
with the URI delimiters mapped to the corresponding ASCII codes.  The
rest of the set must be represented as sequences of octets (which need
not even be constant; you could gzip them first for all URI-encoding
cares).  URI-encoding itself is a purely mechanical process which
transforms reserved octets (not used as delimiters) to percent codes.

> From RFC 3986, section
> 1.2.1 <http://tools.ietf.org/html/rfc3986#section-1.2.1>:
> > Percent-encoded octets (Section 2.1) may be used within a URI to represent
> > characters outside the range of the US-ASCII coded character set if this
> > representation is allowed by the scheme or by the protocol element in which
> > the URI is referenced. Such a definition should specify the character
> > encoding used to map those characters to octets prior to being
> > percent-encoded for the URI.

This is kinda perverted, but suppose you have bytes which are actually
a Japanese string represented in packed EUC-JP.  AFAICS the paragraph
above does *not* say you can't transcode to UTF-8 before
percent-encoding, and in fact you might be required to by the
definition of the scheme.

> So the string->string proposal is actually correct behaviour.

Ye-e-es, but.  What the RFC clearly envisions is not that the
percent-encoder will be handed an unencoded string that looks like a
URI, but rather a sequence of octets representing one component
(scheme, authority, path, query, etc) of a URI.  In other words, a
string->string URI encoder should only be called by a URI builder, and
never with a precomposed URI-like string.  Something like

    from urllib.parse import quote

    def URIBuilder(strings):
        """Return a URI built from a list of strings.

        The first string *must* be the scheme.  If the URI follows the
        generic URI syntax of RFC 3986, the remaining components should
        be given in the order authority, path, fragment, query part
        [, query part ...]."""

        def uriencode(s):
            """URI encode a string per RFC 3986 Section 2.1."""
            # We all know what this does.  For the sake of a runnable
            # sketch, percent-encode everything except unreserved
            # characters.
            return quote(s, safe="")

        if strings[0] == "http":
            # HTTP scheme, delimiters and authority
            uri = "http://" + uriencode(strings[1]) + "/"
            # path, if present
            if len(strings) > 2 and strings[2]:
                uri = uri + uriencode(strings[2])
            # query, if present
            if len(strings) > 4 and strings[4]:
                uri = uri + "?" + uriencode(strings[4])
                # further query parameters, if present
                for s in strings[5:]:
                    uri = uri + ";" + uriencode(s)
            # fragment, if present
            if len(strings) > 3 and strings[3]:
                uri = uri + "#" + uriencode(strings[3])
        elif strings[0] == "mailto":
            uri = "mailto:" + uriencode(strings[1])
        # etc etc
        else:
            raise ValueError("unsupported scheme: " + strings[0])
        return uri

I think you'd have a much easier time enforcing this pedantically
correct usage with a bytes->bytes encoder.  Of course, it's
un-Pythonic to enforce pedantry, and we pedants can use a
string->string encoder correctly.

> You really want me to remove the encoding= named argument? And hard-code
> UTF-8 into these functions?

A quoting function that accepts bytes *must* have an encoding
argument.  There's no point to passing the quoter bytes unless the
text is represented in a non-Unicode encoding.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev