Just to narrow in on one case, URLs: there are a few pieces of information that make up the URL:
wsgi.url_scheme: this is *not* present in the request; it's inferred somehow (e.g., from the port the client connected to).

HTTP_HOST: this is a header. It typically contains both the hostname and the port. The encoding is generally IDNA, though you have to split the port off first. The Unicode version of the hostname is not widely supported in client libraries (it's usually applied at the UI level).

SCRIPT_NAME/PATH_INFO: these represent portions of the request path (the part before '?'). As submitted they are generally ASCII (URL-quoted). After unquoting, they are typically UTF-8, but may be of any encoding, or none. If an unsafe character is present in the URL-quoted version of the path, it may be quoted at the byte level. The '?' character is effectively a byte-oriented marker; encodings cannot affect it.

QUERY_STRING: this is also generally ASCII (URL-quoted). Unsafe characters can be quoted at the byte level.

Generally I'm unaware of any reasonable situation where quoting unsafe characters in an HTTP request would be improper, or would lose any meaningful information -- mostly because I don't know of any clients that actually expect unsafe characters to work. Quoting HTTP_HOST is harder, as it's not a byte-oriented quoting but a fairly complex encoding. I'm also not sure where in a stack you could actually handle unsafe characters in HTTP_HOST -- it seems like simply an invalid request, and deferring the error won't give another part of the stack the opportunity to do the right thing.

In their quoted form, all these values (including the quoted path, though not the unquoted SCRIPT_NAME/PATH_INFO) *should* be ASCII, and I believe a WSGI server could ensure they were all ASCII without any loss of useful information (either by simply rejecting the request or by applying quoting). I don't see any place where bytes are advantageous.
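A rough sketch of what that normalization might look like, using the stdlib. The helper names (ascii_host, ascii_path) are made up for illustration; a real server would also need a policy for rejecting hostnames that can't be IDNA-encoded:

```python
from urllib.parse import quote

def ascii_host(http_host):
    """Split the port off HTTP_HOST, IDNA-encode the hostname,
    and reattach the port. Raises UnicodeError for hostnames that
    can't be represented (arguably just an invalid request)."""
    host, sep, port = http_host.partition(':')
    host = host.encode('idna').decode('ascii')
    return host + sep + port

def ascii_path(path_info):
    """Re-quote a path at the byte level so any non-ASCII or unsafe
    characters become %XX escapes; '/' is structural and left intact."""
    return quote(path_info.encode('utf-8'), safe='/')

print(ascii_host('b\u00fccher.example:8080'))  # xn--bcher-kva.example:8080
print(ascii_path('/caf\u00e9/menu'))           # /caf%C3%A9/menu
```

Both transformations are lossless in the sense described above: the original bytes are recoverable by unquoting, and the IDNA form is what clients put on the wire anyway.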
Representing invalid requests does not seem particularly helpful -- *some* invalid requests are useful to handle (e.g., weird cookies), but in the case of the URL variables I don't see any benefit. IMHO all the tricky encoding issues are in the request and response bodies, and I'm pretty sure we have consensus that those should be bytes.

Reiterating the other encoding issues I'm aware of:

Cookie encodings: parsing cookies as bytes or as Latin-1 is basically equivalent, and I don't believe that, for instance, they should ever be parsed as UTF-8. Parsing as bytes might avoid an unnecessary encode/decode round trip, but it's all tricky enough that libraries should handle it anyway, and the encoding overhead alone isn't very important.

The Atom Title header (http://bitworking.org/projects/atom/draft-ietf-atompub-protocol-08.html#rfc.section.8.1.2): that's supposed to be Latin-1 with RFC 2047 encodings, and I don't believe anyone is proposing that RFC 2047 encodings be handled generally at the WSGI layer. (I think CherryPy does, or used to, handle these, but there were many objections, at least on this list, in part due to security concerns.) An RFC 2047 encoding looks like "Title: =?utf-8?q?stuff-with=-escaping?=".

Response headers: equivalent to request headers.

Response status: constrained by the spec to Latin-1, and there are no use cases I know of (even really obscure ones) where other encodings would be necessary.

And that's it! HTTP has a fairly finite amount of surface area.

--
Ian Bicking | http://blog.ianbicking.org
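For the curious, both points above are easy to demonstrate with the stdlib. This is a sketch, not a recommendation to do 2047 decoding at the WSGI layer; the header value here is made up:

```python
from email.header import decode_header, make_header

# Decoding an RFC 2047 encoded-word of the "Title:" kind mentioned above.
raw_value = '=?utf-8?q?caf=C3=A9?='
decoded = str(make_header(decode_header(raw_value)))
print(decoded)  # caf\u00e9

# Cookies: decoding bytes as Latin-1 is lossless, since every byte maps
# one-to-one onto a code point, so bytes and Latin-1 text are
# interchangeable representations of the same header.
raw_cookie = b'session=\xe2\x98\x83'
as_text = raw_cookie.decode('latin-1')
assert as_text.encode('latin-1') == raw_cookie
```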
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com