2009/8/12 Ian Bicking <i...@colorstudy.com>:
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fuman...@aminus.org> wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in. At least that's what I got when testing
> Firefox. It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation. (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).

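For reference, the fallback scheme Robert describes above could look roughly like the following. This is only a sketch of my reading of his description, not his server's actual code: the function names decode_path and populate_environ are made up for illustration, and only the REQUEST_URI_ENCODING key comes from his message.

```python
def decode_path(raw_path, charset="utf-8", fallback="iso-8859-1"):
    """Decode already url-unquoted path bytes.

    Returns (text, encoding_used). Hypothetical helper, for illustration only.
    """
    try:
        return raw_path.decode(charset), charset
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this decode
        # cannot fail, and re-encoding gives back the original bytes.
        return raw_path.decode(fallback), fallback


def populate_environ(environ, raw_path, charset="utf-8"):
    """Store the decoded path and record which encoding succeeded."""
    path, used = decode_path(raw_path, charset)
    environ["PATH_INFO"] = path
    environ["REQUEST_URI_ENCODING"] = used
    return environ
```

With this, a path that is valid UTF-8 is recorded as 'utf-8', while a path containing a lone 0xE9 byte ends up recorded as 'iso-8859-1', and an application can recover the original bytes with environ['PATH_INFO'].encode(environ['REQUEST_URI_ENCODING']).
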
Thinking about it for a while, I get the feeling that falling back to
latin-1 when a string cannot be decoded as UTF-8 is a bad idea. That URLs
wouldn't consistently use the same encoding all the time just seems wrong.
I would rather see a failed decode answered with a bad request status.

If an application coder knows they are actually going to be dealing with
latin-1, as that is how the application is written, then they should
specify that it is always latin-1 instead of utf-8.

Thus, the WSGI adapter should provide a means to override what encoding is
used. For simple WSGI adapters which only service one WSGI application, the
override would apply to the whole URL namespace. For something like Apache,
where URLs could map to multiple WSGI applications, the adapter may want to
provide a means of overriding the encoding for specific subsets of URLs,
e.g. by using the Location directive.

Graham
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
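
To make the stricter alternative concrete, here is a minimal sketch assuming the adapter decodes with exactly one encoding per request and treats a decode failure as a bad request. Everything named here (BadRequest, encoding_for, decode_strict, and the overrides mapping of URL prefix to encoding) is hypothetical; the prefix mapping only stands in for whatever per-Location configuration a real adapter might offer.

```python
class BadRequest(Exception):
    """Signals that the adapter should answer with a 400 status."""
    status = "400 Bad Request"


def encoding_for(path_bytes, overrides, default="utf-8"):
    """Pick the encoding configured for the longest matching URL prefix."""
    best = None
    for prefix in overrides:
        if path_bytes.startswith(prefix):
            if best is None or len(prefix) > len(best):
                best = prefix
    return overrides[best] if best is not None else default


def decode_strict(path_bytes, overrides=None, default="utf-8"):
    """Decode with the configured encoding only; never fall back."""
    encoding = encoding_for(path_bytes, overrides or {}, default)
    try:
        return path_bytes.decode(encoding)
    except UnicodeDecodeError:
        raise BadRequest("request path is not valid %s" % encoding)
```

An adapter serving a single application would pass no overrides, so the default applies to the whole URL namespace. A deployment hosting a latin-1 application under one prefix might pass something like {b"/legacy/": "iso-8859-1"}, so that only that subset of URLs is decoded differently.
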