Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is
> what I was going to implement.
FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If string cannot be
> converted then is treated as an error.

Yes.

> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
must be ascii-decodable. So are SERVER_PROTOCOL and our custom
ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI.
However, PATH_INFO and QUERY_STRING are parsed from it, and decoded via
a configurable charset, defaulting to UTF-8. If the path cannot be
decoded with that charset, ISO-8859-1 is tried. Whichever is successful
is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can
transcode if needed. Our origin server always sets SCRIPT_NAME to '',
but if we populated it, we would make it decoded by the same charset.
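For illustration only, the PATH_INFO/QUERY_STRING fallback described above could be sketched roughly like this. This is not CherryPy's actual code: the function name `populate_path_environ` and its signature are hypothetical, and the sketch assumes the Request-URI arrives as raw bytes and that the path is percent-decoded before the charset is applied.

```python
from urllib.parse import unquote_to_bytes


def populate_path_environ(environ, request_uri, charset="utf-8"):
    """Fill PATH_INFO and QUERY_STRING from raw Request-URI bytes.

    Hypothetical helper: tries the configured charset first, then
    ISO-8859-1, and records whichever one succeeded.
    """
    # Keep the original bytes so apps can re-parse them if needed.
    environ["REQUEST_URI"] = request_uri

    path, _, query = request_uri.partition(b"?")
    raw_path = unquote_to_bytes(path)  # resolve %XX escapes to bytes

    # ISO-8859-1 maps every byte to a code point, so the second
    # attempt cannot raise UnicodeDecodeError.
    for candidate in (charset, "iso-8859-1"):
        try:
            path_info = raw_path.decode(candidate)
            query_string = query.decode(candidate)
        except UnicodeDecodeError:
            continue
        environ["PATH_INFO"] = path_info
        environ["QUERY_STRING"] = query_string
        environ["REQUEST_URI_ENCODING"] = candidate
        return environ

    raise AssertionError("unreachable: ISO-8859-1 decodes any bytes")
```

Under these assumptions, middleware that disagrees with the server's guess can recover the bytes with `environ['PATH_INFO'].encode(environ['REQUEST_URI_ENCODING'])` and decode them again with whatever charset it prefers.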
All request headers are decoded via ISO-8859-1, which can't fail.
Applications are expected to transcode these values if they believe them
to be in another encoding.

> This is where I am going to diverge from what has been discussed before.
>
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.

That is predominant for the Request-URI, and we are defaulting to utf-8
for that as I mentioned above. I believe I demonstrated in
http://mail.python.org/pipermail/web-sig/2009-April/003755.html
that UTF-8 cannot be the predominant encoding for request headers, which
are instead mostly ASCII with a few ISO-8859-1's, which is why we are
defaulting to ISO-8859-1.

> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.

If there are indeed more headers which are ISO-8859-1, then that same
argument cuts both ways. I have no problem doing the same thing here as
we do for PATH_INFO: a configurable charset, or better yet a list of
charsets to try in order, with a sensible default, even UTF-8 would be
fine. Regardless of the default, if it is configurable, then the
successful encoding should be put in a canonical environ entry so apps
can transcode it if the server got it wrong.

Re: bytes.
We really do not want the server to set any of the above environ entries
(except REQUEST_URI) to bytes. I'm surprised those of you who have
substantial numbers of WSGI middleware aren't fighting this; it would
mean decoding the same environ entries every time you switched
middleware providers. Some of you said as much at PyCon:
http://mail.python.org/pipermail/web-sig/2009-March/003701.html

Robert Brewer
fuman...@aminus.org
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig