Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is
> what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If string cannot be
> converted then is treated as an error.

Yes.
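
As a rough sketch of that rule (not CherryPy's actual code; the helper name
to_wire_bytes is made up for illustration), the conversion could look like:

```python
def to_wire_bytes(value):
    """Return value as bytes, encoding str via latin-1 (hypothetical helper)."""
    if isinstance(value, bytes):
        # Already bytes: pass through untouched.
        return value
    if isinstance(value, str):
        # Raises UnicodeEncodeError for any code point above U+00FF,
        # which a server would then report as an application error.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %r" % type(value))
```

A status line such as "200 OK" passes through as b"200 OK", while a string
containing, say, a euro sign fails to encode and is treated as an error.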

> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must
be ASCII-decodable. The same goes for SERVER_PROTOCOL and our custom
ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, 
PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable 
charset, defaulting to UTF-8. If the path cannot be decoded with that charset, 
ISO-8859-1 is tried. Whichever is successful is stored at 
environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. 
Our origin server always sets SCRIPT_NAME to '', but if we populated it, we
would decode it with the same charset.
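
To make the fallback concrete, here is a minimal sketch of the idea. It is
not the CherryPy source: the function name decode_request_uri is
hypothetical, and I am assuming the path is percent-unquoted before it is
decoded while the query string is left percent-encoded.

```python
from urllib.parse import unquote_to_bytes


def decode_request_uri(request_uri, charset="utf-8"):
    """Build the URI-related environ entries from raw Request-URI bytes.

    Hypothetical helper: tries the configured charset first, then
    ISO-8859-1, and records whichever one succeeded.
    """
    path, _, query = request_uri.partition(b"?")
    # Assumption: PATH_INFO is percent-unquoted before decoding.
    raw_path = unquote_to_bytes(path)
    for encoding in (charset, "iso-8859-1"):
        try:
            path_info = raw_path.decode(encoding)
            query_string = query.decode(encoding)
        except UnicodeDecodeError:
            continue
        return {
            "REQUEST_URI": request_uri,  # original bytes, untouched
            "PATH_INFO": path_info,
            "QUERY_STRING": query_string,
            "REQUEST_URI_ENCODING": encoding,
        }
    # Unreachable in practice: ISO-8859-1 can decode any byte string.
    raise ValueError("Request-URI could not be decoded")
```

With this sketch, b"/caf%C3%A9" decodes as UTF-8, while b"/caf%E9" fails
UTF-8 and falls back to ISO-8859-1; both yield the same PATH_INFO, and
REQUEST_URI_ENCODING tells the app which charset was used.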

All request headers are decoded via ISO-8859-1, which can't fail. Applications 
are expected to transcode these values if they believe them to be in another 
encoding.
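
For illustration, the transcoding step on the application side is a simple
round trip. The helper name transcode is made up; the point is only that
ISO-8859-1 maps every byte to exactly one code point, so the original bytes
can always be recovered.

```python
def transcode(value, actual="utf-8", assumed="iso-8859-1"):
    """Re-decode a str the server decoded with the wrong charset.

    Hypothetical helper: encoding with the server's charset recovers the
    original bytes, which are then decoded with the charset the app
    believes is correct. Raises UnicodeDecodeError if that belief is wrong.
    """
    return value.encode(assumed).decode(actual)
```

An app that knows a particular header was sent as UTF-8 would call
transcode(environ['HTTP_X_EXAMPLE']), where HTTP_X_EXAMPLE stands in for
whatever header is in question.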

> This is where I am going to diverge from what has been discussed before.
> 
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.

That is the predominant case for the Request-URI, and we are defaulting to
UTF-8 for that, as I mentioned above. I believe I demonstrated in
http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8
cannot be the predominant encoding for request headers, which are instead
mostly ASCII with a few ISO-8859-1 values; that is why we are defaulting to
ISO-8859-1.

> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.

If there are indeed more headers which are ISO-8859-1, then that same argument 
cuts both ways.

I have no problem doing the same thing here as we do for PATH_INFO: a
configurable charset, or better yet a list of charsets to try in order, with a
sensible default; even UTF-8 would be fine. Regardless of the default, if it is
configurable, then the successful encoding should be put in a canonical environ
entry so apps can transcode the value if the server got it wrong.
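
A sketch of what that could look like for a single header value. This is
only an illustration: the function name decode_header and the default
charset list are assumptions, not anything a server implements today.

```python
def decode_header(raw, charsets=("utf-8", "iso-8859-1")):
    """Decode raw header bytes with the first charset that succeeds.

    Hypothetical helper: returns (text, charset) so the server can record
    the successful charset in an environ entry for apps to inspect.
    """
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("no configured charset could decode the value")
```

Putting ISO-8859-1 last in the list guarantees that decoding never fails,
since it accepts every byte value.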

Re: bytes. We really do not want the server to set any of the above environ
entries (except REQUEST_URI) to bytes. I'm surprised those of you who have a
substantial amount of WSGI middleware aren't fighting this; it would mean
decoding the same environ entries every time you switched middleware providers.
Some of you said as much at PyCon:
http://mail.python.org/pipermail/web-sig/2009-March/003701.html


Robert Brewer
fuman...@aminus.org
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig