At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote:
1. When running under Python 3, applications SHOULD produce bytes
output, status line and headers.

This is effectively what we had before. The only difference is that it
clarifies that the 'status line' values should also be bytes. This
wasn't noted before. I had already updated the proposed WSGI 1.0
amendments page to mention this.

+1


2. When running under Python 3, servers and gateways MUST accept
strings for output, status line and headers. Such strings must be
converted to bytes output using 'latin-1'. If a string cannot be
converted, this is treated as an error.

This is again what we had before, except that it mentions the 'status line' value.
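
A minimal sketch of rule 2, for illustration only. The helper name to_wire_bytes is hypothetical and not taken from any WSGI server; it just shows the latin-1 conversion and the error case for strings that cannot be converted.

```python
def to_wire_bytes(value):
    """Coerce an application-supplied status, header or body value to bytes.

    Hypothetical helper: bytes pass through unchanged, str is encoded
    as latin-1, and anything that cannot be encoded is an error.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        try:
            return value.encode("latin-1")
        except UnicodeEncodeError as exc:
            # Code points above U+00FF have no latin-1 representation.
            raise ValueError("not representable in latin-1: %r" % (value,)) from exc
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

With this, "200 OK" and b"200 OK" produce the same bytes, while a string containing a character such as the euro sign is rejected.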

3. When running under Python 3, servers MUST provide wsgi.input as a
binary (byte) input stream.

No change here.

4. When running under Python 3, servers MUST provide a text stream for
wsgi.errors. In converting this to a byte stream for writing to a
file, the default encoding would be applied.

No real change here except to clarify that the default encoding would
apply. Use of the default encoding could be problematic, though, when
combining different WSGI components. This is because each WSGI
component may have been developed on a system with a different default
encoding, and so one may expect to log characters that can't be written
on a different setup. Not sure how you could solve that except to say
that people should set the default encoding to UTF-8 for portability.

Also +1.
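
The portability concern in point 4 can be reproduced with a small sketch. It assumes wsgi.errors behaves like an io.TextIOWrapper over a byte sink; the function name can_log is hypothetical.

```python
import io


def can_log(message, encoding):
    """Return True if message can be written to a text stream that
    encodes to bytes using the given encoding (hypothetical helper)."""
    sink = io.BytesIO()
    stream = io.TextIOWrapper(sink, encoding=encoding)
    try:
        stream.write(message)
        stream.flush()
    except UnicodeEncodeError:
        # The stream's encoding cannot represent some character.
        return False
    return True
```

A message containing an accented character can be logged when the stream encoding is UTF-8 but not when it is ASCII, which is the mismatch between setups described above.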


5. When running under Python 3, servers MUST provide CGI HTTP and
server variables as strings. Where such values are sourced from a byte
string, be that a Python byte string or C string, they should be
converted as 'UTF-8'. If a specific web server infrastructure is able
to support different encodings, then the WSGI adapter MAY provide a
way for a user of the WSGI adapter to customise on a global basis, or
on a per value basis what encoding is used, but this is entirely
optional. Note that there is no requirement to deal with RFC 2047.

This is where I am going to diverge from what has been discussed before.

The reason I am going to pass as UTF-8 and not latin-1 is that it
looks like Apache effectively only supports use of UTF-8. Since this
means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
even CGI likely cannot handle anything besides UTF-8 then I really
can't see the point of trying to cater for a theoretical possibility
that some HTTP client could use something besides UTF-8. In other
words, the predominant case will be UTF-8, so let us target that.

So, rather than burden every WSGI application with the need to convert
from latin-1 back to bytes and then to UTF-8, let the server deal with
it, with the server using a sensible default. Where the server
infrastructure can handle a different encoding, it can provide an
option to use that encoding, and the WSGI application doesn't need to
change.
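
For reference, the round trip an application would have to perform if the server decoded as latin-1 looks roughly like this. It is a sketch only; the function name reinterpret_as_utf8 is hypothetical.

```python
def reinterpret_as_utf8(environ_value):
    """Undo a server-side latin-1 decode and decode the bytes as UTF-8."""
    return environ_value.encode("latin-1").decode("utf-8")


# Bytes as they would arrive from a client that sent UTF-8.
raw = "caf\u00e9".encode("utf-8")
# What the application would see if the server decoded them as latin-1.
as_latin1 = raw.decode("latin-1")
# What the application has to do to get the intended text back.
recovered = reinterpret_as_utf8(as_latin1)
```

The latin-1 step never loses information, since every byte maps to exactly one code point, but every application has to repeat the conversion.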

Maybe I'm missing something here, but what if Apache receives something encoded in Latin-1? AFAIR, form POST encoding is determined by the encoding of the page containing the form; that's of course something that only happens in the input body, but what about URLs?

Mainly I'm wondering, what should the server do in the event they receive a byte string which is not valid UTF-8? (Latin-1 doesn't have this problem, since there's no such thing as an invalid Latin-1 string, at least not at the encoding level.)
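
To make the difference concrete, here is a short sketch using only the built-in codecs. The byte string is an arbitrary example.

```python
# A path containing a single latin-1 encoded e-acute (0xE9).
raw = b"/caf\xe9"

# latin-1 maps every byte value to a code point, so this cannot fail.
as_latin1 = raw.decode("latin-1")

# 0xE9 starts a multi-byte UTF-8 sequence, but no continuation bytes
# follow, so a strict UTF-8 decode raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```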


It has also been shown, though, that the SCRIPT_NAME part has to be UTF-8,
and we would really be entering fantasy land if you were somehow going
to cope with some different encoding for PATH_INFO and QUERY_STRING.
Instead it is like the GPL, viral in nature. Use of UTF-8 in one
particular area means you are effectively bound to use UTF-8
everywhere else.

I'm not clear on your logic here. If I request foo/bar/baz (where baz actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the script, then the (accented) baz is legitimate for pass-through to the application, no?

I just tried testing this with Firefox and Apache, and found that you can in fact pass such Latin-1 strings through to PATH_INFO, but at least in the case of Firefox, you have to %-escape them. However, they are seen by Python (via os.environ) as latin-1 encoded byte strings.
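
The same bytes can be produced without a browser. This sketch uses urllib.parse.unquote_to_bytes from the Python 3 standard library; the URL path and the is_valid_utf8 helper are made up for illustration.

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# %E1 is a-acute in latin-1; on its own it is not a valid UTF-8 sequence.
latin1_path = unquote_to_bytes("/foo/bar/b%E1z")
# The same character %-escaped as UTF-8 takes two bytes.
utf8_path = unquote_to_bytes("/foo/bar/b%C3%A1z")
```

Here is_valid_utf8(latin1_path) is False and is_valid_utf8(utf8_path) is True, so a single %-escape is enough to hand the server bytes that a strict UTF-8 decode rejects.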


A further example of why UTF-8 reaches into everything is the mod_rewrite
module for Apache. This allows you to do stuff related to the SCRIPT_NAME,
PATH_INFO and QUERY_STRING parts of a URL. It has already been shown that
the Apache configuration file has to be UTF-8. If the URL isn't, then it
wouldn't be possible to perform matches against non latin-1 characters in a
rewrite condition or rule. This is because your match string would be
in a different encoded form to that in the URL and so wouldn't match.

Note that this still doesn't have any impact on the bytes that actually reach the application, which can be non-UTF8. At minimum, the proposal is underspecified as to how to handle this case, which is as trivial to generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s) of a URL.

_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig