Andrew Clover wrote:
If we could reliably read the bytes the browser sends to us in the GET request, that would be great: we could just decode those and be done with it. Unfortunately, that's not reliable, because:

1. thanks to an old wart in the CGI specification, %XX hex escapes are decoded before the character is put into the PATH_INFO environment variable;

I don't see a problem with this? At least not a problem with respect to encoding. As it is (in Python 2), you should do something like environ['PATH_INFO'].decode('utf8') and it should work. It doesn't seem like there's any distinction between %-encoded characters and plain characters in this situation.

2. the environment variables may be stored as Unicode.

(1) on its own gives us the problem of not being able to distinguish a path-separator slash from an encoded %2F; a long-known problem but not one that greatly affects most people.

But combined with (2) that means some other component must choose how to decode the bytes into Unicode characters. No standard currently specifies what encoding to use, it is not typically configurable, and it's certainly not within reach of the WSGI application. My assumption is that most applications will want to end up with UTF-8-encoded URLs; other choices are certainly possible but as we move towards IRI they become less likely.
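
As an illustration of the two points above, here is a minimal sketch using the Python 3 standard library. The sample paths are made up for the example. It shows that once %XX escapes are decoded, an encoded %2F cannot be told apart from a real path separator, and that the same unescaped bytes become different text depending on which encoding some component picks.

```python
from urllib.parse import unquote_to_bytes

# (1) After unescaping, an encoded slash looks like a path separator.
escaped = unquote_to_bytes("/files/a%2Fb")   # made-up sample path
literal = unquote_to_bytes("/files/a/b")
print(escaped == literal)

# (2) The unescaped bytes still have to be decoded into text, and the
# result depends on the encoding that gets chosen.
raw = unquote_to_bytes("/caf%C3%A9")          # made-up sample path
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")
print(repr(as_utf8), repr(as_latin1))
```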


This situation previously affected only Windows users, because NT environment variables are native Unicode. However, Python 3.0 specifies all environment variable access is through a Unicode wrapper, and gives no way to control how that automatic decoding is done, leaving everyone in the same boat.

WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should be "decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047)", but unfortunately this doesn't quite work:

My understanding of this suggestion is that latin-1 is a way of representing bytes as unicode. In other words, the values will be unicode strings, but calling them text will simply be a lie: each character just stands for one byte. So if you know you have UTF-8 paths, you'd do:

path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')

As far as I can tell this is simply to avoid having bytes in the environment, even though bytes are an accurate representation and unicode is not.
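
A minimal sketch of why that round trip works, assuming Python 3 and a made-up path: latin-1 maps each of the 256 byte values to exactly one code point, so decoding with it never fails and never loses information. The `environ` dict here is a stand-in built by hand, not one produced by a real server.

```python
# latin-1 is a one-to-one mapping between byte values and code points
# U+0000..U+00FF, so the round trip is lossless for any byte string.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
print(round_tripped == all_bytes)

# Hypothetical server-side step: the raw UTF-8 path bytes are exposed
# as a latin-1-decoded str in environ.
raw = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": raw.decode("latin-1")}

# Application-side step from the text above: undo latin-1, then decode
# with the encoding the application actually expects.
path_info = environ["PATH_INFO"].encode("latin-1").decode("utf-8")
print(repr(environ["PATH_INFO"]), repr(path_info))
```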

A lot of what you write about has to do with CGI, which is the only place WSGI interacts with os.environ. That is really a concern of the CGI-to-WSGI adapter (like wsgiref.handlers.CGIHandler), and not of the WSGI spec itself.

Personally I'm more inclined to set up a policy on the WSGI server itself with respect to the encoding, and then use real unicode characters. Unfortunately that's not as flexible as bytes, as it doesn't make it very easy to sniff out the encoding in application-specific ways, or support different encodings in different parts of the server (which would be useful if, for instance, you were to proxy applications with unknown encodings). So... maybe that's not the most feasible option. But if it's not, then I'd rather stick with bytes.
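
To make the flexibility argument concrete, here is a rough sketch of the kind of application-specific sniffing that is only possible when the path is available as bytes. The helper name `decode_path` and the candidate encoding list are hypothetical and not part of WSGI or any library.

```python
def decode_path(raw_path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidates are tried in order, so a strict
    encoding such as UTF-8 should come before a permissive one such as
    latin-1, which accepts any byte sequence.
    """
    for encoding in encodings:
        try:
            return raw_path.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the path")


# Valid UTF-8 is taken as UTF-8; a lone 0xE9 byte is not valid UTF-8,
# so the helper falls back to latin-1.
print(decode_path(b"/caf\xc3\xa9"))
print(decode_path(b"/caf\xe9"))
```

A fixed server-wide policy would amount to calling this with a single encoding, which is simpler but gives up the fallback.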


--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig