Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

Ian Bicking Wed, 12 Nov 2008 15:25:45 -0800

Andrew Clover wrote:

If we could reliably read the bytes the browser sends to us in the GET request that would be great, we could just decode those and be done with it. Unfortunately, that's not reliable, because:
1. thanks to an old wart in the CGI specification, %XX hex escapes are decoded before the character is put into the PATH_INFO environment variable;

I don't see a problem with this? At least not a problem with respect to encoding. As it is (in Python 2), you should do something like environ['PATH_INFO'].decode('utf8') and it should work. It doesn't seem like there's any distinction between %-encoded characters and plain characters in this situation.
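
A minimal sketch of that point, in Python 3 syntax. Here urllib.parse stands in for whatever the CGI gateway does when it undoes the %XX escapes (an assumption for illustration; real servers do this in their own code), and the example paths are made up:

```python
from urllib.parse import unquote_to_bytes

# The same path spelled two ways: fully %-escaped, and with the
# non-ASCII character sent as raw UTF-8 bytes (hypothetical examples).
escaped_request = "/caf%C3%A9/menu"
plain_request_bytes = "/caf\u00e9/menu".encode("utf8")

# The gateway decodes %XX escapes before filling PATH_INFO.
path_from_escaped = unquote_to_bytes(escaped_request)

# After that step both spellings are the same bytes...
same_bytes = path_from_escaped == plain_request_bytes

# ...so a single UTF-8 decode handles either one.
decoded = path_from_escaped.decode("utf8")
```

Once the escapes are gone there is nothing left to tell the two spellings apart, which is why the decode can treat them the same way.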

2. the environment variables may be stored as Unicode.
(1) on its own gives us the problem of not being able to distinguish a path-separator slash from an encoded %2F; a long-known problem but not one that greatly affects most people.
But combined with (2) that means some other component must choose how to decode the bytes into Unicode characters. No standard currently specifies what encoding to use, it is not typically configurable, and it's certainly not within reach of the WSGI application. My assumption is that most applications will want to end up with UTF-8-encoded URLs; other choices are certainly possible but as we move towards IRI they become less likely.
This situation previously affected only Windows users, because NT environment variables are native Unicode. However, Python 3.0 specifies all environment variable access is through a Unicode wrapper, and gives no way to control how that automatic decoding is done, leaving everyone in the same boat.
WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should be "decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047)", but unfortunately this doesn't quite work:

My understanding of this suggestion is that latin-1 is a way of representing bytes as unicode. In other words, the values will be unicode, but that will simply be a lie. So if you know you have UTF8 paths, you'd do:


path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')

As far as I can tell this is simply to avoid having bytes in the environment, even though bytes are an accurate representation and unicode is not.
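
To make the round trip concrete, here is a small Python 3 sketch. The environ dict is built by hand to mimic what a server following the latin-1 suggestion would produce; the path itself is a made-up example:

```python
# Bytes as they arrive on the wire for the path /café, UTF-8 encoded.
wire_bytes = "/caf\u00e9".encode("utf8")

# A server following the suggestion maps each byte to the code point
# with the same number, which is exactly what a latin-1 decode does.
environ = {"PATH_INFO": wire_bytes.decode("latin-1")}

# The value is a str, but it is not the real path: the two UTF-8 bytes
# of the accented character show up as two separate characters.
lie = environ["PATH_INFO"]

# Encoding back to latin-1 recovers the original bytes exactly, and
# those can then be decoded with the encoding the application expects.
recovered_bytes = environ["PATH_INFO"].encode("latin-1")
path_info = recovered_bytes.decode("utf8")
```

The trick works because latin-1 maps every byte value 0-255 to a code point, so no information is lost in either direction.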

A lot of what you write about has to do with CGI, which is the only place WSGI interacts with os.environ. CGI is really an aspect of the CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI spec itself.

Personally I'm more inclined to set up a policy on the WSGI server itself with respect to the encoding, and then use real unicode characters. Unfortunately that's not as flexible as bytes, as it doesn't make it very easy to sniff out the encoding in application-specific ways, or support different encodings in different parts of the server (which would be useful if, for instance, you were to proxy applications with unknown encodings). So... maybe that's not the most feasible option. But if it's not, then I'd rather stick with bytes.
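
As a rough illustration of the kind of sniffing that bytes allow, here is a Python 3 sketch. The sniff_decode helper and its candidate list are hypothetical, not part of WSGI or any server; a real application would pick candidates that suit its own data:

```python
def sniff_decode(raw, candidates=("utf-8", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the raw bytes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


# A path sent as UTF-8 decodes with the first candidate.
utf8_result = sniff_decode(b"/caf\xc3\xa9")

# The same path from a latin-1 client is not valid UTF-8,
# so the helper falls through to the second candidate.
legacy_result = sniff_decode(b"/caf\xe9")
```

With a fixed server-wide policy the application would only ever see the result of one decode, and could not retry with a different encoding like this.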



--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
