On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <p...@telecommunity.com> wrote:
> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values. I think
>> bytes will be more awkward to port to than text, and inconsistent with other
>> WSGI values.
>>
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto
> the app (or framework) developer... who's really the only one who can make
> the right decision for their particular application. And personally, I'd
> rather have clear boundaries between text and bytes, such that porting (even
> if tedious or awkward) is *consistent*, and clear as to when you're
> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and
> PATH_INFO... not just in my app code, but in all the library code I call
> *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to
> realize that we might just as well make the *entire* stack bytes (incoming
> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
> using str on "Python 3000" to say we go with bytes on Python 3+ for
> everything that's a str in today's WSGI.
>

This was my first intuition too, until I started thinking in more detail about the particular values involved. Some obviously are textish, like environ['SERVER_NAME']. Not a very useful value, but definitely text.

Basically all the internal strings are textish, so we're left with:

wsgi.url_scheme
SCRIPT_NAME/PATH_INFO
QUERY_STRING
HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
response status
response headers (name and value)

And there's a few things like REMOTE_USER that are kind of in the middle. Everyone is in agreement that bodies should be bytes.

One initial problem is that the Python 3 stdlib handles bytes poorly, so for instance there's no good way to reconstruct the URL using the stdlib.
That explains certain tensions, but I think we should ignore that, and in fact that's what Python-Dev seemed to say pretty clearly.

Now, the other keys:

wsgi.url_scheme: clearly ASCII

SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old legacy encoding.

raw request path: should be ASCII (non-ASCII should be URL-encoded). URL encoding happens at the byte layer, so a server could reasonably URL-encode any non-ASCII characters without imposing any encoding.

QUERY_STRING: should be ASCII, same as raw request path

headers: Most are ASCII. Latin1 is a reasonable fallback and suggested by the specification. The spec also implies you have to use the RFC2047 inline encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and supporting it would probably be a bad idea for security reasons. The Atompub spec (reasonably modern) specifically says Title headers should be encoded with RFC2047 (if they are not ISO-8859-1): http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- decoding this kind of encoding at the application layer seems reasonable to me.

cookie header: this specific header can easily have multiple encodings, as the browser encodes data then treats it as opaque bytes, so a cookie can be set via UTF-8 in one place, Latin1 in another, and those coexist in one header. That is, there is no real encoding and this should be treated as bytes. (Latin1 is an approximation of bytes... a spotty way to treat bytes, but entirely workable.)

response status: I believe the spec says this must be Latin1/ISO-8859-1. In practice it is almost always ASCII, and since it is not user-visible it's not something that really needs localization.

response headers: the spec implies Latin1; in practice the Set-Cookie header is bytes (since interoperation with wonky legacy systems is not uncommon). I'm not sure of any other exceptions?

So... to me it seems pretty reasonable for HTTP specifically that text can work.
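
To make the "Latin1 is an approximation of bytes" point concrete, here is a minimal sketch, assuming Python 3. The helper names (bytes_to_native, native_to_bytes, redecode) are made up for illustration and are not part of WSGI or any server; only urllib.parse.quote_from_bytes is a real stdlib call. It shows that Latin1 decoding loses nothing, so an app can recover the original bytes and pick its own encoding, and that URL-encoding can be done on raw bytes without choosing an encoding at all.

```python
from urllib.parse import quote_from_bytes


def bytes_to_native(raw: bytes) -> str:
    """Carry arbitrary bytes in a str: Latin1 maps each byte to one code point."""
    return raw.decode("latin-1")


def native_to_bytes(value: str) -> bytes:
    """Recover the exact original bytes from a Latin1-decoded str."""
    return value.encode("latin-1")


def redecode(value: str, encoding: str = "utf-8") -> str:
    """App-level step: reinterpret the carried bytes in the encoding the app expects."""
    return native_to_bytes(value).decode(encoding)


# A cookie header holding one UTF-8 value and one Latin1 value side by side.
raw_cookie = b"a=" + "\u00e9".encode("utf-8") + b"; b=" + "\u00e9".encode("latin-1")
carried = bytes_to_native(raw_cookie)

# The round trip is lossless even though the header has no single encoding.
assert native_to_bytes(carried) == raw_cookie

# URL-encoding works on the bytes directly, so no encoding is imposed.
encoded_path = quote_from_bytes(b"/caf\xc3\xa9")  # '/caf%C3%A9'
```

The cost is the one noted above: the carried str looks like text but is not meaningful text until the app re-decodes it, so forgetting the redecode step yields mojibake rather than an error.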
And it feels weird that, say, environ['SERVER_NAME'] would be text and environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] should be in that mode. And it would also be weird if environ['SERVER_NAME'] was bytes.

In the past when we've gotten down to specifics, the only holdup has been SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

--
Ian Bicking | http://blog.ianbicking.org