-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ian Bicking wrote:
>> IOW, the bytes/string discussion on Python-dev has kind of led me to
>> realize that we might just as well make the *entire* stack bytes (incoming
>> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
>> using str on "Python 3000" to say we go with bytes on Python 3+ for
>> everything that's a str in today's WSGI.
>
> This was my first intuition too, until I started thinking in more detail
> about the particular values involved.  Some obviously are textish, like
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>
> Basically all the internal strings are textish, so we're left with:

What do you mean by "internal"?  Anything in the headers or the CGI
environment is intrinsically "bytes-ish" to me.  Do you mean that you
want application programmers to have them transparently decoded?  If so,
we can make that the responsibility of the non-middleware framework /
application.

> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
>
> And there's a few things like REMOTE_USER that are kind of in the middle.
> Everyone is in agreement that bodies should be bytes.
>
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
> instance there's no good way to reconstruct the URL using the stdlib.  That
> explains certain tensions, but I think we should ignore that, and in fact
> that's what Python-Dev seemed to say pretty clearly.

python-dev seems to me to be coming to the realization that they should
have tried harder to make real-world apps work before they froze their
choices.

> Now, the other keys:
>
> wsgi.url_scheme: clearly ASCII
>
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).
> URL encoding happens at the byte layer, so a server could reasonably URL encode
> any non-ASCII characters without imposing any encoding.
>
> QUERY_STRING: should be ASCII, same as raw request path
>
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
> the specification.  The spec also implies you have to use the RFC2047 inline
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
> supporting it would probably be a bad idea for security reasons.  The
> Atompub spec (reasonably modern) specifically says Title headers should be
> encoded with RFC2047 (if they are not ISO-8859-1):
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
> decoding this kind of encoding at the application layer seems reasonable to
> me.
>
> cookie header: this specific header can easily have multiple encodings, as
> the browser encodes data then treats it as opaque bytes, so a cookie can be
> set via UTF-8 one place, Latin1 another, and those coexist in one header.
> That is, there is no real encoding and this should be treated as bytes.
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
> entirely workable.)
>
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
> practice it is almost always ASCII, and since it is not user-visible it's
> not something that really needs localization.
>
> response headers: the spec implies Latin1, in practice the Set-Cookie header
> is bytes (since interoperation with wonky legacy systems is not uncommon).
> I'm not sure of any other exceptions?
>
>
> So... to me it seems pretty reasonable for HTTP specifically that text can
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
> should be in that mode.  And it would also be weird if
> environ['SERVER_NAME'] was bytes.
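For concreteness, the cases listed above can be sketched with the
Python 3 stdlib.  This is only an illustration under stated assumptions:
the sample values and variable names are made up, and nothing here is
mandated by PEP 333.  It shows that percent-encoding operates on bytes,
that Latin-1 round-trips arbitrary header bytes (which is why it is a
workable stand-in for bytes), and that an RFC 2047 encoded word can be
decoded at the application layer.

```python
from email.header import decode_header, make_header
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Percent-encoding happens at the byte layer: no charset is needed to
# encode or decode a path, only to interpret the resulting bytes.
raw_path = b"/caf\xc3\xa9"                    # hypothetical request path
encoded_path = quote_from_bytes(raw_path)     # pure-ASCII str
decoded_path = unquote_to_bytes(encoded_path)

# A hypothetical Cookie header carrying two values that were set under
# different encodings (UTF-8 and Latin-1) and now coexist in one header.
raw_cookie = b"a=" + "\u00e9".encode("utf-8") + b"; b=" + "\u00e9".encode("latin-1")

# Latin-1 maps each byte 0-255 to one code point, so decoding cannot
# fail and re-encoding restores the original bytes exactly.
cookie_as_text = raw_cookie.decode("latin-1")
cookie_round_trip = cookie_as_text.encode("latin-1")

# RFC 2047 encoded word, decoded by the application rather than the server.
title = str(make_header(decode_header("=?iso-8859-1?q?some=20text?=")))
```

The Latin-1 text is only a carrier: cookie_as_text is not the "real"
text of either cookie value, and an application that cares still has to
go back to bytes and pick a decoding per value.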
> In the past when we've gotten down to specifics, the only holdup has been
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

I think I favor PJE's suggestion:  let WSGI deal only in bytes.


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          [email protected]
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkxA03wACgkQ+gerLs4ltQ7x0gCg03P1cT9RsJhagBERqY6SbLQ8
zu0An0T0YoFjzAb+2WjWp20DS3VeP68u
=ybUr
-----END PGP SIGNATURE-----
