On Saturday, July 17, 2010, Ian Bicking <i...@colorstudy.com> wrote:
> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <p...@telecommunity.com> wrote:
>
>
> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>
> And this doesn't help with Python 3: either we have byte values of 
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think 
> bytes will be more awkward to port to than text, and inconsistent with other 
> WSGI values.
>
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto 
> the app (or framework) developer...  who's really the only one who can make 
> the right decision for their particular application.  And personally, I'd 
> rather have clear boundaries between text and bytes, such that porting (even 
> if tedious or awkward) is *consistent*, and clear as to when you're finished, 
> not, "oh, did I check to make sure I converted SCRIPT_NAME and PATH_INFO...  
> not just in my app code, but in all the library code I call *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to realize 
> that we might just as well make the *entire* stack bytes (incoming and 
> outgoing headers *and* streams), and rewrite that bit in PEP 333 about using 
> str on "Python 3000" to say we go with bytes on Python 3+ for everything 
> that's a str in today's WSGI.
>
> This was my first intuition too, until I started thinking in more detail 
> about the particular values involved.  Some obviously are textish, like 
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>
> Basically all the internal strings are textish, so we're left with:
>
> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
>
> And there's a few things like REMOTE_USER that are kind of in the middle.  
> Everyone is in agreement that bodies should be bytes.
>
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for 
> instance there's no good way to reconstruct the URL using the stdlib.  That 
> explains certain tensions, but I think we should ignore that, and in fact 
> that's what Python-Dev seemed to say pretty clearly.
>
> Now, the other keys:
>
> wsgi.url_scheme: clearly ASCII
>
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old 
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL 
> encoding happens at the byte layer, so a server could reasonably URL encode 
> any non-ASCII characters without imposing any encoding.
>
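
To make the byte-layer point concrete, here is a rough sketch, assuming
Python 3's urllib.parse. The sample path is made up. It only shows that
percent-encoding can carry arbitrary bytes as ASCII without anyone picking
a character encoding.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical raw path bytes from a client. The 0xE9 byte is not valid
# UTF-8 here, so no particular encoding can be assumed.
raw_path = b"/caf\xe9/menu"

# quote() accepts bytes and percent-encodes each non-ASCII byte as-is.
encoded = quote(raw_path)

# The encoded form is plain ASCII text, and the original bytes come back
# unchanged when it is unquoted at the byte level.
restored = unquote_to_bytes(encoded)
```
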
> QUERY_STRING: should be ASCII, same as raw request path
>
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by 
> the specification.  The spec also implies you have to use the RFC2047 inline 
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and 
> supporting it would probably be a bad idea for security reasons.  The Atompub 
> spec (reasonably modern) specifically says Title headers should be encoded 
> with RFC2047 (if they are not ISO-8859-1): 
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- decoding 
> this kind of encoding at the application layer seems reasonable to me.
>
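
For what decoding RFC2047 at the application layer might look like, a
minimal sketch using the stdlib email.header module. decode_rfc2047 is a
hypothetical helper name, and the Latin1 fallback for unlabelled bytes is
my assumption, not something any spec mandates. An unknown charset label
would raise LookupError here, which a real application would want to
handle.

```python
from email.header import decode_header

def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus a charset label.
            # Assumption: fall back to Latin1 when no charset is given.
            parts.append(chunk.decode(charset or "latin-1"))
        else:
            # Headers with no encoded words come back as str already.
            parts.append(chunk)
    return "".join(parts)

title = decode_rfc2047("=?iso-8859-1?q?some=20text?=")
plain = decode_rfc2047("plain title")
```
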
> cookie header: this specific header can easily have multiple encodings, as 
> the browser encodes data then treats it as opaque bytes, so a cookie can be 
> set via UTF-8 one place, Latin1 another, and those coexist in one header.  
> That is, there is no real encoding and this should be treated as bytes.  
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but 
> entirely workable.)
>
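
A small sketch of why Latin1 works as a stand-in for bytes in the cookie
case. The header value is invented for illustration: two cookies, one set
as UTF-8 and one as Latin1, sitting in the same header.

```python
# Hypothetical Cookie header: "a" holds e-acute as UTF-8 (two bytes),
# "b" holds the same character as Latin1 (one byte).
raw_cookie = b"a=\xc3\xa9; b=\xe9"

# Latin1 maps every byte 0-255 to exactly one code point, so decoding
# cannot fail, whatever the bytes really meant.
as_text = raw_cookie.decode("latin-1")

# Re-encoding as Latin1 gives back the original bytes unchanged.
round_tripped = as_text.encode("latin-1")

# A strict UTF-8 decode of the same header fails on the Latin1 cookie.
try:
    raw_cookie.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The text you get is not "correct" for the UTF-8 cookie, but no information
is lost, so the application can still recover the bytes and decode each
cookie however it needs to.
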
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In 
> practice it is almost always ASCII, and since it is not user-visible it's not 
> something that really needs localization.
>
> response headers: the spec implies Latin1, in practice the Set-Cookie header 
> is bytes (since interoperation with wonky legacy systems is not uncommon).  
> I'm not sure of any other exceptions?
>
>
> So... to me it seems pretty reasonable for HTTP specifically that text can 
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and 
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] should 
> be in that mode.  And it would also be weird if environ['SERVER_NAME'] was 
> bytes.
>
> In the past when we've gotten down to specifics, the only holdup has been 
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

There were a few other weird ones, though they are server specific.
For example PATH_TRANSLATED (??). These are ones where, again, the
server or operating system dictates the encoding, because parts of
them derive from things like filesystem paths and server
configuration files. I laboriously went through all of these in an
email last year or earlier.

That is the same reason SCRIPT_NAME is really dictated by the server,
and why the raw value perhaps should be passed through to the
application.
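
If the raw bytes did reach the application, the decoding decision could be
made there. A rough sketch only: decode_path and its candidates parameter
are hypothetical names, and trying UTF-8 first with a Latin1 fallback is
just one policy an application might choose, not something any server or
spec does for you.

```python
def decode_path(raw, candidates=("utf-8",)):
    """Decode raw path bytes using an application-chosen policy (sketch)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: Latin1 as the last resort, since it never fails and the
    # original bytes can still be recovered with .encode("latin-1").
    return raw.decode("latin-1")

utf8_path = decode_path(b"/caf\xc3\xa9")    # valid UTF-8
legacy_path = decode_path(b"/caf\xe9")      # not valid UTF-8, falls back
```
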

Graham
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
