Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby wrote:
> So you better believe that everybody else is going to copy the worst available examples of other people's WSGI code and ignore any documentation associated with it... and then they will expect it to work on your server. ;-)

Amen to that, brother Phil! The million monkeys effect is massively amplified by the ease of cut-and-paste in modern editors. (In my day, we used 'ed' or 'cat': you kids get off my grass!)

Tres.
--
Tres Seaver tsea...@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
___
Web-SIG mailing list Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Sep 22, 2009, at 8:28 PM, P.J. Eby wrote:
> At 05:12 PM 9/22/2009 -0700, Philip Jenvey wrote:
>> Because our request container is a plain, pre-fabricated dict that doesn't permit the lazy behavior.
>
> Not quite true; you can always write a library function, get_foo(environ) that does the lazy caching in a private environ key, at the cost of also caching the original value and doing a consistency check.

Sure, that's what the Werkzeug et al WSGI 1 wrappers are already doing, I'm referring to the Python 3 WSGI level itself, assuming it returns latin1 decoded native strs.

You're talking about a separate process on top of WSGI -- this process becomes an unnecessary roundtrip compared to the WSGI 1 wrappers, as Armin points out.

--
Philip Jenvey
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
OK, that's quite exhaustive. For the benefit of those of us jumping in, could you summarise your proposal in something like the following manner:

1. How the request method is made available to WSGI applications
2. How the request-uri is made available to WSGI applications -- in particular, whether any decoding of punycode and/or %-escapes happens
3. How request headers are made available to WSGI apps
4. How the request body is made available to WSGI apps
5. Likewise for how apps should expose the response status message, headers and body to WSGI implementations.

Cheers,

On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:
> 2009/9/22 Mark Nottingham m...@mnot.net:
>> Reference?
>
> See:
>
> http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html
>
> Anyone else jumping in on this conversation with their own opinions and who has not read it, should perhaps at least read that. Also read some of the earlier posts in the numerous discussions this spawned at:
>
> http://groups.google.com/group/python-web-sig?lnk=
>
> as the current thinking isn't exactly what I blogged about and has shifted a bit as the discussion has progressed.
>
> Graham
>
>> On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:
>>> 2009/9/22 Mark Nottingham m...@mnot.net:
>>>> Most things is not the Web. How will you handle serving images through WSGI? Compressed content? PDFs?
>>>
>>> You are perhaps misunderstanding something. A WSGI application still should return bytes.
>>>
>>> The whole concept of any sort of fallback to allow unicode data to be returned for response content was purely so the canonical hello world application as per Python 2.X could still be used on Python 3.X.
>>>
>>> So, we aren't saying that the only thing WSGI applications can return is unicode strings for response content.
>>>
>>> Have you read my original blog post that triggered all this discussion this time around?
>>>
>>> Graham
>>>
>>>> On 22/09/2009, at 1:30 AM, René Dudfield wrote:
>>>>> here is a summary: Apart from Python 3 compatibility (which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode?

--
Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 P.J. Eby p...@telecommunity.com:
> I'm tending to flip-flop a bit myself

For the record, I am doing that as well.

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net:
> OK, that's quite exhaustive. For the benefit of those of us jumping in, could you summarise your proposal in something like the following manner:
>
> 1. How the request method is made available to WSGI applications
> 2. How the request-uri is made available to WSGI applications -- in particular, whether any decoding of punycode and/or %-escapes happens
> 3. How request headers are made available to WSGI apps
> 4. How the request body is made available to WSGI apps
> 5. Likewise for how apps should expose the response status message, headers and body to WSGI implementations.

Same as the WSGI PEP.

http://www.python.org/dev/peps/pep-0333/

Nothing has changed in that respect.

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
So, what advice do you propose about decoding bytes into strings for the request-URI / method / request headers, and vice versa for response headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8? How are errors handled?

Are bodies still treated as binary byte sequences, as per PEP 333?

Cheers,

On 22/09/2009, at 4:07 PM, Graham Dumpleton wrote:
> Same as the WSGI PEP.
>
> http://www.python.org/dev/peps/pep-0333/
>
> Nothing has changed in that respect.
>
> Graham

--
Mark Nottingham http://www.mnot.net/
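The ASCII / Latin-1 / UTF-8 question Mark raises can be made concrete with a few lines of Python 3. This is a minimal sketch rather than anything proposed in the thread; the sample byte strings and the `try_decode` helper are invented for illustration, and `surrogateescape` is included only because it comes up elsewhere in the discussion.

```python
def try_decode(data, encoding, errors="strict"):
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


utf8_bytes = b"caf\xc3\xa9"   # "café" encoded as UTF-8
latin1_bytes = b"caf\xe9"     # "café" encoded as Latin-1

# ASCII rejects any byte above 0x7F.
as_ascii = try_decode(utf8_bytes, "ascii")

# Latin-1 maps every byte to a code point, so it never fails and always
# round-trips, but UTF-8 input comes out as mojibake until it is transcoded.
as_latin1 = try_decode(utf8_bytes, "latin-1")
round_trip = as_latin1.encode("latin-1")

# UTF-8 gives the right text for UTF-8 input but fails on Latin-1 input...
as_utf8 = try_decode(utf8_bytes, "utf-8")
bad_utf8 = try_decode(latin1_bytes, "utf-8")

# ...unless undecodable bytes are smuggled through as lone surrogates.
escaped = try_decode(latin1_bytes, "utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```

The trade-off is visible in the results: only Latin-1 and surrogateescape preserve arbitrary bytes, and neither yields correct text without a second step.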
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net:
> So, what advice do you propose about decoding bytes into strings for the request-URI / method / request headers, and vice versa for response headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8? How are errors handled?
>
> Are bodies still treated as binary byte sequences, as per PEP 333?

I thought my blog post explained that reasonably well. Ensure you read the numbered definitions.

If you can't work it out from the blog, point at the specific thing in the blog you don't understand and I can help. Don't really want to go explaining it all again.

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
That blog entry is eleven printed pages. Given that PEP 333 also prints as eleven pages from my browser, I suspect there's some extraneous information in there.

Could you please summarise? Requiring all comers to read such a voluminous entry is a considerable (and somewhat arbitrary) bar to entry for the discussion.

Thanks,

On 22/09/2009, at 4:36 PM, Graham Dumpleton wrote:
> I thought my blog post explained that reasonably well. Ensure you read the numbered definitions.
>
> If you can't work it out from the blog, point at the specific thing in the blog you don't understand and I can help. Don't really want to go explaining it all again.
>
> Graham

--
Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
It's not a specific proposal, but here are my opinions on what a proposal should be:

On Tue, Sep 22, 2009 at 1:06 AM, Mark Nottingham m...@mnot.net wrote:
> OK, that's quite exhaustive. For the benefit of those of us jumping in, could you summarise your proposal in something like the following manner:
>
> 1. How the request method is made available to WSGI applications

Graham talked about it as bytes/unicode/native, where native is unicode on Python 3 and str on Python 2. For instance, I think there's general consensus (though not really specifically discussed) that environ keys should be native. I think method should be native.

> 2. How the request-uri is made available to WSGI applications -- in particular, whether any decoding of punycode and/or %-escapes happens

Hah, didn't even think about de-punycoding HTTP_HOST. That'd be a blast.

I think:

* scheme as native
* HTTP_HOST as native (no decoding of punycode)
* path as native (no URL decoding) - big break with WSGI 1 and CGI, but what the hell. I could easily waffle on this.
* query string as native - *should* be ASCII-safe currently.

Wow, that was easy! Request headers, which you didn't split out... those I'm not sure. I'd *like* them to be native. But damn, I'm just not sure quite how. surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape isn't so bad. And the headers *should* be ASCII for sane requests, so it's not a horrible compromise. I guess libraries could lazily transcode, just like they currently lazily decode. But it'd be a bit obnoxious at the library level. Transcoding middleware would be easier, but it adds the question of how to record that the transcoding has taken place.

> 3. How request headers are made available to WSGI apps

Request handlers? I don't understand your terminology.

> 4. How the request body is made available to WSGI apps

Ugh. wsgi.input could remain. I think at least it should become a file-like interface (i.e., giving an empty string when the content is exhausted) and I might even ask that it implement .tell() (.seek() would be nice of course, but optional). If there was some other idea, I think there's room for improvement on wsgi.input and the file interface. wsgi.input should definitely work with bytes only. I believe this is consensus.

> 5. Likewise for how apps should expose the response status message, headers and body to WSGI implementations.

I believe there is consensus that the response body should remain an iterator that yields bytes.

In one way, it'd be nice if we'd just say that status/headers should be ASCII, because that's the reasonable choice. But for proxying or representing HTTP as it is, it's not always the case. And I'm committed to keeping WSGI fully capable of representing arbitrary requests and responses so long as they aren't entirely diabolical. But, an ASCII status is not unreasonable, especially since there's zero semantic meaning to the reason. Which makes native strings perfectly fine.

So, headers... Well, Latin1 is easy enough. In theory, or at least particular theories, headers can be Latin1. And you can represent arbitrary bytes that way. So if you want to send crazy stuff to the browser, you can do it that way. And if you want to stick to plain ASCII then that's easy enough as well. So... native? str or unicode? I'm not sure specifically for this one.

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
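To picture how the pieces Ian lists would fit together, here is a rough Python 3 sketch of an application that reads a bytes-only wsgi.input, uses native strings for the status and headers, and returns an iterator of bytes. It is an illustration, not a proposal from the thread: `echo_app`, the fake `start_response`, and the sample body are invented, and the read loop assumes the file-like behaviour Ian asks for (an empty result once the content is exhausted), which PEP 333 servers are not required to provide.

```python
import io


def echo_app(environ, start_response):
    """Echo the request body back unchanged."""
    stream = environ["wsgi.input"]
    chunks = []
    while True:
        chunk = stream.read(4096)
        if not chunk:  # assumed file-like behaviour: empty bytes at end of body
            break
        chunks.append(chunk)
    body = b"".join(chunks)
    # Status and headers are native strings, kept to plain ASCII here.
    start_response("200 OK", [
        ("Content-Type", "application/octet-stream"),
        ("Content-Length", str(len(body))),
    ])
    return iter([body])  # the response body stays an iterator of bytes


# Exercise the app directly, without a server.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


fake_environ = {
    "REQUEST_METHOD": "POST",
    "wsgi.input": io.BytesIO(b"\x89PNG\r\n"),  # arbitrary binary payload
}
response_body = b"".join(echo_app(fake_environ, fake_start_response))
```

Binary content such as images passes through untouched, because no text decoding is applied to the body on either side.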
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net:
> That blog entry is eleven printed pages. Given that PEP 333 also prints as eleven pages from my browser, I suspect there's some extraneous information in there.
>
> Could you please summarise? Requiring all comers to read such a voluminous entry is a considerable (and somewhat arbitrary) bar to entry for the discussion.

If you aren't willing to read the PEP to understand WSGI why are you even wanting to participate in the discussion in the first place?

This is a quite detailed discussion about the future of the WSGI specification and not an IRC channel manned by ticket monkeys. :-(

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
You're twisting my words; nowhere did I say I wasn't willing to read the PEP. What I did say was that a proposal can and should be made in less than eleven pages; I'd like to give my feedback, both because I use Python and because I have some interest in HTTP. However, my time is limited, and I already have a stack of other things to review on my desk.

He who writes the most words does not (hopefully, for the sake of the Python community) win. I appreciate that you've taken the time to reason out a proposal, but the minutiae of how you got to that place should not obscure the proposal itself.

I'm not sure how to take your ticket monkeys comment, so I'll ignore it.

On 22/09/2009, at 4:44 PM, Graham Dumpleton wrote:
> If you aren't willing to read the PEP to understand WSGI why are you even wanting to participate in the discussion in the first place?
>
> This is a quite detailed discussion about the future of the WSGI specification and not an IRC channel manned by ticket monkeys. :-(
>
> Graham

--
Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Ian]
> OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO and introduce two equivalent variables that hold the NOT url-decoded values.

[Graham]
> That may be fine for pure Python web servers where you control the split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place, but you don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc., as that is done by the web server.
>
> Also, as pointed out in my blog, because of rewrites in the web server, it may be difficult to try and map SCRIPT_NAME and PATH_INFO back into the REQUEST_URI provided to try and reclaim original characters.
>
> There is also the problem that often FASTCGI totally stuffs up the SCRIPT_NAME/PATH_INFO split anyway and manual overrides are needed to tweak them.

This applies doubly under Java servlets, where different containers take different approaches to solve these rather hard problems. It is worth noting that they have to do so because the Java servlet spec, even under the most recent 2.5, punts on *all* of the issues being discussed here.

See here for how Tomcat does it. Or half does it, messily.

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

I know this is not helpful ;-)

Alan.
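The reason Ian wants variables holding the NOT url-decoded values can be shown in a few lines: once %-escapes are decoded, an escaped slash can no longer be told apart from a real path separator. The snippet below is a small illustration using the standard library's `urllib.parse.unquote`; the sample paths are made up.

```python
from urllib.parse import unquote

raw_a = "/files/a%2Fb"  # two segments: "files" and a single segment named "a/b"
raw_b = "/files/a/b"    # three segments: "files", "a" and "b"

# After decoding the whole path, the two requests look identical.
decoded_a = unquote(raw_a)
decoded_b = unquote(raw_b)

# Splitting on "/" before decoding each segment keeps the distinction.
segments_a = [unquote(part) for part in raw_a.split("/")[1:]]
segments_b = [unquote(part) for part in raw_b.split("/")[1:]]
```

Graham's objection still applies: an application can only split before decoding if the server hands it the raw path, and behind Apache or FASTCGI/SCGI/CGI the decoding has usually already happened.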
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Ian]
When things get messed up I recommend people use a middleware (paste.deploy.config.PrefixMiddleware, though I don't really care what they use) to fix up the request to be correct. Pulling it from REQUEST_URI would be fine.

That would be unworkable under java servlet containers, since they each take a different approach to addressing encoding issues, or fail to deal with them entirely. So there would probably have to be a special case for every single one of these

http://en.wikipedia.org/wiki/List_of_Servlet_containers

Each of which has a number of different ways of being configured in relation to these issues. I don't know if it would even be possible to write such a middleware. And retain all of one's hair.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net:

You're twisting my words; nowhere did I say I wasn't willing to read the PEP. What I did say was that a proposal can and should be made in less than eleven pages; I'd like to give my feedback, both because I use Python and because I have some interest in HTTP. However, my time is limited, and I already have a stack of other things to review on my desk. He who writes the most words does not (hopefully, for the sake of the Python community) win. I appreciate that you've taken the time to reason out a proposal, but the minutiae of how you got to that place should not obscure the proposal itself. I'm not sure how to take your ticket monkeys comment, so I'll ignore it.

Sorry if I come across as being short. None of us has time and this whole WSGI on Python 3.0 issue has been going on since the start of last year. Many of us are quite tired of it all. I also don't personally know who you are, not recollecting seeing your name in any past discussions. I am told, though, that you were involved back at the time of the original WSGI specification drafting, so apologies.

The ticket monkeys reference is just an allusion to a help desk. I always think of what happens when people jump on IRC as being the worst case. That is, they treat people there like help desk staff who only exist to serve them and not anyone else. So, you see people who have a complex problem pose a question in a single line. They then expect an even more complex solution to their problem, usually expressed in one line again.

There is a book I have been meaning to read called the 'Trusted Advisor', which apparently contrasts providing assistance as a ticket monkey (help desk) with building a relationship with people in order to understand their real issues and provide better solutions. Obviously being an advisor rather than a help desk is ultimately going to be better for the people needing help, but if the customer has the frame of mind that you are just the help desk and doesn't want to put any effort into the relationship, it is hard to try and be that advisor. So, I felt a bit like a help desk in the way I interpreted your comments.

Graham

On 22/09/2009, at 4:44 PM, Graham Dumpleton wrote:

2009/9/22 Mark Nottingham m...@mnot.net:

That blog entry is eleven printed pages. Given that PEP 333 also prints as eleven pages from my browser, I suspect there's some extraneous information in there. Could you please summarise? Requiring all comers to read such a voluminous entry is a considerable (and somewhat arbitrary) bar to entry for the discussion.

If you aren't willing to read the PEP to understand WSGI why are you even wanting to participate in the discussion in the first place? This is a quite detailed discussion about the future of the WSGI specification and not an IRC channel manned by ticket monkeys. :-(

Graham

Thanks,

On 22/09/2009, at 4:36 PM, Graham Dumpleton wrote:

2009/9/22 Mark Nottingham m...@mnot.net:

So, what advice do you propose about decoding bytes into strings for the request-URI / method / request headers, and vice versa for response headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8? How are errors handled? Are bodies still treated as binary byte sequences, as per PEP 333?

I thought my blog post explained that reasonably well. Ensure you read the numbered definitions. If you can't work it out from the blog, point at the specific thing in the blog you don't understand and I can help. Don't really want to go explaining it all again.

Graham

Cheers,

On 22/09/2009, at 4:07 PM, Graham Dumpleton wrote:

2009/9/22 Mark Nottingham m...@mnot.net:

OK, that's quite exhaustive. For the benefit of those of us jumping in, could you summarise your proposal in something like the following manner:

1. How the request method is made available to WSGI applications
2. How the request-uri is made available to WSGI applications -- in particular, whether any decoding of punycode and/or %-escapes happens
3. How request headers are made available to WSGI apps
4. How the request body is made available to WSGI apps
5. Likewise for how apps should expose the response status message, headers and body to WSGI implementations.

Same as the WSGI PEP.

http://www.python.org/dev/peps/pep-0333/

Nothing has changed in that respect.

Graham

Cheers,

On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:

2009/9/22 Mark Nottingham m...@mnot.net:

Reference?

See:

http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

Anyone else jumping in on this conversation with their own opinions and who has not read it, should perhaps at least read that. Also read some of the earlier posts in the numerous discussions this spawned at:

http://groups.google.com/group/python-web-sig?lnk=

as the current thinking isn't exactly what I blogged about and has shifted a bit
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

P.J. Eby wrote:
> Actually, latin-1 bytes encoding is the *simplest* thing that could possibly work, since it works already in e.g. Jython, and is actually in the spec already... and any framework that wants unicode URIs already has to decode them, so the code is already written.

Except that nobody implements that and that Jython has a standard Python 2.x byte string.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On 22/09/2009, at 6:11 PM, Armin Ronacher wrote:

Hi,

Mark Nottingham wrote:

HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but HTTPbis currently takes the stance that they're ASCII, as in practice Latin-1 isn't used and may introduce interop problems.

In practice non-ascii data ends up in headers.

Yes. However, it shouldn't be encouraged.

What does it mean to support non-ASCII headers? As per above, the only sane thing to do is treat them as opaque data, because you can't be certain of their encoding unless you have knowledge of the header.

Here is what http.server does in Python 3 (actual code):

    def send_header(self, keyword, value):
        """Send a MIME header."""
        if self.request_version != 'HTTP/0.9':
            self.wfile.write(("%s: %s\r\n" % (keyword, value)).encode('ASCII', 'strict'))

        if keyword.lower() == 'connection':
            if value.lower() == 'close':
                self.close_connection = 1
            elif value.lower() == 'keep-alive':
                self.close_connection = 0

So it will give you a nice UnicodeEncodeError if you try to send anything outside of the ASCII range as header.

Ouch.

-- Mark Nottingham http://www.mnot.net/
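The failure mode described in this message can be reproduced without running a server. The following is only a sketch: `format_header` is a hypothetical helper (not part of http.server) that applies the same strict ASCII encoding step as the quoted send_header() to a single header line.

```python
def format_header(keyword, value):
    """Encode one header line with the strict ASCII codec (hypothetical helper)."""
    return ("%s: %s\r\n" % (keyword, value)).encode("ascii", "strict")


# A pure-ASCII header encodes without trouble.
ascii_line = format_header("Content-Type", "text/plain")

# A value containing a non-ASCII character is rejected by the strict codec.
try:
    format_header("X-Name", "Ren\u00e9")
except UnicodeEncodeError as exc:
    error = exc
else:
    error = None
```

Under these assumptions, any header value outside the ASCII range fails at encode time rather than being written to the socket.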
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

Ian Bicking wrote:
> Request headers, which you didn't split out... those I'm not sure. I'd *like* them to be native. But damn, I'm just not sure quite how. surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape isn't so bad. And the headers *should* be ASCII for sane requests, so it's not a horrible compromise.

Except for cookie headers. Thanks to advertising and all the other systems putting headers on your page you can't even properly control that one.

Another thing to consider: in Python 3.1, the HTTP server internally decodes to latin1 and there is no simple way to change that, unless you replace the implementation. Ugh.

> wsgi.input could remain. I think at least it should become a file-like interface (i.e., giving an empty string when the content is exhausted) and I might even ask that it implement .tell() (.seek() would be nice of course, but optional). If there was some other idea, I think there's room for improvement on wsgi.input and the file interface.

-1 on seek and tell. This could be impossible to implement, and what we really want to do is to not have the data in memory but on disk or wherever you put big-ass uploads. Also it will be hard to test for an available seek or not, because even if it's a noop, the method could be there.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[P.J. Eby]
Actually, latin-1 bytes encoding is the *simplest* thing that could possibly work, since it works already in e.g. Jython, and is actually in the spec already... and any framework that wants unicode URIs already has to decode them, so the code is already written.

[Armin]
Except that nobody implements that

So, if nobody implements that, then why are we trying to standardise it? Is there a real need out there? Or are all these discussions solely driven by the need/desire to have only unicode strings in the WSGI dictionary under python 3?

Which is a worthy goal, IMHO. Java has been there since the very start, since java strings have always been unicode. Take a look at the java docs for HttpServlet: no methods return bytes/bytearrays.

http://java.sun.com/products/servlet/2.5/docs/servlet-2_5-mr2/javax/servlet/http/HttpServletRequest.html

But the java servlet spec still ignores *all* of the encoding concerns being discussed here. Which means that mistakes/mojibake must happen all the time. And it's up to the author of the individual java web application to solve those problems, using a mechanism appropriate for their needs and local environment.

Java programmers just tolerate this, although they may curse the developers of the servlet spec for not having solved their specific problem for them.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 3:16 AM, Armin Ronacher armin.ronac...@active-4.com wrote:
> Hi,
>
> Ian Bicking wrote:
> > Request headers, which you didn't split out... those I'm not sure. I'd *like* them to be native. But damn, I'm just not sure quite how. surrogateescape? Latin1? Latin1 as a kind of poor man's surrogateescape isn't so bad. And the headers *should* be ASCII for sane requests, so it's not a horrible compromise.
>
> Except for cookie headers. Thanks to advertising and all the other systems putting headers on your page you can't even properly control that one.

Yes, but it'd be relatively easy to handle this, especially since the raw header isn't very useful. So you just do

    environ['HTTP_COOKIE'].encode('latin1').decode('utf8', 'replace')

before parsing.

> Another thing to consider: in Python 3.1, the HTTP server internally decodes to latin1 and there is no simple way to change that, unless you replace the implementation. Ugh.
>
> > wsgi.input could remain. I think at least it should become a file-like interface (i.e., giving an empty string when the content is exhausted) and I might even ask that it implement .tell() (.seek() would be nice of course, but optional). If there was some other idea, I think there's room for improvement on wsgi.input and the file interface.
>
> -1 on seek and tell. This could be impossible to implement, and what we really want to do is to not have the data in memory but on disk or wherever you put big-ass uploads. Also it will be hard to test for an available seek or not, because even if it's a noop, the method could be there.

Tell doesn't have particular overhead except to keep track of how many bytes have been read. That would allow libraries to at least detect contention for wsgi.input. I wish seek were detectable, though I agree it shouldn't be required at all.

-- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
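The point that tell() only needs a running byte count can be sketched as a thin wrapper around whatever stream the server supplies. This is an illustration under assumptions, not anything proposed in the thread: `CountingInput` is a hypothetical name, and io.BytesIO stands in for a real wsgi.input.

```python
import io


class CountingInput:
    """Wrap a wsgi.input-like stream and count the bytes handed out (hypothetical)."""

    def __init__(self, stream):
        self._stream = stream
        self._pos = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self._pos += len(data)
        return data

    def readline(self, size=-1):
        data = self._stream.readline(size)
        self._pos += len(data)
        return data

    def tell(self):
        # No seeking is implied; this only reports how much was consumed.
        return self._pos


# io.BytesIO stands in for the request body stream.
body = CountingInput(io.BytesIO(b"a=1&b=2"))
first = body.read(3)
position = body.tell()
```

A library that finds tell() already non-zero could then conclude that some other component has consumed part of the body, which is the contention check mentioned above.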
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

Alan Kennedy wrote:
> So, if nobody implements that, then why are we trying to standardise it?

I think that was just one of the ideas that were discussed. Just to sum up a bit where we have already been:

- My initial plan was going bytes everywhere. Turns out, on Python 3 this is nearly impossible to do because the majority of the standard library went down a unicode path, even where bytes would be more appropriate (like cgi.FieldStorage, urllib.parse etc.)
- Graham, Robert (and now me as well) try to get charset guessing for URLs going, and decide on latin1 for the HTTP headers. latin1 could be re-decoded by the application if it really thinks it wanted utf-8, for instance. (Like cookie headers, and I guess only there.)
- One idea is enforcing unicode for all Python versions.
- One idea is going unicode for Python 3 and bytestrings for Python 2.
- New (and old) discussions bring up the surrogate escapes.

So it's quite hard to follow because different people talk about different ideas at the same time. And so far none of them looks really compelling.

> Is there a real need out there?

In python 3, yes. Because the stdlib no longer works with bytes and the bytes object has few string semantics left.

> Which is a worthy goal, IMHO. Java has been there since the very start, since java strings have always been unicode. Take a look at the java docs for HttpServlet: no methods return bytes/bytearrays.

And people appear to have problems with that, because what they are doing is using a specified charset that is by default iso-8859-1:

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

> Java programmers just tolerate this, although they may curse the developers of the servlet spec for not having solved their specific problem for them.

Many Java apps are also still using latin1 only or have all kinds of problems with charsets.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Alan]
Is there a real need out there?

[Armin]
In python 3, yes. Because the stdlib no longer works with bytes and the bytes object has few string semantics left.

Why can't we just do the same as the java servlet spec? I.E.

1. Ignore the encoding issues being discussed
2. Give the programmer (possibly mojibake) unicode strings in the WSGI environ anyway
3. And let them solve their problems themselves, using server configuration or bespoke middleware

[Alan]
Java programmers just tolerate this, although they may curse the developers of the servlet spec for not having solved their specific problem for them.

[Armin]
Many Java apps are also still using latin1 only or have all kinds of problems with charsets.

My point exactly. Many web developers simply never have to deal with these issues, perhaps a majority. The ones that do have to sort it out for themselves. To do so, the publishers of the various containers give them (non-standard) options to control the decoding of the incoming request and all of its component parts: you cited the Tomcat approach above. Other containers do it differently. Which means that i18n knowledge is not portable between containers.

It would be nice if we could avoid such a situation with i18n and WSGI. But I suppose I'm a little dubious that this group can out-do the enormous java community, and the enormous financial resources that Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve this complex problem satisfactorily.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Armin]
Because that problem was solved long ago in applications themselves. Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating on unicode. And the way they do that is straightforward.

So what are we all discussing? Those frameworks obviously have solved all of the problems of decoding incoming request components, e.g.

1. SCRIPT_NAME
2. PATH_INFO
3. QUERY_STRING
4. Etc

from miscellaneous unknown character sets into unicode, without any mistakes, under all possible WSGI environments, e.g.

1. Mod_wsgi
2. Modjy (java servlets)
3. IIS
4. CGI
5. FCGI
6. Etc

So why not just adopt one of those mechanisms, e.g. Django, and make it the de-facto standard? Since they all deliver unicode, python 3 is no longer a problem, since it permits only unicode strings.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Armin]
No, they know the character sets.

Hmmm, define "know" ;-)

[Armin]
You tell them what character set you want to use. For example you can specify utf-8, and they will decode/encode from/to utf-8.

But there is no way for the application to send information to the server before they are invoked to tell the server what encoding they want to use.

I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary, so that applications that have problems, i.e. that detect mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo the faulty decoding by the server.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 10:06 AM, Alan Kennedy a...@xhaus.com wrote:
> [Alan]
> Is there a real need out there?
>
> [Armin]
> In python 3, yes. Because the stdlib no longer works with bytes and the bytes object has few string semantics left.
>
> Why can't we just do the same as the java servlet spec? I.E.
>
> 1. Ignore the encoding issues being discussed
> 2. Give the programmer (possibly mojibake) unicode strings in the WSGI environ anyway
> 3. And let them solve their problems themselves, using server configuration or bespoke middleware
>
> [Alan]
> Java programmers just tolerate this, although they may curse the developers of the servlet spec for not having solved their specific problem for them.
>
> [Armin]
> Many Java apps are also still using latin1 only or have all kinds of problems with charsets.
>
> My point exactly. Many web developers simply never have to deal with these issues, perhaps a majority. The ones that do have to sort it out for themselves. To do so, the publishers of the various containers give them (non-standard) options to control the decoding of the incoming request and all of its component parts: you cited the Tomcat approach above. Other containers do it differently. Which means that i18n knowledge is not portable between containers.
>
> It would be nice if we could avoid such a situation with i18n and WSGI. But I suppose I'm a little dubious that this group can out-do the enormous java community, and the enormous financial resources that Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve this complex problem satisfactorily.
>
> Alan.

I think it's worth discussing and working something out that's good (good in various ways). As this is a python group, I think most of us think python does a whole bunch of things better than java (maybe wrongly... but still) ;-)

cu,
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

Alan Kennedy wrote:
> Hmmm, define "know" ;-)

The charset of incoming data, the charset of URLs, the charset of outgoing data, the charset of whatever the application uses, is what the application decides it to be. Most new applications go with utf-8 for everything these days.

> I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary.

SCRIPT_NAME and PATH_INFO are different because URLs as entered by the user will always be utf-8 in modern browsers. Even if the application decides to have latin1 URLs.

Of course a server configuration variable would be a solution for many of these problems, but I don't like the idea of changing application behavior based on server configuration. At that point we will finally have successfully killed the idea of nested WSGI applications, because those could depend on different charsets.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
[Armin]
Of course a server configuration variable would be a solution for many of these problems, but I don't like the idea of changing application behavior based on server configuration.

So you don't like the way that Django, Werkzeug, WebOb, etc, do it now, even though they appear to be mostly successful, and you're happy to cite them as such?

From the application's point of view, a framework-level configuration variable is the same as a server-level configuration variable.

[Armin]
At that point we will finally have successfully killed the idea of nested WSGI applications, because those could depend on different charsets.

Wouldn't well-written applications depend on unicode?

The server configured charset is simply an explicit statement of the character set from which incoming requests are to be decoded, into unicode, and no other character set.

Alan.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 12:12 PM, Armin Ronacher armin.ronac...@active-4.com wrote:
> Hi,
>
> Alan Kennedy wrote:
> > So you don't like the way that Django, Werkzeug, WebOb, etc, do it now, even though they appear to be mostly successful, and you're happy to cite them as such?
>
> Server != Application.
>
> > From the application's point of view, a framework-level configuration variable is the same as a server-level configuration variable.
>
> It is not. I can configure my framework from within Python code, but I cannot change the webserver configuration from there.
>
> > Wouldn't well-written applications depend on unicode?
>
> Only internally. There is no such thing as Unicode in HTTP.

hi,

other points I agree with... However, remember that there is unicode in HTTP these days. As per previous conversation on RFCs stating so... and real world use of unicode in HTTP.

cheers,
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Thank you Armin, this makes things clear to me (a newbie here).

On Sep 22, 2009, at 3:29 AM, Armin Ronacher wrote:
> - my initial plan was going bytes everywhere. Turns out, on Python 3 this is nearly impossible to do because the majority of the standard library went an unicode path, even where bytes would be more appropriate (like cgi.FieldStorage, urllib.parse etc.)

I would have taken the same stand.

> - Graham, Robert (and now me as well) try to get charset guessing for URLs going, decide on latin1 for the HTTP headers. latin1 could be re-decoded by the application if it really thinks it wanted utf-8 for instance. (Like cookie headers, only I guess only there)

If wsgi guesses the charset beforehand, will the application always be able to derive the original strings?

> - One idea is enforcing unicode for all Python versions
> - One idea is going unicode for Python 3 and bytestrings for Python 2

For what it matters, I prefer the latter option.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 04:44 PM 9/22/2009 +1000, Graham Dumpleton wrote:
> 2009/9/22 Mark Nottingham m...@mnot.net:
> > That blog entry is eleven printed pages. Given that PEP 333 also prints as eleven pages from my browser, I suspect there's some extraneous information in there. Could you please summarise? Requiring all comers to read such a voluminous entry is a considerable (and somewhat arbitrary) bar to entry for the discussion.
>
> If you aren't willing to read the PEP to understand WSGI why are you even wanting to participate in the discussion in the first place? This is a quite detailed discussion about the future of the WSGI specification and not an IRC channel manned by ticket monkeys. :-(

Um, Graham, Mark was a major contributor to the original PEP. See:

http://www.python.org/dev/peps/pep-0333/#acknowledgements

I assure you, he's read the PEP quite thoroughly. ;-)
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 09:23 AM 9/22/2009 +0100, Alan Kennedy wrote:
> [P.J. Eby]
> Actually, latin-1 bytes encoding is the *simplest* thing that could possibly work, since it works already in e.g. Jython, and is actually in the spec already... and any framework that wants unicode URIs already has to decode them, so the code is already written.
>
> [Armin]
> Except that nobody implements that
>
> So, if nobody implements that, then why are we trying to standardise it? Is there a real need out there? Or are all these discussions solely driven by the need/desire to have only unicode strings in the WSGI dictionary under python 3?
>
> Which is a worthy goal, IMHO. Java has been there since the very start, since java strings have always been unicode. Take a look at the java docs for HttpServlet: no methods return bytes/bytearrays.
>
> http://java.sun.com/products/servlet/2.5/docs/servlet-2_5-mr2/javax/servlet/http/HttpServletRequest.html
>
> But the java servlet spec still ignores *all* of the encoding concerns being discussed here. Which means that mistakes/mojibake must happen all the time. And it's up to the author of the individual java web application to solve those problems, using a mechanism appropriate for their needs and local environment.

Right, and we're not going to be able to solve all the problems either. What we want -- or at least what *I* want -- is to ensure that the design doesn't generate NEW opportunities for f***ing it up.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 11:28 AM 9/22/2009 +0200, Armin Ronacher wrote:
> Hi,
>
> Alan Kennedy wrote:
> > 2. Give the programmer (possibly mojibake) unicode strings in the WSGI environ anyway
> > 3. And let them solve their problems themselves, using server configuration or bespoke middleware
>
> Because that problem was solved long ago in applications themselves. Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating on unicode. And the way they do that is straightforward.
>
> Now currently what we have to do on Python 3 is to encode the data again and decode it with the target charset. Unnecessary roundtrips that just slow the whole thing down. What for?

What roundtrips? If they're operating on unicode, either they're in violation of the spec (in which case, f*** them), or they're already running a decode every time they pull something out of the environ... and using latin-1 or surrogates is only one encoding call different from what they're doing now.

So if anybody really cares about that one extra encode(), write a C function to do the transcode in a single step and add it to the stdlib for 2.x and 3, as well as publishing a standalone version. Voila. We're done and outta here.
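The "one extra encode()" being argued over can be written out in pure Python. This is a sketch only: `transcode` is a hypothetical helper name, it assumes the server decoded the raw bytes as latin-1, and it is not the C function suggested above.

```python
def transcode(value, charset="utf-8", errors="replace"):
    """Re-decode a latin-1-decoded native string using the charset the app wants.

    Hypothetical helper; assumes `value` was produced by bytes.decode('latin-1').
    """
    return value.encode("latin-1").decode(charset, errors)


# Bytes as they might arrive on the wire: the UTF-8 encoding of "/caf\u00e9".
wire = "/caf\u00e9".encode("utf-8")

# What a server decoding with latin-1 would place in the environ (mojibake).
environ_value = wire.decode("latin-1")

# latin-1 maps each byte to exactly one code point, so the bytes are recoverable.
recovered = transcode(environ_value)
```

Because latin-1 decoding never fails and never loses information, the application can always get back to the original bytes; the cost is the encode/decode pair on each access, which is the roundtrip Armin objects to.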
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:
> I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary, so that applications that have problems, i.e. that detect mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo the faulty decoding by the server.

This puts the burden on the wrong end of the pipe: there are more apps than servers and they would *all* have to check this in order to be sane.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 03:22 PM 9/22/2009 +0100, René Dudfield wrote:
> On Tue, Sep 22, 2009 at 3:07 PM, P.J. Eby p...@telecommunity.com wrote:
> > At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:
> > > I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary, so that applications that have problems, i.e. that detect mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo the faulty decoding by the server.
> >
> > This puts the burden on the wrong end of the pipe: there are more apps than servers and they would *all* have to check this in order to be sane.
>
> Except most everyone is using unicode in their apps already through frameworks.

Great, so only the frameworks need to change, and if we use utf8 surrogateescape, only the applications which need non-utf8 encoding will need to do anything differently. That's one factor weighing towards PEP 383, vs. continuing with latin-1 or going to bytes.

(Frankly, though, I'm getting tired of this handwaving about these frameworks that use unicode. If they are putting objects of type 'unicode' under WSGI-defined environ keys on Python 2.x, they are *not WSGI compliant*. And conversely, if they are doing some kind of conversion already, it's not gonna kill them to do a slightly different conversion to support the new version of WSGI.)
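For readers unfamiliar with PEP 383, the utf8-plus-surrogateescape option mentioned in this message can be sketched as follows. The helper names are hypothetical; the only library feature relied on is the "surrogateescape" error handler available in Python 3.1 and later.

```python
def decode_utf8_escaped(raw):
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_back(text):
    """Recover the original bytes from a surrogateescape-decoded string."""
    return text.encode("utf-8", "surrogateescape")


utf8_path = "/caf\u00e9".encode("utf-8")      # valid UTF-8 on the wire
latin1_path = "/caf\u00e9".encode("latin-1")  # b"/caf\xe9", not valid UTF-8

# The common case needs no further work by the application.
good = decode_utf8_escaped(utf8_path)

# The odd case survives as an escaped byte and can be re-decoded explicitly.
odd = decode_utf8_escaped(latin1_path)
restored = encode_back(odd).decode("latin-1")
```

Under this scheme an application that expects UTF-8 uses the value directly, and only an application that needs some other charset has to do the extra encode/decode step.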
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby [mailto:p...@telecommunity.com] At 07:40 PM 9/21/2009 -0700, Robert Brewer wrote: Yes; you have to transcode to the correct encoding. Once. Then every other WSGI application interface below that one doesn't have to care. You can only do that if you *break encapsulation*, which as I said earlier is voiding the entire point of having a modular interface. Requiring one component to run before another to achieve a correct result does not void modularity. Unix pipes employ a modular interface, but cat /etc/fstab | wc | head produces a very different result than cat /etc/fstab | head | wc. In such a system, encapsulation requires that the components not share state, but rather trust that they are composed correctly (yes, by some invisible hand) and that the given input is the intended one, even if that means a previous component transformed it. If, on the other hand, only utf-8-decoded strings can be passed as input to each WSGI component, then each WSGI component must be prepared to re-decode its inputs; in that case, each must be configured identically with the same logic to determine the correct decoding, since the correct decoding does not differ from one component to the next. That repeated configuration of the correct decoding is shared state, and breaks encapsulation; one-time transformation of inputs is not and does not. Having a configurable encoding just means that *every* WSGI application *must* verify the encoding in order to be safe. No, each can trust its inputs and do its intended job instead, if your idempotency requirement is relaxed. I'm all in favor of making everyone suffer equally, but all else being equal, I'd prefer them to suffer idempotently rather than conditionally. ;-) I know you do, but I don't see the community following your lead in that preference. Any middleware that alters the environ breaks idempotency. Any middleware that alters the output breaks idempotency. Most routing middleware breaks idempotency. 
There's a lot of all of those already in the wild. CherryPy doesn't care, because we marginalized WSGI middleware into near obscurity. We did that in large part because of the idempotency requirements of WSGI 1.0. We may have the only routing middleware that you could mistakenly put in your stack twice and get the same result! So I'm not fighting for myself/my framework on this; surrogateescape would work just fine for us since we ship very little middleware. But I don't think it would work fine for Paste, Pylons, Turbogears, Repoze, etcetera etcetera who have lots of WSGI middleware to port and more they want to build, and have been chafing for years now against this requirement. I believe they want full unicode SCRIPT_NAME and PATH_INFO, and would prefer a single, new, modular WSGI component be inserted in their component graphs than to build that logic into every WSGI component. They already have to deal with correct ordering in their WSGI component graphs, because they've already abandoned strict idempotency. Ben, Ian, Mark, Chris, et al, please confirm or deny that; I could be way off base. Robert Brewer fuman...@aminus.org ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
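A minimal sketch of the "transcode once, then trust your input" arrangement argued for above, assuming a Python 3 server that hands over latin-1-decoded native strings. `TranscodeMiddleware`, its `charset` parameter, and the sample path are hypothetical names for illustration, not part of any WSGI draft.

```python
class TranscodeMiddleware:
    """Hypothetical component: re-decode the path variables once, near the top of the stack."""

    def __init__(self, app, charset="utf-8"):
        self.app = app
        self.charset = charset

    def __call__(self, environ, start_response):
        for key in ("SCRIPT_NAME", "PATH_INFO"):
            value = environ.get(key, "")
            # Assumption: the server decoded the raw bytes as latin-1, so
            # encoding back to latin-1 recovers the original bytes exactly.
            environ[key] = value.encode("latin-1").decode(self.charset)
        return self.app(environ, start_response)


def app(environ, start_response):
    # Downstream code trusts its input and does not re-decode.
    body = environ["PATH_INFO"].encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body]


wrapped = TranscodeMiddleware(app)
# What a latin-1-decoding server would hand over for the UTF-8 path "/café".
raw_path = "/caf\u00e9".encode("utf-8").decode("latin-1")
result = wrapped({"PATH_INFO": raw_path, "SCRIPT_NAME": ""}, lambda status, headers: None)
```

The component is deliberately not idempotent: stacking it twice on this input raises UnicodeDecodeError, which is the ordering dependency the message says stacks already live with.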
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Alan Kennedy wrote:
> Why can't we just do the same as the java servlet spec?

Because Servlet is a walking, stinking demonstration of how *not* to handle encodings. Every servlet container has its own different method of selecting input character sets, and the default encoding is almost never right. Most deployed JSP applications out there are using the wrong charset and do the wrong thing with any non-ASCII character. This is not something to aim for.

Pushing the choice of encodings out to a 'deployment issue' where the application doesn't get to decide is a Wrong Thing. I hate dealing with this nonsense in Java and I do not want the same approach to become common in Python.

> I see this as being the same as Graham's suggested approach of a per-server configurable charset

This is absolutely the opposite of what I want as an application author. I want to hand out my WSGI application that uses UTF-8 and know that wherever it is deployed the non-ASCII characters will go through without getting mangled.

The application (perhaps via a framework it is using) is the party that is in the best place to know what character encoding it wants to deal with. Give the application a consistent way to handle that encoding itself, because the poor sod deploying it isn't going to know any better.

> Those frameworks obviously have solved all of the problems of decoding incoming request components from miscellaneous unknown character sets into unicode, without any mistakes

Er, no. That's the point. It cannot currently be done in all deployment environments. When they're not running via their own built-in servers, the frameworks have to do the same as the rest of us: guess.

That guess may not be as troublesome as it is in Java (mainly because for us it doesn't affect QUERY_STRING parameters), but it's still not reliable, which is why today you can't have a WSGI application with pretty non-ASCII URLs that will deploy consistently. I want this fixed.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

And Clover wrote:
> This is absolutely the opposite of what I want as an application author. I want to hand out my WSGI application that uses UTF-8 and know that wherever it is deployed the non-ASCII characters will go through without getting mangled.

I could not agree more. Probably the best way is indeed to use native strings on each Python version. Where native strings are unicode, the server should latin1-decode the values, and the raw, properly quoted forms of SCRIPT_NAME / PATH_INFO will be called wsgi.raw_script_name and wsgi.raw_path_info.

Regards, Armin
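To make the proposal concrete, here is a rough sketch of how an application might read both forms on Python 3. The key name wsgi.raw_path_info comes from the message above; the environ values are invented for illustration, and the assumption is that the server latin-1-decoded PATH_INFO and left the raw key percent-quoted.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical environ for a request to /caf%C3%A9.
environ = {
    "PATH_INFO": "/caf\u00c3\u00a9",        # bytes 2F 63 61 66 C3 A9 decoded as latin-1
    "wsgi.raw_path_info": "/caf%C3%A9",     # still percent-quoted
}

# Route 1: undo the latin-1 decode, then apply the encoding the app expects.
path_from_native = environ["PATH_INFO"].encode("latin-1").decode("utf-8")

# Route 2: unquote the raw form to bytes and decode those.
path_from_raw = unquote_to_bytes(environ["wsgi.raw_path_info"]).decode("utf-8")
```

Both routes give the application the same text without the server having to guess an encoding.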
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
+1

On Sep 22, 2009, at 10:45 AM, Armin Ronacher wrote:
> Hi,
>
> And Clover wrote:
>> This is absolutely the opposite of what I want as an application author. I want to hand out my WSGI application that uses UTF-8 and know that wherever it is deployed the non-ASCII characters will go through without getting mangled.
>
> I could not agree more. Probably the best way is indeed to use native strings on each Python version. Where native strings are unicode, the server should latin1-decode the values, and the raw, properly quoted forms of SCRIPT_NAME / PATH_INFO will be called wsgi.raw_script_name and wsgi.raw_path_info.
>
> Regards, Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Graham wrote:
> Armin is fast asleep now, so my shift.

Heh. It's a multiple-man job keeping up with this monster thread!

> The URLs don't break.

Not in themselves. Just the language of the PEP implies that to fix them up would contravene the spec:

    The application MUST use [the encoding guess for PATH_INFO] to decode the ``'QUERY_STRING'`` as well.

This isn't appropriate even as a SHOULD: the guessed encoding for PATH_INFO is very likely to be wrong, in particular for cases where the path was purely ASCII. The application (or a library/framework acting on its behalf) should be allowed to decode QUERY_STRING using whatever encoding it is expecting. Disallowing using anything other than utf-8 (and iso-8859-1 in a very unreliable way) makes it impossible to have queries in any other encoding at all and still comply with the spec, which is undesirable.

If this sentence is removed, and `wsgi.uri_encoding` is guaranteed to be one of:

a. definitive and reliable, or
b. missing/None

I'm pretty much happy. What I don't want is that half the future-WSGI servers/gateways decide they have to provide *some* value for `wsgi.uri_encoding` even if they're not quite sure if it's the right one. Then we're back to square one.

> if it is known that an application or some subset of URLs will always be receiving a request as non UTF-8, then it should employ code in those cases to always transcode it to the required encoding.

Yep, agreed. I think the PEP should clarify that; at the moment it is saying that a transcode is something you should only do for the iso-8859-1 case, but if you actually followed that advice you'd get highly inconsistent results. Perhaps we're at cross-purposes as to what exactly constitutes 'middleware'...

> The other fallback is that a specific WSGI server could elect to provide an option to not use 'UTF-8' as the first choice for decoding

I really, *really* hope this does not happen. That just brings us more deployment heartaches.

> Whether surrogateescape gives a better solution I have no idea at this point

Yeah... I'm highly suspicious of surrogateescape in a web context and personally my code will be deliberately filtering all such characters out. I can see it being a possible way to smuggle unwanted sequences (such as overlongs) through filters, potentially causing endless security problems. But we'll see...

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
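The smuggling worry can be illustrated with a short sketch, assuming the server decodes paths with UTF-8 plus the surrogateescape error handler. `has_escaped_bytes` is a hypothetical helper name and the sample path is invented.

```python
def has_escaped_bytes(text):
    """Return True if text contains lone surrogates left by surrogateescape."""
    # surrogateescape maps each undecodable byte 0x80-0xFF to U+DC80-U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# 0xC0 0xAF is an overlong (invalid) UTF-8 encoding of "/".
suspicious = b"/static\xc0\xaf..\xc0\xafsecret"
decoded = suspicious.decode("utf-8", "surrogateescape")

clean = "/static/caf\u00e9"

# A naive check for "/.." sees nothing, because the overlong slashes arrive
# as surrogate code points rather than as "/".
naive_filter_triggered = "/.." in decoded
```

Here `has_escaped_bytes(decoded)` is true while the naive check stays false, so an application taking the cautious line described above would reject or strip any value for which the helper returns true.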
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 3:07 PM, P.J. Eby p...@telecommunity.com wrote:
> At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:
>> I see this as being the same as Graham's suggested approach of a per-server configurable charset, which is then stored in the WSGI dictionary, so that applications that have problems, i.e. that detect mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo the faulty decoding by the server.
>
> This puts the burden on the wrong end of the pipe: there are more apps than servers and they would *all* have to check this in order to be sane.

Except most everyone is using unicode in their apps already through frameworks. If web clients, the HTTP RFCs (and most other internet protocols), Python, Python frameworks, and other languages' frameworks are all moving towards unicode, why should WSGI not?
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Sep 22, 2009, at 2:28 AM, Armin Ronacher wrote:
> Hi,
>
> Alan Kennedy wrote:
>> 2. Give the programmer (possibly mojibake) unicode strings in the WSGI environ anyway
>> 3. And let them solve their problems themselves, using server configuration or bespoke middleware
>
> Because that problem was solved long ago in applications themselves. Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating on unicode. And the way they do that is straightforward.

Werkzeug/WebOb/Paste all seem to have standardized on: return unicode, lazily decoded via a default encoding which can be overridden by the app via some API.

The Java servlet spec actually defines a ServletRequest.setCharacterEncoding(String enc) method, which lets the app override the encoding of the body/params (though not the URL on some containers), as long as it's done before the body is read. Pretty much what said Python wrappers are doing.

> Now currently what we have to do on Python 3 is to encode the data again and decode it with the target charset. Unnecessary roundtrips that just slow the whole thing down. What for?

Because our request container is a plain, pre-fabricated dict that doesn't permit the lazy behavior.

-- Philip Jenvey
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 05:12 PM 9/22/2009 -0700, Philip Jenvey wrote: Because our request container is a plain, pre-fabricated dict that doesn't permit the lazy behavior. Not quite true; you can always write a library function, get_foo(environ) that does the lazy caching in a private environ key, at the cost of also caching the original value and doing a consistency check. ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

Robert Brewer wrote:
> urllib.unquote, for one. We had to make a version which accepts bytes (and outputs bytes). But it's only 8 lines of code.

Here is a patch for urllib.parse that restores Python 2.x behavior. Because it also changes behavior for Python 3.x I have not yet submitted it for discussion: http://paste.pocoo.org/show/140739/

This adds byte support for all unquoting functions and URL parsing and joining. It also changes the quoting functions to return bytes when passed bytes. The latter is something that most likely will not survive a review on python-dev.

Regards, Armin
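For readers wondering what a bytes-in, bytes-out unquote looks like, here is one possible sketch. It is not Robert's code and not the linked patch; `unquote_bytes` is simply an illustrative name for a small regex-based version.

```python
import re

_HEX_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unquote_bytes(data):
    """Replace each %HH escape in a bytes object with the byte it names."""
    # Malformed escapes such as b"%zz" do not match and are left untouched.
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


unquoted = unquote_bytes(b"/caf%C3%A9")
untouched = unquote_bytes(b"/100%zz")
```

Python 3's standard library also offers `urllib.parse.unquote_to_bytes`, which covers the unquoting half of the use case.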
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Sun, Sep 20, 2009 at 11:25 PM, Chris McDonough chr...@plope.com wrote: WSGI is a fairly low-level protocol aimed at folks who need to interface a server to the outside world. The outside world (by its nature) talks bytes. I fear that any implied conversion of environment values and iterable return values to Unicode will actually eventually make things harder than they are now. I realize that it would make middleware implementors lives harder to need to deal in bytes. However, at this point, I also believe that middleware kinda should be hard. We have way too much middleware that shouldn't be middleware these days (some written by myself). Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an interface to HTTP should deal in bytes as well. The problem, really is that despite being a very low-level interface, WSGI has a tendency to leak up into much higher-level code, and (IMO) authors of that high-level code really shouldn't have to waste their time dealing with details of the underlying low-level gateway. You've said you don't want to hear Python 3 as the reason, but it provides some useful examples: in high-level code you'll commonly want to be doing things like, say, comparing parts of the requested URL path to known strings or patterns. And that high-level code will almost certainly use strings, while WSGI, in theory, will be using bytes. That's just a recipe for disaster; if WSGI mandates bytes, then bytes will have to start infecting much higher-level code (since Python 3 -- rightly -- doesn't let you be nearly as promiscuous about mixing bytes and strings). Once I'm at a point where I can use Python 3, I know I'll personally be looking for some library which will normalize everything for me before I interact with it, precisely to avoid this sort of leakage; if WSGI itself would at least *allow* that normalization to happen at the low level (mandating it is another discussion entirely) I'd feel much happier about it going forward. 
-- Bureaucrat Conrad, you are technically correct -- the best kind of correct. ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
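The leakage described above is easy to reproduce on Python 3. This is only an illustration with an invented path; it shows why code that compares a bytes path against str literals either silently misses or fails outright.

```python
raw_path = b"/blog/2009/"

# bytes and str never compare equal, so this test silently fails to match.
same = raw_path == "/blog/2009/"

# Mixing the two types in a method call is an outright error.
try:
    raw_path.startswith("/blog/")
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Normalizing once at the boundary lets the higher-level code stay in str.
text_path = raw_path.decode("utf-8")
matches = text_path.startswith("/blog/")
```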
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi, James Bennett schrieb: Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an interface to HTTP should deal in bytes as well. If it was just that I would be happy to stay with bytes. But unless the standard library changes in the way it works on Python 3 there is not much but unicode we can use. bytes no longer behave like strings, it's not very comfortable to work with them. Regards, Armin ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 1:28 AM, Armin Ronacher armin.ronac...@active-4.com wrote: If it was just that I would be happy to stay with bytes. But unless the standard library changes in the way it works on Python 3 there is not much but unicode we can use. bytes no longer behave like strings, it's not very comfortable to work with them. Indeed. Hence my comments about WSGI leaking up into other code. Now that bytes and strings are incompatible, a lot of code which relied on (arguably) a wart in Python will break. -- Bureaucrat Conrad, you are technically correct -- the best kind of correct. ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
OK, after some consideration, I think I'm sold. Answering my own original question about why unicode seems to make sense as values in the WSGI environment even without consideration for Python 3 compatibility: *something* needs to do this translation. Currently I personally rely on WebOb to do a lot of this translation. I can't think of a good reason that implementations at the level of WebOb would each need to do this translation work; pushing the job into WSGI itself seems to make sense here. This is particularly true for PATH_INFO and QUERY_STRING; these days it's foolish to assume these values will be entirely composed of low order characters, and thus being able to access them as bytes natively isn't very useful. OTOH, I suspect the Python 3 stdlib is still broken if it requires native strings in various places (and prohibits the use of bytes). James Bennett wrote: On Sun, Sep 20, 2009 at 11:25 PM, Chris McDonough chr...@plope.com wrote: WSGI is a fairly low-level protocol aimed at folks who need to interface a server to the outside world. The outside world (by its nature) talks bytes. I fear that any implied conversion of environment values and iterable return values to Unicode will actually eventually make things harder than they are now. I realize that it would make middleware implementors lives harder to need to deal in bytes. However, at this point, I also believe that middleware kinda should be hard. We have way too much middleware that shouldn't be middleware these days (some written by myself). Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an interface to HTTP should deal in bytes as well. The problem, really is that despite being a very low-level interface, WSGI has a tendency to leak up into much higher-level code, and (IMO) authors of that high-level code really shouldn't have to waste their time dealing with details of the underlying low-level gateway. 
You've said you don't want to hear Python 3 as the reason, but it provides some useful examples: in high-level code you'll commonly want to be doing things like, say, comparing parts of the requested URL path to known strings or patterns. And that high-level code will almost certainly use strings, while WSGI, in theory, will be using bytes. That's just a recipe for disaster; if WSGI mandates bytes, then bytes will have to start infecting much higher-level code (since Python 3 -- rightly -- doesn't let you be nearly as promiscuous about mixing bytes and strings). Once I'm at a point where I can use Python 3, I know I'll personally be looking for some library which will normalize everything for me before I interact with it, precisely to avoid this sort of leakage; if WSGI itself would at least *allow* that normalization to happen at the low level (mandating it is another discussion entirely) I'd feel much happier about it going forward. ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 7:28 AM, Armin Ronacher armin.ronac...@active-4.com wrote:
> Hi,
>
> James Bennett wrote:
>> Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an interface to HTTP should deal in bytes as well.
>
> If it was just that I would be happy to stay with bytes. But unless the standard library changes in the way it works on Python 3 there is not much but unicode we can use. bytes no longer behave like strings, it's not very comfortable to work with them.

I think HTTP traffic is increasingly UTF-8 these days. Also, most upper-level frameworks use unicode natively. So it makes sense to use UTF-8 natively, as an option.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough chr...@plope.com wrote: OTOH, I suspect the Python 3 stdlib is still broken if it requires native strings in various places (and prohibits the use of bytes). yes, python3 stdlib should support 'str'(the old unicode), 'buffer' and 'bytes' for web using stuff. Buffer is important because it's a type also used for sockets(along with bytes) and it allows less memory allocation (because you can reuse buffers). cheers, ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
René Dudfield schrieb: On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough chrism-ccarnewbnkgavxtiumw...@public.gmane.org wrote: OTOH, I suspect the Python 3 stdlib is still broken if it requires native strings in various places (and prohibits the use of bytes). yes, python3 stdlib should support 'str'(the old unicode), 'buffer' and 'bytes' for web using stuff. Buffer is important because it's a type also used for sockets(along with bytes) and it allows less memory allocation (because you can reuse buffers). Please don't confuse readers and use the correct name, i.e. 'bytearray' instead of 'buffer'. Georg ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 9:46 AM, Georg Brandl g.bra...@gmx.net wrote:
> René Dudfield wrote:
>> On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough chrism-ccarnewbnkgavxtiumw...@public.gmane.org wrote:
>>> OTOH, I suspect the Python 3 stdlib is still broken if it requires native strings in various places (and prohibits the use of bytes).
>>
>> yes, python3 stdlib should support 'str'(the old unicode), 'buffer' and 'bytes' for web using stuff. Buffer is important because it's a type also used for sockets(along with bytes) and it allows less memory allocation (because you can reuse buffers).
>
> Please don't confuse readers and use the correct name, i.e. 'bytearray' instead of 'buffer'.
>
> Georg

Let me try and reduce the confusion...

There are two different python types the py3k socket module uses: 'bytes' and 'buffer'. 'bytes' is kind of like str in python3... but with reduced functionality (no formatting, fewer methods etc). buffer is a Py_buffer from the c api.

buffer, and bytes in socket: http://docs.python.org/3.1/library/socket.html#socket.socket.recvfrom_into
bytearray: http://docs.python.org/3.1/library/functions.html#bytearray
bytes: http://docs.python.org/3.1/library/functions.html#bytes
buffer: http://docs.python.org/3.1/c-api/buffer.html

This is separate, but related to the point of bytes vs unicode. It is really (bytes and buffer) vs unicode - since bytes and buffer can be used with socket. socket never uses a python2 'unicode', or a python3 'str' type.
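A small sketch of the buffer-reuse point, using `socket.socketpair()` so it runs without a network. It assumes a platform where `socketpair` is available and that each short message arrives in a single read, which is typical for a local pair but not guaranteed for stream sockets in general.

```python
import socket

left, right = socket.socketpair()

buf = bytearray(64)        # allocated once and reused for every read
view = memoryview(buf)     # recv_into accepts any writable buffer object

left.sendall(b"GET / HTTP/1.1")
count = right.recv_into(view)
first = bytes(view[:count])

left.sendall(b"Host: example")
count = right.recv_into(view)   # overwrites the start of the same buffer
second = bytes(view[:count])

left.close()
right.close()
```

Both reads hand back bytes-like data; at no point does the socket layer produce a `str`, which is the distinction the message draws.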
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote: At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote: Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode. Bonus points for an explanation that does not boil down to it will be compatible with Python 3. +1. I'd really rather not have the spec dictated by the need to work around problems in the stdlib or language definition. Better to fix them ASAP. hi, here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode? ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 4:27 PM, James Bennett ubernost...@gmail.com wrote: On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby p...@telecommunity.com wrote: +1. I'd really rather not have the spec dictated by the need to work around problems in the stdlib or language definition. Better to fix them ASAP. This is a *Python* web server gateway interface, yes? Fixing stdlib bugs is fine, but asking for the language to change just to make gateway interfaces a bit easier to write seems a bit much; I'd hope we can take Python the language as granted, and work from there. Hi, I mostly agree... However, python3.x changes are still up for grabs... so if there's a good enough reason, now is the time to ask for changes. I don't see them changing the way unicode, strings and bytes work too much though. cheers, ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 04:30 PM 9/21/2009 +0100, René Dudfield wrote: On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote: At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote: Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode. Bonus points for an explanation that does not boil down to it will be compatible with Python 3. +1. I'd really rather not have the spec dictated by the need to work around problems in the stdlib or language definition. Better to fix them ASAP. hi, here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Since WSGI is based on HTTP, please cite RFCs, not applications. Thanks. ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 4:42 PM, P.J. Eby p...@telecommunity.com wrote: At 04:30 PM 9/21/2009 +0100, René Dudfield wrote: On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote: At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote: Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode. Bonus points for an explanation that does not boil down to it will be compatible with Python 3. +1. I'd really rather not have the spec dictated by the need to work around problems in the stdlib or language definition. Better to fix them ASAP. hi, here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Since WSGI is based on HTTP, please cite RFCs, not applications. Thanks. Hi, That seems a strange thing to say. HTTP use is based on not only RFCs but real applications. Web Server Gateway Interface is not just about HTTP obviously, and talks about python and web server issues... it hardly restricts itself to HTTP. See IRIs: http://www.w3.org/International/O-URL-and-ident.html Which links to a number of things including rfc2718, which specifies utf-8 for URIs: http://www.ietf.org/rfc/rfc2718.txt Character encoding section: Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279) [3] and then subsequently using the %HH encoding for unsafe octets is recommended. Which seems sensible. Having fallback to the raw bytes available also seems sensible. For the reasons discussed in previous posts. cheers, ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
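The RFC 2718 recommendation quoted above (translate to UTF-8, then %HH-escape the unsafe octets) is what Python 3's `urllib.parse.quote` does by default. A brief illustration with an invented path:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

path = "/caf\u00e9/men\u00fc"

# quote() encodes to UTF-8 by default and percent-escapes the unsafe octets.
escaped = quote(path)

# Decoding as UTF-8 recovers the text...
round_trip = unquote(escaped)

# ...while the raw octets remain available as a fallback.
raw = unquote_to_bytes(escaped)
```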
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Sun, Sep 20, 2009 at 8:06 AM, Armin Ronacher armin.ronac...@active-4.com wrote: Thanks to Graham Dumpleton and Robert Brewer there is some serious progress on WSGI currently. I proposed a roadmap with some PEP changes now that need some input. Summary: WSGI 1.0 stays the same as PEP 0333 currently is WSGI 1.1 becomes what Ian and I added to PEP 0333 WSGI 2.0 becomes a unicode powered version of WSGI 1.1 WSGI 3.0 becomes WSGI 2.0 just without start_response WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach. 1.1 I think of as an errata on 1.0, so... simple enough. I was skeptical about a unicode version of WSGI, but I think I'm okay with it now. For people who use UTF-8-only it should be fairly simple and easy; for people who want to deal with other encodings, backward compatible URLs, or other weirdness I think surrogateescape can resolve the small handful of problems. Maybe an option to use latin1 (at the server level) would do the same for Python 2, as a deployment option for people who are dealing with these tricky issues. Which is kind of lame, but it means everything is still *possible*, and the use cases are somewhat obscure. Especially because QUERY_STRING and wsgi.input remain bytes. (Well, I guess the other case would be someone reading a cookie set by an application they do not control, and set in a crazy way... but anyway, there's a handful of use cases where things get tricky, but we can kind of punt, or try to implement the necessary transcoding routines before the spec is final.) I'm very much opposed to a second raw version of the request, as I do not like redundancy. With respect to 3.0/start_response, I'd rather we just do both at once, so there's not so many versions of WSGI to worry about. Also it doesn't feel like a very difficult change to make. 
The only other major issue is wsgi.input, which is a quite awkward interface to the request body. But I think resolving that is harder than start_response, in particular because there's no clear solution. Maybe at least switching to a file interface would be better. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby wrote: At 07:57 AM 9/21/2009 +0200, Armin Ronacher wrote: Chris McDonough schrieb: Personally, I find it a bit hard to get excited about Python 3 as a web application deployment platform. Everybody feels that way currently. But if we don't fix WSGI that will never change. This is only compounding the errors introduced by the make the tests pass philosophy of porting the stdlib. We should not make them worse. At the moment (AFAIK) nobody has gone through the web bits of the stdlib and asked, Should this work on strings, bytes, or both, and if both, how should that API be expressed? Perhaps not, but I wrote unquote_bytes at PyCon 2009, after discussing urllib in the python-dev room and being told no bytes-compatible version was desired in the stdlib. So *some* thought has gone into it. Robert Brewer fuman...@aminus.org ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 11:09:24AM -0500, Ian Bicking wrote: I think surrogateescape can resolve the small handful of problems. +1 surrogateescape would be a great alternative to the try utf-8 then latin-1 approach. It would simplify the gateway and the application. No need to check some 'encoding' variable and transcode later. We just encode everything to UTF-8, no special case. surrogateescape isn't implemented (yet?) for Python 2. That's not an issue if the 'new' WSGI sticks to native strings. -- Henry Prêcheur ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
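A minimal sketch of how surrogateescape could replace the "try utf-8 then latin-1" dance on Python 3. The helper names `decode_path` and `recover_bytes` are hypothetical; the point is only that one decode rule covers both cases and no separate encoding variable is needed.

```python
def decode_path(raw_bytes):
    # Hypothetical gateway step: always UTF-8, never fails.
    return raw_bytes.decode("utf-8", "surrogateescape")


def recover_bytes(text):
    # Exact inverse: escaped bytes come back unchanged.
    return text.encode("utf-8", "surrogateescape")


# A well-formed UTF-8 path decodes normally.
utf8_path = decode_path("/caf\u00e9".encode("utf-8"))

# A legacy latin-1 path survives as an escaped byte instead of raising.
legacy_path = decode_path("/caf\u00e9".encode("iso-8859-1"))

# An application that knows better can get the bytes back and transcode.
transcoded = recover_bytes(legacy_path).decode("iso-8859-1")
```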
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Armin Ronacher wrote:
The middleware can never know.

It's much more likely to know than the server, though!

WSGI will demand UTF-8 URLs and only provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs break as soon as they coincidentally happen to be valid UTF-8 byte sequences. I'm as much an advocate of "UTF-8 for everything everywhere!" as anyone else, but unfortunately today there are still dark places where you need non-UTF-8 URLs.

Incidentally, if wsgi.uri_encoding is going to be the way to signal that the server has decoded bytes to characters using a known encoding, it should be stressed that this should only be set when that encoding is certain. That is, wsgi.uri_encoding should be omitted (or None?) in cases where another party has already decoded (and maybe mangled) the bytes using an unknown encoding. In particular, CGI.

(In the case of Windows CGI the server will have decoded URI bytes into Unicode characters, using a charset which it is impossible to find out. In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF-8 sequence, otherwise it's the system codepage. This problem affects the non-CGI implementation isapi_wsgi, too. Then the variables are read as environment variables, which for Python 2 means another encode/decode step on Windows using the system codepage, mangling non-codepage characters. Python 3 has the opposite problem: it reads byte envvars using UTF-8, which won't be how Apache put them there.)

If wsgi.uri_encoding is obligatory then in reality it will often be wrong, leaving us in the same pathetic predicament as with WSGI 1.0, where non-ASCII URIs don't work reliably at all.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
René Dudfield wrote:
On Mon, Sep 21, 2009 at 6:05 PM, Robert Brewer fuman...@aminus.org wrote:
Armin Ronacher wrote:
WSGI will demand UTF-8 URLs and only provide iso-XXX support for backwards compatibility.

WSGI cannot demand that; a recommendation for utf-8 in a few draft specifications is at least a decade removed from ubiquitous implementation. We can default to utf-8 at best. I discussed this at length in http://mail.python.org/pipermail/web-sig/2009-August/003948.html

that post does have good arguments why a single encoding is not acceptable. utf-8 seems the most common at this point to be the default... but we do need a way to specify encoding. Is that what you're saying Robert? Do you have a suggestion for specifying encodings?

CherryPy 3.2 does this (pseudocode):

    try:
        decode_uri(userdefault or 'utf-8')
    except UnicodeDecodeError:
        decode_uri('iso-8859-1')

I think surrogateescape will handle the issues with allowing bytes to be stored in utf-8. http://www.python.org/dev/peps/pep-0383/ However, I think that is only implemented in python 3.1?... but maybe there is some way to have it work on other pythons too?

As Henry Prêcheur says, that's not an issue if the 'new' WSGI sticks to native strings. Which I'd be happy with.

How about... Being able to request which encoding you want has the benefit of only having to store one representation before 'baking' the result into the environ. So if someone only ever wants utf-8 they can get it... however if they choose to 'bake' the environ then they can request something else. This is similar to a per server setting, but I think should work with middleware too?

As noted above, it *is* a per-server setting in CherryPy 3.2. And any middleware can certainly be configured as its authors see fit; I don't see a need for a generic mechanism to specify what encodings middleware should try. However, we still need a generic mechanism declaring which encoding was successfully used; this is 'wsgi.uri_encoding'.
As multiple things should be available, and if baked, middleware (if it wants to modify things) will need to change each version of things. These 'baking' methods could live in wsgi to simplify modifying the environ's multiple versions of things. It would just have some get/set functions to put correct handling of encodings in one place. Of course middleware is still free to change things as it wants.

I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably.

Robert Brewer fuman...@aminus.org
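The fallback described in this message (try the configured or default encoding, fall back to iso-8859-1, and record which one worked in 'wsgi.uri_encoding') could look roughly like this on Python 3. This is only a sketch, not CherryPy's actual code: the names decode_request_uri, populate_environ and user_default are made up for illustration.

```python
def decode_request_uri(raw_path, user_default=None):
    """Decode raw Request-URI bytes; return (text, encoding_used)."""
    first_choice = user_default or "utf-8"
    try:
        return raw_path.decode(first_choice), first_choice
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw_path.decode("iso-8859-1"), "iso-8859-1"


def populate_environ(environ, raw_path, user_default=None):
    """Store the decoded path and declare which encoding was used."""
    text, used = decode_request_uri(raw_path, user_default)
    environ["PATH_INFO"] = text
    environ["wsgi.uri_encoding"] = used
    return environ


# The same path sent as UTF-8 bytes and as Latin-1 bytes.
env_utf8 = populate_environ({}, "/café".encode("utf-8"))
env_latin1 = populate_environ({}, "/café".encode("iso-8859-1"))
```

Both requests end up with the same text in PATH_INFO; only the declared 'wsgi.uri_encoding' differs, which is what lets a later component undo the decoding if it knows better.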
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote: I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably. The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it. Maybe I'm missing something here, but the only way I see to preserve composability here is to use latin-1 or bytes. The fundamental problem is that, like it or not, HTTP headers are actually byte strings. The *only* reason we ever supported unicode in WSGI was to handle platforms where there's no such thing as a non-unicode string, and there we made it explicit that it's just a way of manipulating *bytes*, not unicode. ISTM that very few (if any) of the proposals floating around for modifying WSGI are taking this concept into account. Most of them sound to me like people saying, yeah, but this particular hack will work for *my* apps... so everybody else must be doing something stupid. But WSGI was built on the principle of *equally inconveniencing everyone*, specifically to avoid an impossible attempt at consensus between incompatible ways of doing things. (E.g., nine million request/response APIs.) 
So, if the only problem we're going to cause by using bytes everywhere is to make everyone need to change their routing code on Python 3, I vote +1000. ;-)
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 8:31 PM, P.J. Eby p...@telecommunity.com wrote:
At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably.

The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it.

It seems latin-1 has the same problem. If middleware makes an arbitrary change, how can later things know what it's done?
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby wrote:
At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably.

The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it.

I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are unicode, the only answer to "what's been done to the URI?" can be wsgi.uri_encoding, which allows someone to un-do it. What more do you want?

1. Bytes arrive. The server decodes with utf-8 and sets 'wsgi.uri_encoding' to 'utf-8'.
2. Middleware says "oops, that's wrong", encodes back to bytes using 'utf-8', and re-decodes with koi8-r, changing wsgi.uri_encoding to 'koi8-r'.
3. Further middlewares and the app use the unicode value, and don't really care what encoding was used.

Maybe I'm missing something here, but the only way I see to preserve composability here is to use latin-1 or bytes. The fundamental problem is that, like it or not, HTTP headers are actually byte strings. The *only* reason we ever supported unicode in WSGI was to handle platforms where there's no such thing as a non-unicode string, and there we made it explicit that it's just a way of manipulating *bytes*, not unicode.
ISTM that very few (if any) of the proposals floating around for modifying WSGI are taking this concept into account. Most of them sound to me like people saying, "yeah, but this particular hack will work for *my* apps... so everybody else must be doing something stupid." But WSGI was built on the principle of *equally inconveniencing everyone*, specifically to avoid an impossible attempt at consensus between incompatible ways of doing things. (E.g., nine million request/response APIs.) So, if the only problem we're going to cause by using bytes everywhere is to make everyone need to change their routing code on Python 3, I vote +1000. ;-)

That's not the only problem. Using native strings wherever possible makes web programming in Python easier, regardless of version. In Python 3, that happens to be unicode, for good reasons. For HTTP, there's a more specific reason: URIs should be compared for equivalence character by character, not byte by byte. See http://tools.ietf.org/html/rfc3986#section-6.2.1. That includes routing middleware.

Robert Brewer fuman...@aminus.org
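The "undo and redo" step that wsgi.uri_encoding is meant to allow can be sketched as a small helper. This is an illustration under assumptions, not part of any WSGI draft: the function name retranscode is hypothetical, the server is assumed to have fallen back to iso-8859-1, and Python's codec name for the KOI8 variant used here is 'koi8-r'.

```python
URI_KEYS = ("SCRIPT_NAME", "PATH_INFO", "QUERY_STRING")


def retranscode(environ, new_encoding, keys=URI_KEYS):
    """Undo the server's decoding and redo it with new_encoding."""
    old_encoding = environ["wsgi.uri_encoding"]
    for key in keys:
        if key in environ:
            raw = environ[key].encode(old_encoding)   # back to wire bytes
            environ[key] = raw.decode(new_encoding)   # decode as intended
    environ["wsgi.uri_encoding"] = new_encoding
    return environ


# What the client actually sent: KOI8-R bytes.
wire_bytes = "/привет".encode("koi8-r")

# What a server that guessed iso-8859-1 would have stored.
environ = {
    "PATH_INFO": wire_bytes.decode("iso-8859-1"),
    "wsgi.uri_encoding": "iso-8859-1",
}

retranscode(environ, "koi8-r")
```

The helper only works if the declared encoding is the one that was really used, which is the reliability concern raised elsewhere in the thread.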
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 01:15 PM 9/21/2009 -0700, Robert Brewer wrote: I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are unicode, the only answer to what's been done to the URI? can be wsgi.uri_encoding, which allows someone to un-do it. What more do you want? To be sure that there's no possible way for all the broken middleware out there to mess this up. Let me put it this way: out of all the times I've seen people post example WSGI 1 middleware code, I don't remember *any* where the middleware was actually complying with the spec correctly... and that includes examples I wrote myself. So I'm not real impressed with any solution that requires middleware to get it right. That having been said, I'm beginning to think that PEP 383 (surrogateescape) is actually the way to go, now that I've looked over the PEP, docs, and Ian's posts here about it. First, it's compatible with CGI (os.environ) right off the bat, as well as being the standard way to handle this sort of issue in Python 3. Second, it's redundancy-free: you don't need a separate environ key to know what's going on. Third, it's unconditional: if you want bytes or a non-UTF-8 encoding you perform the same steps every time. Up until now, I've not paid much attention because so many people kept saying you can't get surrogateescape on Python 2. However, that's only an issue for code that *needs the original byte string*, as the old codec error handler API is sufficient for doing decoding. (Meaning you could register a handler for it on older Pythons.) I think this approach would let us have our cake and eat it too, for the most part. WSGI on Python 2.x uses byte strings for these, and then 3.x works transparently. It's a bit of a stretch to call it a clarification of WSGI 1.0, but since for all intents and purposes WSGI doesn't really *run* on Python 3, it might be the way to go. 
To be clear, I'm talking about simply allowing (on Python 3 and in WSGI versions1.0) for all environ values to be utf-8-decoded, surrogate-escaped unicode values, in the native string case. (This would further imply that a CGI gateway would have to check whether the system encoding is UTF-8, and if not, transcode accordingly.)
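The remark that the codec error handler API is enough for the decoding half can be illustrated with codecs.register_error. The sketch below is written for Python 3 so it can be checked against the built-in handler; the handler name "demo_surrogateescape" and the function name are made up, and a Python 2 port would presumably need unichr() and ord() instead.

```python
import codecs


def escape_undecodable(exc):
    """Map each undecodable byte 0xNN to the lone surrogate U+DCNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only covers decoding
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(chr(0xDC00 + byte) for byte in bad)
    return replacement, exc.end  # resume decoding after the bad bytes


# Register under a hypothetical name so the built-in is left alone.
codecs.register_error("demo_surrogateescape", escape_undecodable)

raw = b"/caf\xe9/x"  # Latin-1 bytes, not valid UTF-8
decoded = raw.decode("utf-8", "demo_surrogateescape")
builtin = raw.decode("utf-8", "surrogateescape")
```

Encoding back to the original bytes is the part a custom handler alone does not cover, which matches the caveat about code that needs the original byte string.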
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
So the same standard should have different behavior on different Python versions? That would make framework code a lot more complicated.

I don't understand why it would be 'a lot more' complicated. (The following code snippets are Python 3 only, and assume we're using 'native strings' everywhere.)

In the gateway, environ would be populated this way:

    environ['some_key'] = some_value.decode('utf8', 'surrogateescape')

Compare that to the utf-8-then-latin-1 alternative:

    try:
        environ['some_key'] = some_value.decode('utf-8')
        environ['some_key.encoding'] = 'utf-8'
    except UnicodeError:
        environ['some_key'] = some_value.decode('latin-1')
        environ['some_key.encoding'] = 'latin-1'

What you would have in the application to get the original value:

    environ['some_key'].encode('utf8', 'surrogateescape')

With utf8-then-latin1:

    environ['some_key'].encode(environ['some_key.encoding'])

The 'surrogateescape' way is clearly simpler.

The 'equivalent' Python 2 code is even simpler:

    environ['some_key'] = some_value

And:

    environ['some_key']

-- Henry Prêcheur
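A runnable check of the surrogateescape round trip described above (Python 3.1 or later). The helper names gateway_decode and app_original_bytes are hypothetical; the point is only that the application can recover the exact bytes without a separate encoding key.

```python
def gateway_decode(raw):
    """Gateway side: bytes from the wire become a native string."""
    return raw.decode("utf-8", "surrogateescape")


def app_original_bytes(value):
    """Application side: recover exactly the bytes the gateway saw."""
    return value.encode("utf-8", "surrogateescape")


samples = [
    b"/plain",                 # ASCII
    "/café".encode("utf-8"),   # valid UTF-8
    b"/caf\xe9",               # Latin-1 bytes, invalid as UTF-8
]

round_tripped = [app_original_bytes(gateway_decode(raw)) for raw in samples]

# An application that knows the site is Latin-1 can re-decode itself.
latin1_text = app_original_bytes(gateway_decode(b"/caf\xe9")).decode("latin-1")
```

Valid UTF-8 input is usable directly as text; anything else carries lone surrogates until the application re-encodes and decodes it with the encoding it actually expects.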
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Henry Precheur wrote:
On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
So the same standard should have different behavior on different Python versions? That would make framework code a lot more complicated.

I don't understand why it would be 'a lot more' complicated. (The following code snippets are Python 3 only, and assume we're using 'native strings' everywhere.)

In the gateway, environ would be populated this way:

    environ['some_key'] = some_value.decode('utf8', 'surrogateescape')

Compare that to the utf-8-then-latin-1 alternative:

    try:
        environ['some_key'] = some_value.decode('utf-8')
        environ['some_key.encoding'] = 'utf-8'
    except UnicodeError:
        environ['some_key'] = some_value.decode('latin-1')
        environ['some_key.encoding'] = 'latin-1'

What you would have in the application to get the original value:

    environ['some_key'].encode('utf8', 'surrogateescape')

With utf8-then-latin1:

    environ['some_key'].encode(environ['some_key.encoding'])

The 'surrogateescape' way is clearly simpler.

It looks simpler until you have a site that is not primarily utf-8. In that case, you multiply your (1 line * number of middlewares in the WSGI stack * each request). With wsgi.uri_encoding you get either (1 line * 1 middleware designed to transcode * each request), or even 0 if your whole site uses just one charset.

Robert Brewer fuman...@aminus.org
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
It looks simpler until you have a site that is not primarily utf-8. In that case, you multiply your (1 line * number of middlewares in the WSGI stack * each request). With wsgi.uri_encoding you get either (1 line * 1 middleware designed to transcode * each request), or even 0 if your whole site uses just one charset.

I am not sure I understand your point. The 0 lines hold true if the whole site is using latin-1 or utf-8 and you write your applications/middlewares only for this site. But if it's using any other encoding you still have to transcode.

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode('utf8', 'surrogateescape').\
            decode(SITE_ENCODING)
        ...

With wsgi.uri_encoding you would still have to do the following:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode(environ['some_key.encoding']).\
            decode(SITE_ENCODING)
        ...

Of course you can directly use `environ['some_key']` if you know you'll get the 'right' encoding all the time. But when the encoding changes, you'll have to fix all your middlewares.

Am I missing something?

-- Henry Prêcheur
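For reference, a complete (if toy) version of the surrogateescape variant as wrapping middleware. Everything here is a sketch under assumptions: the class name TranscodePathInfo, the echo application, the call helper and the choice of iso-8859-1 as SITE_ENCODING are illustrative only, and WSGI callables take (environ, start_response) in that order.

```python
SITE_ENCODING = "iso-8859-1"  # assumption: the site's real URL encoding


class TranscodePathInfo:
    """Re-decode PATH_INFO from surrogate-escaped UTF-8 to the site encoding."""

    def __init__(self, app, site_encoding=SITE_ENCODING):
        self.app = app
        self.site_encoding = site_encoding

    def __call__(self, environ, start_response):
        raw = environ["PATH_INFO"].encode("utf-8", "surrogateescape")
        environ["PATH_INFO"] = raw.decode(self.site_encoding)
        return self.app(environ, start_response)


def echo_app(environ, start_response):
    """Toy application: send the path it sees back as UTF-8."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [environ["PATH_INFO"].encode("utf-8")]


def call(app, environ):
    """Minimal stand-in for a server, enough to exercise the stack."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


# The gateway saw Latin-1 bytes and stored them surrogate-escaped.
gateway_value = b"/caf\xe9".decode("utf-8", "surrogateescape")
status, body = call(TranscodePathInfo(echo_app), {"PATH_INFO": gateway_value})
```

Placed once at the top of the stack, this is one transcoding step per request; the cost argument in the thread is about what happens when every component has to repeat it.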
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Henry Precheur he...@precheur.org:
On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
It looks simpler until you have a site that is not primarily utf-8. In that case, you multiply your (1 line * number of middlewares in the WSGI stack * each request). With wsgi.uri_encoding you get either (1 line * 1 middleware designed to transcode * each request), or even 0 if your whole site uses just one charset.

I am not sure I understand your point. The 0 lines hold true if the whole site is using latin-1 or utf-8 and you write your applications/middlewares only for this site. But if it's using any other encoding you still have to transcode.

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode('utf8', 'surrogateescape').\
            decode(SITE_ENCODING)
        ...

With wsgi.uri_encoding you would still have to do the following:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode(environ['some_key.encoding']).\
            decode(SITE_ENCODING)
        ...

Of course you can directly use `environ['some_key']` if you know you'll get the 'right' encoding all the time. But when the encoding changes, you'll have to fix all your middlewares.

Am I missing something?

For one, we aren't talking about arbitrary keys needing this treatment. We are only talking about SCRIPT_NAME and PATH_INFO. Everything else from CGI will be passed as ISO-8859-1, and it is up to WSGI components/applications to explicitly worry about those if they need to deal with them in special ways. E.g., REQUEST_URI, QUERY_STRING, HTTP_COOKIE, HTTP_REFERER. Thus, your use of 'some_key' all the time is a bit confusing when just trying to scan the emails quickly.

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 03:26 PM 9/21/2009 -0700, Robert Brewer wrote:
It looks simpler until you have a site that is not primarily utf-8. In that case, you multiply your (1 line * number of middlewares in the WSGI stack * each request). With wsgi.uri_encoding you get either (1 line * 1 middleware designed to transcode * each request), or even 0 if your whole site uses just one charset.

Unless I'm misunderstanding something, you end up adding an extra "if" statement *everywhere*, to check whether wsgi.uri_encoding is what you want it to be or not.

(Btw, this whole notion of talking about WSGI "sites" also doesn't make sense, since WSGI doesn't have "sites", it has recursively-composable application objects. Sure, if you're using a monolithic framework, you can think of applications as unified entities, but that's not true of WSGI as a whole.)
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but HTTPbis currently takes the stance that they're ASCII, as in practice Latin-1 isn't used and may introduce interop problems.

http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-07#section-4.2

    Historically, HTTP has allowed field-content with text in the ISO-8859-1 [ISO-8859-1] character encoding (allowing other character sets through use of [RFC2047] encoding). In practice, most HTTP header field-values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD constrain their field-values to US-ASCII characters. Recipients SHOULD treat other (obs-text) octets in field-content as opaque data.

What does it mean to "support non-ASCII headers"? As per above, the only sane thing to do is treat them as opaque data, because you can't be certain of their encoding unless you have knowledge of the header.

On 21/09/2009, at 12:50 AM, Armin Ronacher wrote:
Also (something I haven't yet filed as a bug because I guess there will be more changes involved) the HTTP server in Python 3.1 does not support non-ASCII headers.

-- Mark Nottingham http://www.mnot.net/
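One way to treat non-ASCII header octets as opaque data while still holding them in a native string on Python 3 is to decode with iso-8859-1, which maps every byte to exactly one code point and back. The snippet below is a small check of that property, not a statement of what any server does; the header value used is invented for the example.

```python
# Every possible octet survives a Latin-1 decode/encode round trip.
all_octets = bytes(range(256))
as_text = all_octets.decode("iso-8859-1")
restored = as_text.encode("iso-8859-1")

# A made-up header value whose real encoding (UTF-8) the server cannot know.
wire_value = "naïve".encode("utf-8")
opaque = wire_value.decode("iso-8859-1")    # mojibake if shown, but lossless
recovered = opaque.encode("iso-8859-1").decode("utf-8")
```

Code that knows the header's real encoding can recover the text; code that does not can pass the string along unchanged.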
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
+1. There is no one answer for these issues (e.g., URI-IRI conversion can lose information), so low-level infrastructure like WSGI shouldn't be making choices for people. On 22/09/2009, at 5:31 AM, P.J. Eby wrote: At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote: I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request- URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably. The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it. Maybe I'm missing something here, but the only way I see to preserve composability here is to use latin-1 or bytes. The fundamental problem is that, like it or not, HTTP headers are actually byte strings. The *only* reason we ever supported unicode in WSGI was to handle platforms where there's no such thing as a non- unicode string, and there we made it explicit that it's just a way of manipulating *bytes*, not unicode. ISTM that very few (if any) of the proposals floating around for modifying WSGI are taking this concept into account. Most of them sound to me like people saying, yeah, but this particular hack will work for *my* apps... so everybody else must be doing something stupid. 
But WSGI was built on the principle of *equally inconveniencing everyone*, specifically to avoid an impossible attempt at consensus between incompatible ways of doing things. (E.g., nine million request/response APIs.) So, if the only problem we're going to cause by using bytes everywhere is to make everyone need to change their routing code on Python 3, I vote +1000. ;-) -- Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
"Most things" is not the Web. How will you handle serving images through WSGI? Compressed content? PDFs?

On 22/09/2009, at 1:30 AM, René Dudfield wrote:
here is a summary: Apart from python3 compatibility (which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode?

-- Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
+1 On Sep 21, 2009, at 8:28 PM, Mark Nottingham wrote: +1. There is no one answer for these issues (e.g., URI-IRI conversion can lose information), so low-level infrastructure like WSGI shouldn't be making choices for people. On 22/09/2009, at 5:31 AM, P.J. Eby wrote: At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote: I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request- URI's. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URI's to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably. The problem with this whole approach is that it's not composable. You can't stick in an application under a router that uses a different method for grokking its subtree of the URI space, unless it knows what's been done to the URI and can un-do it. Maybe I'm missing something here, but the only way I see to preserve composability here is to use latin-1 or bytes. The fundamental problem is that, like it or not, HTTP headers are actually byte strings. The *only* reason we ever supported unicode in WSGI was to handle platforms where there's no such thing as a non- unicode string, and there we made it explicit that it's just a way of manipulating *bytes*, not unicode. ISTM that very few (if any) of the proposals floating around for modifying WSGI are taking this concept into account. Most of them sound to me like people saying, yeah, but this particular hack will work for *my* apps... so everybody else must be doing something stupid. 
But WSGI was built on the principle of *equally inconveniencing everyone*, specifically to avoid an impossible attempt at consensus between incompatible ways of doing things. (E.g., nine million request/response APIs.) So, if the only problem we're going to cause by using bytes everywhere is to make everyone need to change their routing code on Python 3, I vote +1000. ;-) -- Mark Nottingham http://www.mnot.net/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net: Most things is not the Web. How will you handle serving images through WSGI? Compressed content? PDFs? You are perhaps misunderstanding something. A WSGI application still should return bytes. The whole concept of any sort of fallback to allow unicode data to be returned for response content was purely so the canonical hello world application as per Python 2.X could still be used on Python 3.X. So, we aren't saying that the only thing WSGI applications can return is unicode strings for response content. Have you read my original blog post that triggered all this discussion this time around? Graham On 22/09/2009, at 1:30 AM, René Dudfield wrote: here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode? -- Mark Nottingham http://www.mnot.net/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Reference? On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote: 2009/9/22 Mark Nottingham m...@mnot.net: Most things is not the Web. How will you handle serving images through WSGI? Compressed content? PDFs? You are perhaps misunderstanding something. A WSGI application still should return bytes. The whole concept of any sort of fallback to allow unicode data to be returned for response content was purely so the canonical hello world application as per Python 2.X could still be used on Python 3.X. So, we aren't saying that the only thing WSGI applications can return is unicode strings for response content. Have you read my original blog post that triggered all this discussion this time around? Graham On 22/09/2009, at 1:30 AM, René Dudfield wrote: here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode? -- Mark Nottingham http://www.mnot.net/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com -- Mark Nottingham http://www.mnot.net/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Mark Nottingham m...@mnot.net: Reference? See: http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html Anyone else jumping in on this conversation with their own opinions and who has not read it, should perhaps at least read that. Also read some of the earlier posts in the numerous discussions this spawned at: http://groups.google.com/group/python-web-sig?lnk= as the current thinking isn't exactly what I blogged about and has shifted a bit as the discussion has progressed. Graham On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote: 2009/9/22 Mark Nottingham m...@mnot.net: Most things is not the Web. How will you handle serving images through WSGI? Compressed content? PDFs? You are perhaps misunderstanding something. A WSGI application still should return bytes. The whole concept of any sort of fallback to allow unicode data to be returned for response content was purely so the canonical hello world application as per Python 2.X could still be used on Python 3.X. So, we aren't saying that the only thing WSGI applications can return is unicode strings for response content. Have you read my original blog post that triggered all this discussion this time around? Graham On 22/09/2009, at 1:30 AM, René Dudfield wrote: here is a summary: Apart from python3 compatibility(which should be good enough reason), utf-8 is what's used in http a lot these days. Most things layered on top of wsgi are using utf-8 (django etc), and lots of web clients are using utf-8 (firefox etc). Why not move to unicode? -- Mark Nottingham http://www.mnot.net/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com -- Mark Nottingham http://www.mnot.net/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Armin is fast asleep now, so it's my shift. :-)

He did point me to this specific email for closer attention, indicating issues with QUERY_STRING and wsgi.uri_encoding due to something mentioned here. I didn't quite get what he was talking about, but then I believe he has some wrong statements in his PEP-XXX about QUERY_STRING. I'll make a few of my own comments about this email, and then maybe those who are still awake can help in understanding the issues raised here.

2009/9/22 And Clover and...@doxdesk.com:

Armin Ronacher wrote: The middleware can never know.

It's much more likely to know than the server though!

WSGI will demand UTF-8 URLs and only provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs break as soon as they coincidentally happen to be UTF-8 byte sequences. I'm as much an advocate of UTF-8 for everything everywhere! as anyone else, but unfortunately today there are still dark places where you need non-UTF-8 URLs.

The URLs don't break. As mentioned elsewhere, but perhaps not made overly clear, if it is known that an application or some subset of URLs will always be receiving requests as non UTF-8, then it should employ code in those cases to always transcode to the required encoding. Thus something like:

    import codecs

    iso_8859_7 = codecs.lookup('iso-8859-7')

    def redecode(string, encoding):
        return string.encode(encoding).decode('iso-8859-7')

    if codecs.lookup(environ['wsgi.uri_encoding']) != iso_8859_7:
        environ['PATH_INFO'] = redecode(environ['PATH_INFO'], environ['wsgi.uri_encoding'])
        environ['SCRIPT_NAME'] = redecode(environ['SCRIPT_NAME'], environ['wsgi.uri_encoding'])
        environ['wsgi.uri_encoding'] = 'iso-8859-7'

This could be a part of the actual application if needing to be selective based on URLs, or a WSGI middleware that adjusts it and wraps the WSGI application.
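A minimal sketch of the middleware variant of the above, for Python 3. The `wsgi.uri_encoding` key is the proposal under discussion in this thread, not part of any released WSGI specification; the class name `UriRedecoder` and the demo application are hypothetical.

```python
import codecs


class UriRedecoder:
    """Hypothetical middleware: re-decode URI variables to one target encoding."""

    def __init__(self, app, target='iso-8859-7'):
        self.app = app
        self.target = target

    def __call__(self, environ, start_response):
        current = environ.get('wsgi.uri_encoding')
        # Only transcode when the server reported which encoding it used
        # and that encoding differs from the one this application expects.
        if current and codecs.lookup(current).name != codecs.lookup(self.target).name:
            for key in ('SCRIPT_NAME', 'PATH_INFO'):
                raw = environ.get(key, '').encode(current)
                environ[key] = raw.decode(self.target)
            environ['wsgi.uri_encoding'] = self.target
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [environ['PATH_INFO'].encode('utf-8')]


# Simulate a server that fell back to ISO-8859-1 for the bytes b'/\xe1'.
environ = {'SCRIPT_NAME': '',
           'PATH_INFO': b'/\xe1'.decode('iso-8859-1'),
           'wsgi.uri_encoding': 'iso-8859-1'}
body = UriRedecoder(demo_app)(environ, lambda status, headers: None)
```

Wrapping the whole application this way keeps the transcoding in one place; being selective per URL would mean applying the wrapper only to the relevant sub-applications.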
The other fallback is that a specific WSGI server could elect to provide an option to not use 'UTF-8' as the first choice for decoding and instead use a user supplied value via the WSGI server's configuration. Robert already showed as pseudo code what the WSGI server would do:

    try:
        decode_uri(userdefault or 'utf-8')
    except UnicodeDecodeError:
        decode_uri('iso-8859-1')

For a pure Python WSGI server, which effectively only supports mounting at the root of the site, this may apply to the whole site. In Apache/mod_wsgi however, where using the Location directive in Apache one can easily apply configuration to a subset of URLs, one could be more selective. It gets more complicated when one talks about composition of disparate WSGI components as part of an application stack.

Now, although having the configuration be done outside of the WSGI application and in the web server will not appeal to some, it still may be a useful fallback for where people don't want to have to fiddle with using WSGI middleware wrappers around their whole application or around individual components to do it. Anyway, there are multiple options here.

Incidentally, if wsgi.uri_encoding is going to be the way to signal that the server has decoded bytes to characters using a known encoding, it should be stressed that this should only be set when that encoding is certain. That is, wsgi.uri_encoding should be omitted (or None?) in cases where another party has already decoded (and maybe mangled) the bytes using an unknown encoding. In particular, CGI.

Yes, it is known that CGI and Python 3.X will be a problem. There have been a number of discussions which raised the CGI issues in the past. This time around we were possibly ignoring it for the time being so that CGI script compatibility wasn't going to exclusively override us trying to make something that would work sanely for more up to date hosting methods.
So, yes, having wsgi.uri_encoding be set to None for where not able to be determined what encoding is would be sensible. It may be the case that in such situations the only thing people can portably rely on is being able to use ASCII. If they know for sure what is used, they could set wsgi.uri_encoding themselves in a WSGI middleware wrapper around their application, or CGI/WSGI adapter could provide an option to allow user to set it so WSGI adapter uses user value but otherwise leaves the variables as they were. (In the case of Windows CGI the server will have decoded URI bytes into Unicode characters, using a charset which it is impossible to find out. In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF sequence, otherwise it's the system codepage. This problem affects the non-CGI implementation isapi_wsgi, too. Then the variables are read as environment variables, which for Python 2 means another encode/decode step on Windows using the system codepage, mangling non-codepage characters. Python 3 has the opposite problem reading byte envvars using
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Henry Precheur wrote:

On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:

It looks simpler until you have a site that is not primarily utf-8. In that case, you multiply your (1 line * number of middlewares in the WSGI stack * each request). With wsgi.uri_encoding you get either (1 line * 1 middleware designed to transcode * each request), or even 0 if your whole site uses just one charset.

I am not sure I understand your point. The 0 lines hold true if the whole site is using latin-1 or utf-8 and you write your applications/middlewares only for this site. But if it's using any other encoding you still have to transcode.

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode('utf8', 'surrogateescape').\
            decode(SITE_ENCODING)
        ...

Yes; you have to transcode to the correct encoding. Once. Then every other WSGI application interface below that one doesn't have to care.

With wsgi.uri_encoding you would still have to do the following:

    def middleware(environ, start_response):
        value = environ['some_key'].\
            encode(environ['some_key.encoding']).\
            decode(SITE_ENCODING)
        ...

Of course you can directly use `environ['some_key']` if you know you'll get the 'right' encoding all the time. But when the encoding changes, you'll have to fix all your middlewares.

The decoding doesn't change spontaneously. You either get the correct one or you get an incorrect one. If it's incorrect, you fix it, one time, via a WSGI component which you've configured to determine the correct decoding. Then every other WSGI component below that one can go back to trusting the decoding was correct. In fact, if you do that transcoding right away, no other WSGI components need to be rewritten to take advantage of unicode. You just have to deploy a single transcoder, that's 6 lines of code max.
I know PJE will chime in here and say you can't deploy a website that works differently if you happen to forget to turn on a given piece of middleware, but I also know the rest of you will drown him out from personal experience because you've *never* done that. ;) With utf8+surrogateescape, you don't transcode once, you transcode in every WSGI component in your stack that needs to correct the decoding. You have to do it more than once because, each time you encode/re-decode, you use the result and then throw it away. Any subsequent WSGI components have to encode/re-decode--you cannot store the redecoded URI in SCRIPT_NAME/PATH_INFO, because the utf8+surrogateescape scheme says...well, it's always utf8-decoded. In addition, *every* component that needs to compare URI's then has to be configured with the same logic, however convoluted, to perform the correct decoding again. It's not just routing middleware: caches need to reliably compare decoded URI's; so do sessions; so does auth (especially!); so do static files. And Heaven forfend you actually decode differently in two different components! Robert Brewer fuman...@aminus.org ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
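To illustrate why components have to agree on the decoding: the same request bytes yield different strings under different decodings, so a cache key or an auth rule built in one component will not match in another. A small Python 3 demonstration; the byte string and the variable names are made up for the example.

```python
raw = b'/priv\xe9'  # path bytes as sent by a client using ISO-8859-1

as_latin1 = raw.decode('iso-8859-1')
as_utf8_escaped = raw.decode('utf-8', 'surrogateescape')

# The two components now hold different strings for the same URI.
same_string = (as_latin1 == as_utf8_escaped)

# Both decodings are reversible, so either side can recover the bytes
# and re-decode, but only if it knows how the other side decoded.
recovered_a = as_latin1.encode('iso-8859-1')
recovered_b = as_utf8_escaped.encode('utf-8', 'surrogateescape')
```

Comparing the recovered bytes is safe in both cases; comparing the decoded strings is only safe when every component used the same decoding.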
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:

The decoding doesn't change spontaneously. You either get the correct one or you get an incorrect one. If it's incorrect, you fix it, one time, via a WSGI component which you've configured to determine the correct decoding. Then every other WSGI component below that one can go back to trusting the decoding was correct. In fact, if you do that transcoding right away, no other WSGI components need to be rewritten to take advantage of unicode. You just have to deploy a single transcoder, that's 6 lines of code max.

And you can do that with utf8+surrogateescape too. Except that you don't have to determine what encoding the gateway sent you, it's always utf8+surrogateescape.

With utf8+surrogateescape, you don't transcode once, you transcode in every WSGI component in your stack that needs to correct the decoding. You have to do it more than once because, each time you encode/re-decode, you use the result and then throw it away. Any subsequent WSGI components have to encode/re-decode--you cannot store the redecoded URI in SCRIPT_NAME/PATH_INFO, because the utf8+surrogateescape scheme says...well, it's always utf8-decoded.

You are missing something REALLY important about surrogateescape: you can ALWAYS get the original bytes back.

    >>> b = b'fran\xe7cois'
    >>> s = b.decode('utf8', 'surrogateescape')
    >>> s
    'fran\udce7cois'
    >>> s.encode('utf8', 'surrogateescape')
    b'fran\xe7cois'

See? I got my latin-1 byte '\xe7' back! '\udce7' is not a regular character: it is a lone surrogate code point from the range U+DC80 to U+DCFF, which surrogateescape uses to carry bytes that could not be decoded. The only thing you have to do is to pass 'surrogateescape' each time you call encode/decode.

In addition, *every* component that needs to compare URI's then has to be configured with the same logic, however convoluted, to perform the correct decoding again. It's not just routing middleware: caches need to reliably compare decoded URI's; so do sessions; so does auth (especially!); so do static files. And Heaven forfend you actually decode differently in two different components!

I don't understand why I would need to throw away the decoded string. This works perfectly well as far as I know:

    environ['PATH_INFO'] = environ['PATH_INFO'].\
        encode('utf8', 'surrogateescape').\
        decode(SITE_ENCODING)

utf8+surrogateescape provides the same possibilities as wsgi.uri_encoding. You can transcode without losing information when you know what the correct encoding is. But utf8+surrogateescape is simpler because there's no need to pass around the name of the encoding in an additional variable.

-- Henry Prêcheur
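A runnable Python 3.1+ version of the in-place transcoding described above. It assumes the gateway decoded the path with utf8+surrogateescape, which is the scheme being proposed here rather than anything WSGI requires; `SITE_ENCODING`, `transcode_path` and the sample environ are placeholders.

```python
SITE_ENCODING = 'iso-8859-1'  # placeholder: whatever the site really uses


def transcode_path(environ, site_encoding=SITE_ENCODING):
    """Replace PATH_INFO with a version decoded using the site's encoding."""
    original_bytes = environ['PATH_INFO'].encode('utf-8', 'surrogateescape')
    environ['PATH_INFO'] = original_bytes.decode(site_encoding)
    return environ


# What a utf8+surrogateescape gateway would hand over for b'/fran\xe7ois'.
environ = {'PATH_INFO': b'/fran\xe7ois'.decode('utf-8', 'surrogateescape')}
before = environ['PATH_INFO']
transcode_path(environ)
after = environ['PATH_INFO']
```

After the call, PATH_INFO no longer follows the utf8+surrogateescape convention, so components further down the stack have to be aware that it was transcoded.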
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 07:40 PM 9/21/2009 -0700, Robert Brewer wrote: Yes; you have to transcode to the correct encoding. Once. Then every other WSGI application interface below that one doesn't have to care. You can only do that if you *break encapsulation*, which as I said earlier is voiding the entire point of having a modular interface. Having a configurable encoding just means that *every* WSGI application *must* verify the encoding in order to be safe. I'm all in favor of making everyone suffer equally, but all else being equal, I'd prefer them to suffer idempotently rather than conditionally. ;-)
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 07:21 PM 9/21/2009 -0700, Robert Brewer wrote: I've never proposed that WSGI make choices for people. I'm simply saying that a configurable server, with a sane, perfectly-reversible default, is the simplest thing that could possibly work. Actually, latin-1 bytes encoding is the *simplest* thing that could possibly work, since it works already in e.g. Jython, and is actually in the spec already... and any framework that wants unicode URIs already has to decode them, so the code is already written.
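The latin-1 approach can be sketched as follows on Python 3. Since ISO-8859-1 maps every byte to exactly one code point, the server-side decode never fails and the application can always recover the raw bytes. The helper name `path_as_unicode` and the UTF-8 target are assumptions for the example.

```python
def path_as_unicode(environ, encoding='utf-8'):
    """Hypothetical helper: recover real text from a latin-1 'native string'."""
    raw = environ['PATH_INFO'].encode('iso-8859-1')  # exact original bytes
    return raw.decode(encoding)


# Server side: bytes from the wire become a str without any guessing.
wire_bytes = b'/fran\xc3\xa7ois'
environ = {'PATH_INFO': wire_bytes.decode('iso-8859-1')}

path = path_as_unicode(environ)
```

The value stored in the environ is not meaningful text on its own; it is only a container for bytes until the application re-decodes it.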
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton graham.dumple...@gmail.com wrote: Of course you can directly use `environ['some_key']` if you know you'll get the 'right' encoding all the time. But when the encoding changes, you'll have to fix all your middlewares. I am missing something? For one, we aren't talking about arbitrary keys needing this treatment. We are only talking about SCRIPT_NAME and PATH_INFO. OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and introduce two equivalent variables that hold the NOT url-decoded values. So if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'. This will be quite disruptive, as these are variables that are frequently accessed directly (libraries that expose them as attributes can just turn them into properties that do URL decoding, using UTF8). But it's an easy fix at least. I would actually want to specify that if we added this key, we should disallow the old keys -- terrible confusion could ensue from both in the environ. This also fixes the problem with not being able to distinguish %2F from /, which isn't a big problem but is annoying, and is hiding meaningful information. (I believe the relevant spec does distinguish between these two values -- i.e., ideally decoding should happen on path segments, each segment separated by a real /.) If we do that, then the only really tricky thing left is HTTP_COOKIE, and since the Cookie header is a mess then HTTP_COOKIE will be a mess and we just have to figure out a hacky way to deal with that. Maybe surrogateescape, but probably just Latin1 would be fine (and easy to do in Python 2). -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
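A sketch of the per-segment decoding that a raw, still percent-encoded path would allow, written for Python 3. The key name `PATH_INFO_RAW` comes from the proposal above and is not part of any WSGI specification; `decode_segments` is a hypothetical helper, and UTF-8 is used as the default as suggested.

```python
from urllib.parse import unquote


def decode_segments(raw_path, encoding='utf-8'):
    """Split on literal '/' first, then percent-decode each segment."""
    return [unquote(segment, encoding=encoding, errors='strict')
            for segment in raw_path.split('/')]


environ = {'PATH_INFO_RAW': '/files/a%2Fb/fran%C3%A7ois'}
segments = decode_segments(environ['PATH_INFO_RAW'])
```

Here the %2F survives as part of a single segment ('a/b') instead of being confused with a path separator, which is the information that an already-decoded PATH_INFO hides.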
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Ian Bicking i...@colorstudy.com: On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton graham.dumple...@gmail.com wrote: Of course you can directly use `environ['some_key']` if you know you'll get the 'right' encoding all the time. But when the encoding changes, you'll have to fix all your middlewares. I am missing something? For one, we aren't talking about arbitrary keys needing this treatment. We are only talking about SCRIPT_NAME and PATH_INFO. OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and introduce two equivalent variables that hold the NOT url-decoded values. So if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'. This will be quite disruptive, as these are variables that are frequently accessed directly (libraries that expose them as attributes can just turn them into properties that do URL decoding, using UTF8). But it's an easy fix at least. I would actually want to specify that if we added this key, we should disallow the old keys -- terrible confusion could ensue from both in the environ. This also fixes the problem with not being able to distinguish %2F from /, which isn't a big problem but is annoying, and is hiding meaningful information. (I believe the relevant spec does distinguish between these two values -- i.e., ideally decoding should happen on path segments, each segment separated by a real /.) If we do that, then the only really tricky thing left is HTTP_COOKIE, and since the Cookie header is a mess then HTTP_COOKIE will be a mess and we just have to figure out a hacky way to deal with that. Maybe surrogateescape, but probably just Latin1 would be fine (and easy to do in Python 2). That may be fine for pure Python web servers where you control the split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as that is done by the web server. 
Also, as pointed out in my blog, because of rewrites in web server, it may be difficult to try and map SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and reclaim original characters. There is also the problem that often FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and manual overrides needed to tweak them. Graham ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton graham.dumple...@gmail.com wrote: That may be fine for pure Python web servers where you control the split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as that is done by the web server. Also, as pointed out in my blog, because of rewrites in web server, it may be difficult to try and map SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and reclaim original characters. There is also the problem that often FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and manual overrides needed to tweak them. When things get messed up I recommend people use a middleware (paste.deploy.config.PrefixMiddleware, though I don't really care what they use) to fix up the request to be correct. Pulling it from REQUEST_URI would be fine. Also, at worst, you can do environ['SCRIPT_NAME_RAW'] = urllib.quote(environ.pop('SCRIPT_NAME')). It sucks, but if that's all the information you have, then that's all the information you have. Or try to get the information from REQUEST_URI the hard way, once at the gateway level. -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
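On Python 3 the worst-case fallback above would use `urllib.parse.quote` (the Python 2 `urllib.quote` moved there). A minimal sketch, assuming the environ value is a latin-1 decoded native string; `SCRIPT_NAME_RAW` is the name made up in this thread and does not exist in WSGI, and `add_raw_script_name` is a hypothetical helper.

```python
from urllib.parse import quote


def add_raw_script_name(environ):
    """Rebuild a percent-encoded SCRIPT_NAME from the decoded value."""
    decoded = environ.pop('SCRIPT_NAME')
    # encoding='latin-1' turns each code point back into its original byte.
    # quote() leaves '/' unescaped by default, so an original %2F is lost.
    environ['SCRIPT_NAME_RAW'] = quote(decoded, encoding='latin-1')
    return environ


environ = add_raw_script_name({'SCRIPT_NAME': '/app/fran\xe7ois'})
```

As the message says, this only reconstructs what can be reconstructed: the distinction between a literal '/' and an escaped one cannot be recovered from the decoded value.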
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/22 Ian Bicking i...@colorstudy.com: On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton graham.dumple...@gmail.com wrote: That may be fine for pure Python web servers where you control the split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as that is done by the web server. Also, as pointed out in my blog, because of rewrites in web server, it may be difficult to try and map SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and reclaim original characters. There is also the problem that often FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and manual overrides needed to tweak them. When things get messed up I recommend people use a middleware (paste.deploy.config.PrefixMiddleware, though I don't really care what they use) to fix up the request to be correct. Pulling it from REQUEST_URI would be fine. Also, at worst, you can do environ['SCRIPT_NAME_RAW'] = urllib.quote(environ.pop('SCRIPT_NAME')). It sucks, but if that's all the information you have, then that's all the information you have. Or try to get the information from REQUEST_URI the hard way, once at the gateway level. Probably doable to just reverse it using underlying raw bytes. At least in mod_wsgi the SCRIPT_NAME/PATH_INFO split is always correct, unless people really screw it up by using WSGIScriptAliasMatch or AliasMatch wrongly. If doing something like you suggest, would prefer them as 'wsgi.' prefixed variables and not put in all upper case namespace to be confused with CGI variables etc. Graham Graham ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
On Tue, Sep 22, 2009 at 12:38 AM, Graham Dumpleton graham.dumple...@gmail.com wrote: If doing something like you suggest, would prefer them as 'wsgi.' prefixed variables and not put in all upper case namespace to be confused with CGI variables etc. I just had to make up a name, but I agree with your suggestion for wsgi.X (we already have wsgi.url_scheme, after all). -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 02:30 PM 9/22/2009 +1000, Graham Dumpleton wrote: Someone did say something about being able to half make it work on Python 2.X. Can someone properly provide example code for Python 2.X. The issue is that error handlers on encode are only allowed to provide substitute unicode characters, not substitute bytes. That's why it can only half work on 2.x. If we want uniformity in how interface works on Python 2.X and 3.X, they we have to be able to use same method without tricks. This is why wsgi.uri_encoding at the moment seems better, as not reliant on a feature only in Python 3.1+. If we want uniformity in the interface, then we should continue using latin-1, which already works today. Yes, it sucks, but it sucks *uniformly*. There really isn't going to be a solution that satisfies *all* of the criteria we're batting around, for *all* the users. What's happening is that the principals are focused on different scenarios, where all their criteria can be met at the expense of others'. I'm tending to flip-flop a bit myself, because my goal is that *nobody* wins, in the sense of having an advantaged framework, server, programming paradigm, etc. relative to others. And that means there are more ways of doing it that would be acceptable to me. For example, all bytes, all latin-1, all surrogateescape... I don't care all that terribly much between them, I just want it to be uniform for everybody using/implementing the spec. (And that also means I want it uniform across all keys, not just the URI ones; I don't want to have to remember which ones are special cases.) If some people need to do more code because of their particular codec requirements, that's okay by me, as long as it's *unconditional* code that doesn't depend on some sort of configuration rigamarole. That makes the spec brittle, because nobody's going to test their edge cases, and then the consumers of the code are gonna be the ones getting screwed over. 
Frankly, 90% of WSGI code written will never even check the wsgi.version number, so why would we think anybody's going to actually check wsgi.url_encoding? That's just building in the suck from day one. No offense intended to the proposer of it; it's a fine solution for a single project's API, but it's just not going to scale. We already know this, because most WSGI code written is not to spec. The ones of us here in the room talking about this are *not* good examples of average WSGI programmers, because (hopefully) we've all at least studied the spec and endeavored to fully grok and conform to it. (Hell, an unfortunately large number of people think you're supposed to use write() or yield to send *individual lines* of text.) So you better believe that everybody else is going to copy the worst available examples of other people's WSGI code and ignore any documentation associated with it... and then they will expect it to work on your server. ;-) Thus, our target audience is people who will rotely copy... which means we need an API they can either copy by rote, or know is wrong when they get an error message. Conditionals and error handling are too much to ask of them, as is remembering different rules for different environ keys that all kind of look alike. (There's a reason we required ALL_CAPS keys to be the same type in the first spec.) ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
[Web-SIG] Request for Comments on upcoming WSGI Changes
Hello everybody,

Thanks to Graham Dumpleton and Robert Brewer there is some serious progress on WSGI currently. I proposed a roadmap with some PEP changes now that need some input.

Summary:

  WSGI 1.0 stays the same as PEP 0333 currently is
  WSGI 1.1 becomes what Ian and I added to PEP 0333
  WSGI 2.0 becomes a unicode powered version of WSGI 1.1
  WSGI 3.0 becomes WSGI 2.0 just without start_response

WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach.

The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/

Neither the wording nor the changes in there are anywhere near final.

Graham wrote down two questions he wants every major framework developer to answer. These should guide the way to new WSGI standards:

1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X?

2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X?

I added a new question I think should be asked too:

3. Do we skip WSGI 2.0 as specified in the PEP and go straight to WSGI 3.0 and drop start_response?

The following things became pretty clear when playing around with various specifications on Python 3:

- Python 3 no longer implicitly converts between unicode and byte strings. This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib.

- The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.)

- A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again.

- unicode support can be added for WSGI on both Python 2.x and Python 3.x without removing functionality. Browsers are already doing a similar encoding trick as proposed by Graham Dumpleton to handle URLs.

- Python 2.x already accepts unicode strings for many things such as URL handling thanks to the fact that unicode and byte strings are surprisingly interchangeable.

- cgi.FieldStorage and some other parts are now totally broken on Python 3 and should no longer be used in 3.0 and 3.1 because it reads the request body into memory. This currently affects WebOb, Pylons and TurboGears.

I sent this mail to every major framework / WSGI implementor so that we get input even if you're missing the discussion on web-sig.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote: Hello everybody, Thanks to Graham Dumpleton and Robert Brewer there is some serious progress on WSGI currently. I proposed a roadmap with some PEP changes now that need some input. Summary: WSGI 1.0 stays the same as PEP 0333 currently is WSGI 1.1 becomes what Ian and I added to PEP 0333 WSGI 2.0 becomes a unicode powered version of WSGI 1.1 WSGI 3.0 becomes WSGI 2.0 just without start_response Since there's already a well-established notion of WSGI 2.0 being the new calling convention, I would suggest (to avoid confusion) renaming your 2.0 to 1.2 or 1.5 or something instead. WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach. This is unfortunate, but it should probably be considered a bellwether for Python 3 porting in general, alas. The Python 3 stdlib *should* work with bytes, and the fact that it does not should be treated as a bug in the stdlib rather than something to be worked around in WSGI. Graham wrote down two questions he wants every major framework developer to be answered. These should guide the way to new WSGI standards: 1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X? Technically, we are not using bytes but native strings, i.e. type 'str'. What benefit would introducing unicode produce? 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X? This discussion has been going on for so long that I've already forgotten what the problem was with just using the original 1.0 spec for 3.X, i.e., using native strings for everything, using latin-1 encoding. The only things I can recall off the top of my head are that the input stream would still be bytes, and that the environment might've used a different encoding. 
I don't know if such an approach should actually be *recommended*, but having a migration path for WSGI 1.0- Python 3.X sounds like a good idea, if it can be done strictly as errata/clarification of the existing spec. Otherwise, might as well forget the whole thing and go straight to the latest and greatest (i.e. what has previously been called 2.0 and you're calling 3.0.) I added a new question I think should be asked too: 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to WSGI 3.0 and drop start_response? I suggest skipping straight to the latest and greatest with no in-betweens at all, other than errata/clarifications on 1.0. Having lots of variations of a standard is a bug, not a feature! The following things became pretty clear when playing around with various specifications on Python 3: - Python 3 no longer implicitly converts between unicode and byte strings. This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib. - The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.) - A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again. IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage? ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

P.J. Eby wrote:
> This discussion has been going on for so long that I've already forgotten what the problem was with just using the original 1.0 spec for 3.X, i.e., using native strings for everything, using latin-1 encoding. The only things I can recall off the top of my head are that the input stream would still be bytes, and that the environment might've used a different encoding.

Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and many more technologies are based on unicode, even in Python 2.x. They are currently doing the decoding of byte data internally. In Python 2.x, if we stick to native strings for WSGI 2.0 / 1.5 / whatever, we suddenly have different code paths for Python 3 and Python 2, because in Python 3 we suddenly already have unicode data. You're assuming a situation where the application in Python 2.x was byte based, but in the majority of cases that is not the situation.

> IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage?

I'm already creating a patch for urllib, which currently requires unicode. I'm not sure what to do about cgi.FieldStorage; in general I would not recommend using the cgi module for WSGI applications at all! If we went with bytes for the WSGI 1.0 spec on Python 3, a WSGI server would also have to decode that data from the server again. Also (something I haven't yet filed as a bug because I guess there will be more changes involved) the HTTP server in Python 3.1 does not support non-ASCII headers.

Regards,
Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
At 04:50 PM 9/20/2009 +0200, Armin Ronacher wrote:
> Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and many more technologies are based on unicode, even in Python 2.x. They are currently doing decoding of byte data internally. In Python 2.x if we stick to native strings for WSGI 2.0 / 1.5 whatever we suddenly have different code paths for Python 3 and Python 2. Because in Python 3 we suddenly already have unicode data.

No, you'd have bytes stored in a latin-1 string, which is not quite the same thing as "already [having] unicode data". You have to .encode('latin1').decode(targetencoding) if you want genuine unicode data.

If you're saying that people's code would have to change when they go to Python 3 (i.e., adding the extra .encode()), I think that's already a given for *any* non-trivial code, not just WSGI.

>> IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage?
>
> I'm already creating a patch for urllib which currently requires unicode. I'm not sure about what to do with cgi.FieldStorage, in general I would not recommend using the cgi module for WSGI applications at all!

But people do, in fact, use it for WSGI on 2.x, so if having different code paths is a problem, certainly dropping the cgi module is at least as big of a problem, if not considerably more so.

I think one of the reasons that the current (and ongoing) PEP discussions have been foundering is that there isn't a clear delineation of goals at the high level, and rather just a bunch of tradeoff discussions, absent any criteria by which to make the tradeoffs.

To me, I'd rather see people port to a new WSGI spec (with a new calling convention) on Python 2, and only *then* transition to Python 3. If we do that well, then the intermediate pain disappears -- as does the pain and complexity of trying to make a bastardized in-between specification. ;-)

Truth be told, we can probably do that new spec *faster* if we don't have to worry too much about backward compatibility, and just design it for the way things are now, instead of worrying about the past. Even if we have to do some odd things inside a 2-to-1 converter, there should ideally only have to be a handful of such converters ever written.
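The distinction drawn above between "bytes stored in a latin-1 string" and genuine unicode data can be illustrated with a short Python 3 sketch. The sample path and the helper name recode are made up for this example, and UTF-8 is only assumed as the target encoding.

```python
# Raw bytes as they might arrive from a client: "/café" encoded as UTF-8.
raw = b"/caf\xc3\xa9"

# What a latin-1 "native string" would hold on Python 3: one code point
# per byte, so the text is not yet readable.
native = raw.decode("latin-1")


def recode(value, target_encoding="utf-8"):
    """Recover real text from a latin-1 'bytes-in-str' value (hypothetical helper)."""
    return value.encode("latin-1").decode(target_encoding)


# The latin-1 step is lossless, so the original bytes can always be recovered.
assert native.encode("latin-1") == raw
# The native string is not the intended text until it has been re-decoded.
assert native != "/café"
assert recode(native) == "/café"
```

The round trip works because latin-1 maps every byte value to exactly one code point, which is why it can serve as a transport encoding even when the real encoding is something else.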
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby wrote:
>> - Python 3 no longer implicitly converts between unicode and byte strings. This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib.
>> - The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.)
>> - A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again.
>
> IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage?

FWIW, it's very much possible that the py3k stdlib is broken there. Many modules were ported with the aim of getting the tests running again, without too much thought about bytes/unicode issues.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
2009/9/21 Armin Ronacher armin.ronac...@active-4.com:
>> IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage?
>
> I'm already creating a patch for urllib which currently requires unicode. I'm not sure about what to do with cgi.FieldStorage, in general I would not recommend using the cgi module for WSGI applications at all! If we would go with bytes for the WSGI 1.0 spec on Python 3 a WSGI server also has to decode that data from the Server again. Also (something I haven't yet filed as a bug because I guess there will be more changes involved) the HTTP server in Python 3.1 does not support non-ASCII headers.

Read the following first:

  http://bugs.python.org/issue4953
  http://bugs.python.org/issue4661

Those are the ones I know about that affect cgi.FieldStorage.

Graham
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Armin Ronacher wrote:
> Thanks to Graham Dumpleton and Robert Brewer there is some serious progress on WSGI currently. I proposed a roadmap with some PEP changes now that need some input.
>
> Summary:
>
>   WSGI 1.0 stays the same as PEP 0333 currently is
>   WSGI 1.1 becomes what Ian and I added to PEP 0333
>   WSGI 2.0 becomes a unicode powered version of WSGI 1.1
>   WSGI 3.0 becomes WSGI 2.0 just without start_response
>
> WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach.
>
> The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/
>
> Neither the wording nor the changes in there are anywhere near final.
>
> Graham wrote down two questions he wants every major framework developer to answer. These should guide the way to new WSGI standards:
>
> 1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X?

I'm happy either way, since CherryPy abstracts it all away. Decide already and I'll implement it.

> 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X?

+1 for skipping straight to unicode in Python 3. But call it 1.1 not 2.0.

> I added a new question I think should be asked too:
>
> 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to WSGI 3.0 and drop start_response?

No. We need more time to discuss and try to implement the large architectural changes in that. I need to ship CP 3.2 soon and would like it to have a better Python 3 story than the bytes-everywhere (or unicode pretending to be bytes) of WSGI 1.0. We have working code, which uses unicode in Python 3. Maybe I'll call it wsgi.version = (1, 'cp32') and let the spec come later if we can't see the trees for the forest.

Robert Brewer
fuman...@aminus.org
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby wrote:
> At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote:
>> The following things became pretty clear when playing around with various specifications on Python 3:
>>
>> - Python 3 no longer implicitly converts between unicode and byte strings. This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib.
>> - The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.)
>> - A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again.
>
> IMO, this strongly suggests that it's the stdlib or Python 3 that's broken here. How much of the stdlib are we talking about needing to reimplement, aside from cgi.FieldStorage?

urllib.unquote, for one. We had to make a version which accepts bytes (and outputs bytes). But it's only 8 lines of code.

Robert Brewer
fuman...@aminus.org
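To give a sense of scale for a bytes-in, bytes-out unquote, here is a minimal sketch. It is not the CherryPy code referred to above; the function name unquote_bytes and the regex-based approach are assumptions chosen for illustration. Current Python 3 releases also ship urllib.parse.unquote_to_bytes for the same job.

```python
import re

# Matches a percent sign followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unquote_bytes(value):
    """Replace %XX escapes in a bytes object; input and output are both bytes."""
    # Each match is turned into the single byte it encodes. Malformed
    # escapes (for example b"%zz") do not match and are left untouched.
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), value)
```

Because nothing is decoded to text, the caller stays free to pick the character encoding later, which is the property a byte-based WSGI layer would need.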
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
I'll try to digest some of this; currently I'm pretty clueless.

Personally, I find it a bit hard to get excited about Python 3 as a web application deployment platform. This is of course a personal judgment (I don't mean to slight Python 3) but at this point, I think I'll probably be writing software that targets 2.X exclusively for at least the next five years.

Given this point of view, it would be extremely helpful if someone could explain to people with the same outlook why we should want to deal with Unicode strings in any WSGI specification. WSGI is a fairly low-level protocol aimed at folks who need to interface a server to the outside world. The outside world (by its nature) talks bytes. I fear that any implied conversion of environment values and iterable return values to Unicode will actually eventually make things harder than they are now.

I realize that it would make middleware implementors' lives harder to need to deal in bytes. However, at this point, I also believe that middleware kinda should be hard. We have way too much middleware that shouldn't be middleware these days (some written by myself).

Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode? Bonus points for an explanation that does not boil down to "it will be compatible with Python 3".

- C

Armin Ronacher wrote:
> Hello everybody,
>
> Thanks to Graham Dumpleton and Robert Brewer there is some serious progress on WSGI currently. I proposed a roadmap with some PEP changes now that need some input.
>
> Summary:
>
>   WSGI 1.0 stays the same as PEP 0333 currently is
>   WSGI 1.1 becomes what Ian and I added to PEP 0333
>   WSGI 2.0 becomes a unicode powered version of WSGI 1.1
>   WSGI 3.0 becomes WSGI 2.0 just without start_response
>
> WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python 3 because of changes in the standard library that no longer work with a byte-only approach.
>
> The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/
>
> Neither the wording nor the changes in there are anywhere near final.
>
> Graham wrote down two questions he wants every major framework developer to answer. These should guide the way to new WSGI standards:
>
> 1. Do we keep bytes everywhere forever in Python 2.X, or try to introduce unicode there at all to at least mirror what changes might be made to make WSGI workable in Python 3.X?
>
> 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to WSGI 2.0 for Python 3.X?
>
> I added a new question I think should be asked too:
>
> 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to WSGI 3.0 and drop start_response?
>
> The following things became pretty clear when playing around with various specifications on Python 3:
>
> - Python 3 no longer implicitly converts between unicode and byte strings. This covers comparisons, the regular expression engine, all string functions and many modules in the stdlib.
> - The Python 3 stdlib radically moved to unicode for non unicode things as well (the http servers, http clients, url handling etc.)
> - A byte only version of WSGI appears unrealistic on Python 3 because it would require server and middleware implementors to reimplement parts of the standard library to work on bytes again.
> - unicode support can be added for WSGI on both Python 2.x and Python 3.x without removing functionality. Browsers are already doing a similar encoding trick as proposed by Graham Dumpleton to handle URLs.
> - Python 2.x already accepts unicode strings for many things such as URL handling thanks to the fact that unicode and byte strings are surprisingly interchangeable.
> - cgi.FieldStorage and some other parts are now totally broken on Python 3 and should no longer be used in 3.0 and 3.1 because it reads the response body into memory. This currently affects WebOb, Pylons and TurboGears.
>
> I sent this mail to every major framework / WSGI implementor so that we get input even if you're missing the discussion on web-sig.
>
> Regards,
> Armin
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
+1

On Sep 20, 2009, at 11:25 PM, Chris McDonough wrote:
> I'll try to digest some of this; currently I'm pretty clueless.
>
> Personally, I find it a bit hard to get excited about Python 3 as a web application deployment platform. This is of course a personal judgment (I don't mean to slight Python 3) but at this point, I think I'll probably be writing software that targets 2.X exclusively for at least the next five years.
>
> Given this point of view, it would be extremely helpful if someone could explain to people with the same outlook why we should want to deal with Unicode strings in any WSGI specification. WSGI is a fairly low-level protocol aimed at folks who need to interface a server to the outside world. The outside world (by its nature) talks bytes. I fear that any implied conversion of environment values and iterable return values to Unicode will actually eventually make things harder than they are now.
>
> I realize that it would make middleware implementors' lives harder to need to deal in bytes. However, at this point, I also believe that middleware kinda should be hard. We have way too much middleware that shouldn't be middleware these days (some written by myself).
>
> Anyway, for us slower (and maybe wrongly fearful) folks, could someone summarize the benefits of having a WSGI specification that requires Unicode? Bonus points for an explanation that does not boil down to "it will be compatible with Python 3".
>
> - C
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Hi,

Chris McDonough wrote:
> Personally, I find it a bit hard to get excited about Python 3 as a web application deployment platform.

Everybody feels that way currently. But if we don't fix WSGI that will never change.

> Given this point of view, it would be extremely helpful if someone could explain to people with the same outlook why we should want to deal with Unicode strings in any WSGI specification.

I summarized the reasons in my mail. Also have a look at the discussions on this mailing list that led to that.

Regards,
Armin