Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-24 Thread Tres Seaver

P.J. Eby wrote:

 So you better believe that everybody else is going to copy the worst 
 available examples of other people's WSGI code and ignore any 
 documentation associated with it...  and then they will expect it to 
 work on your server.  ;-)

Amen to that, brother Phil!  The million monkeys effect is massively
amplified by the ease of cut-and-paste in modern editors (In my day, we
used 'ed' or 'cat' -- you kids get off my grass!)


Tres.
--
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com

___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-23 Thread Philip Jenvey


On Sep 22, 2009, at 8:28 PM, P.J. Eby wrote:


At 05:12 PM 9/22/2009 -0700, Philip Jenvey wrote:

Because our request container is a plain, pre-fabricated dict that
doesn't permit the lazy behavior.


Not quite true; you can always write a library function,  
get_foo(environ) that does the lazy caching in a private environ  
key, at the cost of also caching the original value and doing a  
consistency check.


Sure, that's what the Werkzeug et al. WSGI 1 wrappers are already  
doing. I'm referring to the Python 3 WSGI level itself, assuming it  
returns latin1-decoded native strs. You're talking about a separate  
process on top of WSGI -- this process becomes an unnecessary  
roundtrip compared to the WSGI 1 wrappers, as Armin points out.
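
For illustration, a minimal sketch of the kind of library function
discussed above: it caches a decoded value under a private environ key,
keeps the original value next to it, and does a consistency check on
every call. The names get_path_text and example.path_text are
hypothetical, and the sketch assumes a Python 3 server that hands the
application latin1-decoded native strs.

```python
def get_path_text(environ, encoding="utf-8"):
    """Return PATH_INFO as text, caching the result in a private environ key."""
    raw = environ.get("PATH_INFO", "")
    cached = environ.get("example.path_text")  # hypothetical private key
    # Consistency check: reuse the cache only if PATH_INFO is unchanged.
    if cached is not None and cached[0] == raw:
        return cached[1]
    # Assumption: the server decoded the raw bytes as latin1, so encoding
    # back to latin1 recovers them before decoding with the real encoding.
    text = raw.encode("latin-1").decode(encoding, "replace")
    environ["example.path_text"] = (raw, text)
    return text


environ = {"PATH_INFO": "/caf\xc3\xa9"}
path = get_path_text(environ)
```

The extra encode/decode step on each first access is the roundtrip
being objected to here.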


--
Philip Jenvey


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Mark Nottingham

OK, that's quite exhaustive.

For the benefit of those of us jumping in, could you summarise your  
proposal in something like the following manner:


1. How the request method is made available to WSGI applications
2. How the request-uri is made available to WSGI applications -- in  
particular, whether any decoding of punycode and/or %-escapes happens
3. How request headers are made available to WSGI apps
4. How the request body is made available to WSGI apps
5. Likewise for how apps should expose the response status message,  
headers and body to WSGI implementations.


Cheers,


On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:

Reference?


See:

 http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

Anyone else jumping in on this conversation with their own opinions
and who has not read it, should perhaps at least read that. Also read
some of the earlier posts in the numerous discussions this spawned at:

 http://groups.google.com/group/python-web-sig?lnk=

as the current thinking isn't exactly what I blogged about and has
shifted a bit as the discussion has progressed.

Graham


On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:


Most things is not the Web. How will you handle serving images through
WSGI? Compressed content?  PDFs?


You are perhaps misunderstanding something. A WSGI application still
should return bytes.

The whole concept of any sort of fallback to allow unicode data to be
returned for response content was purely so the canonical hello world
application as per Python 2.X could still be used on Python 3.X.

So, we aren't saying that the only thing WSGI applications can return
is unicode strings for response content.

Have you read my original blog post that triggered all this discussion
this time around?

Graham


On 22/09/2009, at 1:30 AM, René Dudfield wrote:


here is a summary:
 Apart from python3 compatibility (which should be good enough
reason), utf-8 is what's used in http a lot these days.  Most things
layered on top of wsgi are using utf-8 (django etc), and lots of web
clients are using utf-8 (firefox etc).

Why not move to unicode?



--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Graham Dumpleton
2009/9/22 P.J. Eby p...@telecommunity.com:
 I'm tending to flip-flop a bit myself

For the record, I am doing that as well.

Graham


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 OK, that's quite exhaustive.

 For the benefit of those of us jumping in, could you summarise your proposal
 in something like the following manner:

 1. How the request method is made available to WSGI applications
 2. How the request-uri is made available to WSGI applications -- in
 particular, whether any decoding of punycode and/or %-escapes happens
 3. How request headers are made available to WSGI apps
 4. How the request body is made available to WSGI apps
 5. Likewise for how apps should expose the response status message, headers
 and body to WSGI implementations.

Same as the WSGI PEP.

  http://www.python.org/dev/peps/pep-0333/

Nothing has changed in that respect.

Graham



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Mark Nottingham
So, what advice do you propose about decoding bytes into strings for  
the request-URI / method / request headers, and vice versa for  
response headers and status code/phrase? Do you assume ASCII, Latin-1,  
or UTF-8? How are errors handled?


Are bodies still treated as binary byte sequences, as per PEP 333?

Cheers,

--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 So, what advice do you propose about decoding bytes into strings for the
 request-URI / method / request headers, and vice versa for response headers
 and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8? How are
 errors handled?

 Are bodies still treated as binary byte sequences, as per PEP 333?

I thought my blog post explained that reasonably well. Ensure you read
the numbered definitions.

If you can't work it out from the blog, point at the specific thing in
the blog you don't understand and I can help. I don't really want to go
explaining it all again.

Graham



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Mark Nottingham
That blog entry is eleven printed pages. Given that PEP 333 also  
prints as eleven pages from my browser, I suspect there's some  
extraneous information in there.


Could you please summarise? Requiring all comers to read such a  
voluminous entry is a considerable (and somewhat arbitrary) bar to  
entry for the discussion.


Thanks,


--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Ian Bicking
It's not a specific proposal, but here are my opinions on what a proposal
should be:

On Tue, Sep 22, 2009 at 1:06 AM, Mark Nottingham m...@mnot.net wrote:

 OK, that's quite exhaustive.

 For the benefit of those of us jumping in, could you summarise your
 proposal in something like the following manner:

 1. How the request method is made available to WSGI applications


Graham talked about it as bytes/unicode/native, where native is unicode on
Python 3 and str on Python 2.  For instance, I think there's general
consensus (though not really specifically discussed) that environ keys
should be native.

I think method should be native.


 2. How the request-uri is made available to WSGI applications -- in
 particular, whether any decoding of punycode and/or %-escapes happens


Hah, didn't even think about de-punycoding HTTP_HOST.  That'd be a blast.

I think:
* scheme as native
* HTTP_HOST as native (no decoding of punycode)
* path as native (no URL decoding) - big break with WSGI 1 and CGI, but what
the hell.  I could easily waffle on this.
* query string as native - *should* be ASCII-safe currently.

Wow, that was easy!

Request headers, which you didn't split out... those I'm not sure.  I'd
*like* them to be native.  But damn, I'm just not sure quite how.
surrogateescape?  Latin1?  Latin1 as a kind of poor man's surrogateescape
isn't so bad.  And the headers *should* be ASCII for sane requests, so it's
not a horrible compromise.  I guess libraries could lazily transcode, just
like they currently lazily decode.  But it'd be a bit obnoxious at the
library level.  Transcoding middleware would be easier, but it adds the
question of how to record that the transcoding has taken place.
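
A rough sketch of what such transcoding middleware might look like,
assuming request headers arrive as latin1-decoded native strings on
Python 3. The marker key example.headers_transcoded is hypothetical; it
is only one possible answer to how to record that the transcoding has
taken place.

```python
def transcode_headers(app, encoding="utf-8"):
    """Wrap a WSGI app so HTTP_* values are re-decoded from latin1 to `encoding`."""
    def middleware(environ, start_response):
        # Hypothetical marker key: skip the work if an outer layer already did it.
        if not environ.get("example.headers_transcoded"):
            for key, value in list(environ.items()):
                if key.startswith("HTTP_") and isinstance(value, str):
                    try:
                        environ[key] = value.encode("latin-1").decode(encoding)
                    except UnicodeError:
                        pass  # not valid in `encoding`: keep the latin1 form
            environ["example.headers_transcoded"] = encoding
        return app(environ, start_response)
    return middleware
```

Values that are plain ASCII come out unchanged, so sane requests are
unaffected by the wrapper.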


 3. How request headers are made available to WSGI apps


Request handlers?  I don't understand your terminology.


 4. How the request body is made available to WSGI apps


Ugh.  wsgi.input could remain.  I think at least it should become a
file-like interface (i.e., giving an empty string when the content is
exhausted) and I might even ask that it implement .tell() (.seek() would be
nice of course, but optional).  If there was some other idea, I think
there's room for improvement on wsgi.input and the file interface.

wsgi.input should definitely work with bytes only.  I believe this is
consensus.
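
As a sketch of what the file-like behaviour would allow, assuming
wsgi.input returns an empty bytes object once the content is exhausted
(which is the suggestion above, not something PEP 333 guarantees), an
application could read the body in chunks like this. The helper name
read_body is hypothetical, and io.BytesIO stands in for a real server's
input stream.

```python
import io


def read_body(environ, chunk_size=8192):
    """Read the whole request body as bytes from a file-like wsgi.input."""
    stream = environ["wsgi.input"]
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes signals that the content is exhausted
            break
        chunks.append(chunk)
    return b"".join(chunks)


environ = {"wsgi.input": io.BytesIO(b"name=value")}
body = read_body(environ)
```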


 5. Likewise for how apps should expose the response status message, headers
 and body to WSGI implementations.


I believe there is consensus that the response body should remain an
iterator that yields bytes.

In one way, it'd be nice if we'd just say that status/headers should be
ASCII, because that's the reasonable choice.  But for proxying or
representing HTTP as it is, it's not always the case.  And I'm committed
to keeping WSGI fully capable of representing arbitrary requests and
responses so long as they aren't entirely diabolical.

But, an ASCII status is not unreasonable, especially since there's zero
semantic meaning to the reason.  Which makes native strings perfectly fine.
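
To make the shape concrete, a minimal sketch of an application under
these opinions on Python 3: the status and headers are ASCII-only
native strings, and the body is an iterable of bytes. The name
hello_app is only illustrative.

```python
def hello_app(environ, start_response):
    """Native-string status and headers, bytes-only response body."""
    status = "200 OK"  # ASCII-only native string
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    start_response(status, headers)
    # The body is encoded explicitly; only bytes are yielded to the server.
    return ["Hello, world!\n".encode("utf-8")]
```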

So, headers...

Well, Latin1 is easy enough.  In theory, or at least particular theories,
headers can be Latin1.  And you can represent arbitrary bytes that way.  So
if you want to send crazy stuff to the browser, you can do it that way.  And
if you want to stick to plain ASCII then that's easy enough as well.  So...
native?  str or unicode?  I'm not sure specifically for this one.
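
The claim that Latin1 can represent arbitrary bytes can be checked
directly on Python 3: every byte value maps to exactly one code point
and back. The header value below is a made-up example.

```python
# Every possible byte survives a latin1 decode/encode round trip.
raw = bytes(range(256))
as_text = raw.decode("latin-1")
assert as_text.encode("latin-1") == raw

# A UTF-8 encoded value carried through a latin1-decoded native string.
utf8_value = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'
header_value = utf8_value.decode("latin-1")  # 'caf\xc3\xa9', shown as mojibake
recovered = header_value.encode("latin-1").decode("utf-8")
```

The intermediate string looks wrong when printed, but no information is
lost, which is what makes Latin1 usable as a transport form.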


-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 That blog entry is eleven printed pages. Given that PEP 333 also prints as
 eleven pages from my browser, I suspect there's some extraneous information
 in there.

 Could you please summarise? Requiring all comers to read such a voluminous
 entry is a considerable (and somewhat arbitrary) bar to entry for the
 discussion.

If you aren't willing to read the PEP to understand WSGI why are you
even wanting to participate in the discussion in the first place? This
is a quite detailed discussion about the future of the WSGI
specification and not an IRC channel manned by ticket monkeys. :-(

Graham



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Mark Nottingham
You're twisting my words; nowhere did I say I wasn't willing to read  
the PEP. What I did say was that a proposal can and should be made in  
less than eleven pages. I'd like to give my feedback, both because I  
use Python and because I have some interest in HTTP. However, my time  
is limited, and I already have a stack of other things to review on my  
desk.

He who writes the most words does not (hopefully, for the sake of the  
Python community) win. I appreciate that you've taken the time to  
reason out a proposal, but the minutiae of how you got to that place  
should not obscure the proposal itself.

I'm not sure how to take your "ticket monkeys" comment, so I'll ignore  
it.




On 22/09/2009, at 4:44 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:
That blog entry is eleven printed pages. Given that PEP 333 also  
prints as
eleven pages from my browser, I suspect there's some extraneous  
information

in there.

Could you please summarise? Requiring all comers to read such a  
voluminous

entry is a considerable (and somewhat arbitrary) bar to entry for the
discussion.


If you aren't willing to read the PEP to understand WSGI why are you
even wanting to participate in the discussion in the first place? This
is a quite detailed discussion about the future of the WSGI
specification and not an IRC channel manned by ticket monkeys. :-(

Graham


Thanks,


On 22/09/2009, at 4:36 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:


So, what advice do you propose about decoding bytes into strings  
for the

request-URI / method / request headers, and vice versa for response
headers
and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8?  
How are

errors handled?

Are bodies still treated as binary byte sequences, as per PEP  
333?


I thought my blog post explained that reasonably well. Ensure you  
read

the numbered definitions.

If you can't work it out from the blog, point at the specific  
thing in

the blog you don't understand and can help. Don't really want to go
explaining it all again.

Graham


Cheers,

On 22/09/2009, at 4:07 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:


OK, that's quite exhaustive.

For the benefit of those of us jumping in, could you summarise  
your

proposal
in something like the following manner:

1. How the request method is made available to WSGI applications
2. How the request-uri is made available to WSGI applications  
-- in
particular, whether any decoding of punycode and/or %-escapes  
happens

3. How request headers are made available to WSGI apps
4. How the request body is made available to to WSGI apps
5. Likewise for how apps should expose the response status  
message,

headers
and body to WSGI implementations.


Same as the WSGI PEP.

 http://www.python.org/dev/peps/pep-0333/

Nothing has changed in that respect.

Graham


Cheers,


On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:


Reference?


See:




 http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

Anyone else jumping in on this conversation with their own  
opinions
and who has not read it, should perhaps at least read that.  
Also read
some of the earlier posts in the numerous discussions this  
spawned at:


 http://groups.google.com/group/python-web-sig?lnk=

as the current thinking isn't exactly what I blogged about and  
has

shifted a bit as the discussion has progressed.

Graham


On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:


Most things is not the Web. How will you handle serving  
images

through
WSGI?
Compressed content?  PDFs?


You are perhaps misunderstanding something. A WSGI application still
should return bytes.

The whole concept of any sort of fallback to allow unicode data to be
returned for response content was purely so the canonical hello world
application as per Python 2.X could still be used on Python 3.X.

So, we aren't saying that the only thing WSGI applications can return
is unicode strings for response content.

Have you read my original blog post that triggered all this discussion
this time around?

Graham


On 22/09/2009, at 1:30 AM, René Dudfield wrote:


here is a summary:
 Apart from python3 compatibility (which should be good enough
reason), utf-8 is what's used in http a lot these days.  Most things
layered on top of wsgi are using utf-8 (django etc), and lots of web
clients are using utf-8 (firefox etc).

Why not move to unicode?



--
Mark Nottingham http://www.mnot.net/






Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Ian]
 OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO and
 introduce two equivalent variables that hold the NOT url-decoded values.

[Graham]
 That may be fine for pure Python web servers where you control the
 split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place,
 but you don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc., as
 that is done by the web server. Also, as pointed out in my blog,
 because of rewrites in the web server, it may be difficult to try and map
 SCRIPT_NAME and PATH_INFO back into the REQUEST_URI provided to try and
 reclaim the original characters. There is also the problem that FASTCGI
 often totally stuffs up the SCRIPT_NAME/PATH_INFO split anyway, and
 manual overrides are needed to tweak them.

This applies doubly under Java servlets, where different containers
take different approaches to solve these rather hard problems. It is
worth noting that they have to do so because the java servlet spec,
even under the most recent 2.5,  punts on *all* of the issues being
discussed here.

See here for how Tomcat does it. Or half does it, messily.

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

I know this is not helpful ;-)

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Ian]
 When things get messed up I recommend people use a middleware
 (paste.deploy.config.PrefixMiddleware, though I don't really care what they
 use) to fix up the request to be correct.  Pulling it from REQUEST_URI would
 be fine.

That would be unworkable under java servlet containers, since they
each take a different approach to addressing encoding issues, or fail
to deal with them entirely.

So there would probably have to be a special case for every single one of these

http://en.wikipedia.org/wiki/List_of_Servlet_containers

Each of which has a number of different ways of being configured in
relation to these issues.

I don't know if it would even be possible to write such a middleware.

And retain all of one's hair.

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 You're twisting my words; nowhere did I say I wasn't willing to read the
 PEP. What I did say was that a proposal can and should be made in less than
 eleven pages; I'd like to give my feedback, both because I use Python and
 because I have some interest in HTTP. However, my time is limited, and I
 already have a stack of other things to review on my desk.

 He who writes the most words does not (hopefully, for the sake of the Python
 community) win. I appreciate that you've taken the time to reason out a
 proposal, but the minutiae of how you got to that place should not obscure
 the proposal itself.

 I'm not sure how to take your "ticket monkeys" comment, so I'll ignore it.

Sorry if I come across as being short.

None of us has time and this whole WSGI on Python 3.0 issue has been
going on since the start of last year. Many of us are quite tired of it
all. I also don't personally know who you are, and don't recollect seeing
your name in any past discussions. I am told though that you were involved
back at the time of the original WSGI specification drafting, so apologies.

The ticket monkeys reference is just the allusion to a help desk. I
always think of what happens when people jump on IRC as being worst
case. That is, they treat people there like help desk staff who only
exist to serve them and not anyone else. So, you see people who have a
complex problem pose a question in a single line. They then expect an
even more complex solution to their problem, usually expressed in one
line again.

There is a book I have been meaning to read called the 'Trusted
Advisor' which apparently discusses providing assistance to others
by contrasting the idea of being a ticket monkey (help desk)
with building a relationship with people in order to understand
their real issues and provide better solutions. Obviously being an
advisor rather than a help desk is ultimately going to be better for
the people needing help, but if the customer has the frame of mind
that you are just the help desk and doesn't want to put any effort into
the relationship, it is hard to try and be that advisor.

So, I felt a bit like a help desk in the way I interpreted your comments.

Graham

 On 22/09/2009, at 4:44 PM, Graham Dumpleton wrote:

 2009/9/22 Mark Nottingham m...@mnot.net:

 That blog entry is eleven printed pages. Given that PEP 333 also prints
 as eleven pages from my browser, I suspect there's some extraneous
 information in there.

 Could you please summarise? Requiring all comers to read such a
 voluminous entry is a considerable (and somewhat arbitrary) bar to entry
 for the discussion.

 If you aren't willing to read the PEP to understand WSGI why are you
 even wanting to participate in the discussion in the first place? This
 is a quite detailed discussion about the future of the WSGI
 specification and not an IRC channel manned by ticket monkeys. :-(

 Graham

 Thanks,


 On 22/09/2009, at 4:36 PM, Graham Dumpleton wrote:

 2009/9/22 Mark Nottingham m...@mnot.net:

 So, what advice do you propose about decoding bytes into strings for
 the request-URI / method / request headers, and vice versa for response
 headers and status code/phrase? Do you assume ASCII, Latin-1, or UTF-8?
 How are errors handled?

 Are bodies still treated as binary byte sequences, as per PEP 333?

 I thought my blog post explained that reasonably well. Ensure you read
 the numbered definitions.

 If you can't work it out from the blog, point at the specific thing in
 the blog you don't understand and can help. Don't really want to go
 explaining it all again.

 Graham

 Cheers,

 On 22/09/2009, at 4:07 PM, Graham Dumpleton wrote:

 2009/9/22 Mark Nottingham m...@mnot.net:

 OK, that's quite exhaustive.

 For the benefit of those of us jumping in, could you summarise your
 proposal in something like the following manner:

 1. How the request method is made available to WSGI applications
 2. How the request-uri is made available to WSGI applications -- in
    particular, whether any decoding of punycode and/or %-escapes happens
 3. How request headers are made available to WSGI apps
 4. How the request body is made available to WSGI apps
 5. Likewise for how apps should expose the response status message,
    headers and body to WSGI implementations.

 Same as the WSGI PEP.

  http://www.python.org/dev/peps/pep-0333/

 Nothing has changed in that respect.

 Graham

 Cheers,


 On 22/09/2009, at 12:26 PM, Graham Dumpleton wrote:

 2009/9/22 Mark Nottingham m...@mnot.net:

 Reference?

 See:





  http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

 Anyone else jumping in on this conversation with their own opinions
 and who has not read it, should perhaps at least read that. Also read
 some of the earlier posts in the numerous discussions this spawned at:

  http://groups.google.com/group/python-web-sig?lnk=

 as the current thinking isn't exactly what I blogged about and has
 shifted a bit 

Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Armin Ronacher
Hi,

P.J. Eby schrieb:
 Actually, latin-1 bytes encoding is the *simplest* thing that could 
 possibly work, since it works already in e.g. Jython, and is actually 
 in the spec already...  and any framework that wants unicode URIs 
 already has to decode them, so the code is already written. 
Except that nobody implements that and that Jython has a standard Python
2.x byte string.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Mark Nottingham


On 22/09/2009, at 6:11 PM, Armin Ronacher wrote:


Hi,

Mark Nottingham schrieb:

HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but
HTTPbis currently takes the stance that they're ASCII, as in practice
Latin-1 isn't used and may introduce interop problems.

In practice non-ascii data ends up in headers.


Yes. However, it shouldn't be encouraged.




What does it mean to support non-ASCII headers? As per above, the
only sane thing to do is treat them as opaque data, because you can't
be certain of their encoding unless you have knowledge of the header.

Here is what http.server does in Python 3 (actual code):

    def send_header(self, keyword, value):
        """Send a MIME header."""
        if self.request_version != 'HTTP/0.9':
            self.wfile.write(("%s: %s\r\n" % (keyword,
                value)).encode('ASCII', 'strict'))

        if keyword.lower() == 'connection':
            if value.lower() == 'close':
                self.close_connection = 1
            elif value.lower() == 'keep-alive':
                self.close_connection = 0

So it will give you a nice UnicodeEncodeError if you try to send
anything outside of the ASCII range as header.
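
A minimal sketch of that failure mode, outside of http.server. The helper
name encode_header_line is hypothetical and only mirrors the string
formatting and strict ASCII encoding from the quoted method; it is not part
of the standard library.

```python
def encode_header_line(keyword, value):
    # Same formatting and strict ASCII encoding as the quoted send_header.
    return ("%s: %s\r\n" % (keyword, value)).encode("ascii", "strict")

# A pure-ASCII header encodes without trouble.
ascii_line = encode_header_line("Content-Type", "text/plain")

# A value containing a non-ASCII character raises UnicodeEncodeError.
try:
    encode_header_line("X-Name", "Ren\u00e9")
    non_ascii_rejected = False
except UnicodeEncodeError:
    non_ascii_rejected = True
```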



Ouch.


--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Armin Ronacher
Hi,

Ian Bicking schrieb:
 Request headers, which you didn't split out... those I'm not sure.  I'd
 *like* them to be native.  But damn, I'm just not sure quite how.
 surrogateescape?  Latin1?  Latin1 as a kind of poor man's surrogateescape
 isn't so bad.  And the headers *should* be ASCII for sane requests, so it's
 not a horrible compromise.
Except for cookie headers.  Thanks to advertising and all the other
systems putting headers on your page you can't even properly control that
one.

Another thing to consider: in Python 3.1, the HTTP server internally
decodes to latin1 and there is no simple way to change that, unless you
replace the implementation.

 Ugh.  wsgi.input could remain.  I think at least it should become a
 file-like interface (i.e., giving an empty string when the content is
 exhausted) and I might even ask that it implement .tell() (.seek() would be
 nice of course, but optional).  If there was some other idea, I think
 there's room for improvement on wsgi.input and the file interface.
-1 on seek and tell.  This could be impossible to implement and what we
really want to do is to not have the data in memory but on disk or
wherever you put big-ass uploads.  Also it will be hard to test for an
available seek or not, because even if it's a noop, the method could be
there.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[P.J. Eby]
 Actually, latin-1 bytes encoding is the *simplest* thing that could
 possibly work, since it works already in e.g. Jython, and is actually
 in the spec already...  and any framework that wants unicode URIs
 already has to decode them, so the code is already written.

[Armin]
 Except that nobody implements that

So, if nobody implements that, then why are we trying to standardise it?

Is there a real need out there?

Or are all these discussions solely driven by the need/desire to have
only unicode strings in the WSGI dictionary under python 3?

Which is a worthy goal, IMHO. Java has been there since the very
start, since java strings have always been unicode. Take a look at the
java docs for HttpServlet: no methods return bytes/bytearrays.

http://java.sun.com/products/servlet/2.5/docs/servlet-2_5-mr2/javax/servlet/http/HttpServletRequest.html

But the java servlet spec still ignores *all* of the encoding concerns
being discussed here. Which means that mistakes/mojibake must happen
all the time. And it's up to the author of the individual java web
application to solve those problems, using a mechanism appropriate for
their needs and local environment.

Java programmers just tolerate this, although they may curse the
developers of the servlet spec for not having solved their specific
problem for them.

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Ian Bicking
On Tue, Sep 22, 2009 at 3:16 AM, Armin Ronacher armin.ronac...@active-4.com
 wrote:

 Hi,

 Ian Bicking schrieb:
  Request headers, which you didn't split out... those I'm not sure.  I'd
  *like* them to be native.  But damn, I'm just not sure quite how.
  surrogateescape?  Latin1?  Latin1 as a kind of poor man's surrogateescape
  isn't so bad.  And the headers *should* be ASCII for sane requests, so
  it's not a horrible compromise.
 Except for cookie headers.  Thanks to advertising and all the other
 systems putting headers on your page you can't even properly control that
 one.


Yes, but it'd be relatively easy to handle this, especially since the raw
header isn't very useful.  So you just do
environ['HTTP_COOKIE'].encode('latin1').decode('utf8', 'replace') before
parsing.
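
A minimal sketch of that re-decoding step, assuming the server handed the
application a str produced by decoding the raw header bytes as latin-1. The
helper name redecode_header and the sample cookie value are made up for
illustration.

```python
def redecode_header(value, charset="utf-8"):
    """Undo a latin-1 decode done by the server, then decode as `charset`."""
    # latin-1 maps every code point 0-255 back to the same byte, so the
    # encode step recovers the bytes that were on the wire.
    return value.encode("latin-1").decode(charset, "replace")

# Bytes as a browser might send them (UTF-8 encoded cookie value).
wire_bytes = "name=Ren\u00e9".encode("utf-8")

# What a latin-1 decoding server would place in the environ.
environ = {"HTTP_COOKIE": wire_bytes.decode("latin-1")}

cookie = redecode_header(environ["HTTP_COOKIE"])
```

The environ value looks like mojibake until it is re-decoded; the 'replace'
error handler keeps parsing from failing on bytes that are not valid UTF-8.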

Another thing to consider: in Python 3.1, the HTTP server internally
 decodes to latin1 and there is no simple way to change that, unless you
 replace the implementation.

  Ugh.  wsgi.input could remain.  I think at least it should become a
  file-like interface (i.e., giving an empty string when the content is
  exhausted) and I might even ask that it implement .tell() (.seek() would
  be nice of course, but optional).  If there was some other idea, I think
  there's room for improvement on wsgi.input and the file interface.
 -1 on seek and tell.  This could be impossible to implement and what we
 really want to do is to not have the data in memory but on disk or
 wherever you put big-ass uploads.  Also it will be hard to test for an
 available seek or not, because even if it's a noop, the method could be
 there.


Tell doesn't have particular overhead except to keep track of how many bytes
have been read.  That would allow libraries to at least detect contention
for wsgi.input.  I wish seek were detectable, though I agree it shouldn't be
required at all.
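
A rough sketch of how cheap such a tell() could be, assuming a server wraps
its input stream in something like the hypothetical CountingInput class
below. Nothing here is taken from an existing server; io.BytesIO merely
stands in for the request body.

```python
import io

class CountingInput:
    """Hypothetical wsgi.input wrapper that only counts the bytes read."""

    def __init__(self, stream):
        self._stream = stream
        self._pos = 0

    def read(self, size=-1):
        data = self._stream.read(size)
        self._pos += len(data)  # the only bookkeeping tell() needs
        return data

    def tell(self):
        return self._pos

body = CountingInput(io.BytesIO(b"a=1&b=2"))
first = body.read(3)

# A library handed this stream can detect that somebody already read from it.
already_consumed = body.tell() != 0
```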

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Armin Ronacher
Hi,

Alan Kennedy schrieb:
 So, if nobody implements that, then why are we trying to standardise it?
I think that was just one of the ideas that were discussed.

Just to sum it up a bit where we already went:

- my initial plan was going bytes everywhere.  Turns out, on Python 3
  this is nearly impossible to do because the majority of the standard
  library went down a unicode path, even where bytes would be more
  appropriate (like cgi.FieldStorage, urllib.parse etc.)

- Graham, Robert (and now me as well) try to get charset guessing for
  URLs going, decide on latin1 for the HTTP headers.  latin1 could be
  re-decoded by the application if it really thinks it wanted utf-8
  for instance.  (As with cookie headers, though I guess only there)

- One idea is enforcing unicode for all Python versions

- One idea is going unicode for Python 3 and bytestrings for Python 2

- New (and old) discussions bring up the surrogate escapes.

So it's quite hard to follow because different people talk about
different ideas at the same time.  And so far none of them looks really
compelling.

 Is there a real need out there?
In python 3, yes.  Because the stdlib no longer works with bytes and the
bytes object has few string semantics left.


 Which is a worthy goal, IMHO. Java has been there since the very
 start, since java strings have always been unicode. Take a look at the
 java docs for HttpServlet: no methods return bytes/bytearrays.
And people appear to have problems with that, because what they are
doing is using a specified charset that is by default iso-8859-1:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

 Java programmers just tolerate this, although they may curse the
 developers of the servlet spec for not having solved their specific
 problem for them.
Many Java apps are also still using latin1 only or have all kinds of
problems with charsets.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Alan]
 Is there a real need out there?

[Armin]
 In python 3, yes.  Because the stdlib no longer works with bytes and the
 bytes object has few string semantics left.

Why can't we just do the same as the java servlet spec? I.E.

1. Ignore the encoding issues being discussed
2. Give the programmer (possibly mojibake) unicode strings in the WSGI
environ anyway
3. And let them solve their problems themselves, using server
configuration or bespoke middleware

[Alan]
 Java programmers just tolerate this, although they may curse the
 developers of the servlet spec for not having solved their specific
 problem for them.

[Armin]
 Many Java apps are also still using latin1 only or have all kinds of
 problems with charsets.

My point exactly.

Many web developers simply never have to deal with these issues,
perhaps a majority.

The ones that do have to sort it out for themselves.

To do so, the publishers of the various containers give them
(non-standard) options to control the decoding of the incoming request
and all of its component parts: you cited the Tomcat approach above.
Other containers do it differently. Which means that i18n knowledge is
not portable between containers.

It would be nice if we could avoid such a situation with i18n and WSGI.

But I suppose I'm a little dubious that this group can out-do the
enormous java community, and the enormous financial resources that
Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve
this complex problem satisfactorily.

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Armin]
 Because that problem was solved long ago in applications themselves.
 Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating
 on unicode.  And the way they do that is straightforward.

So what are we all discussing?

Those frameworks obviously have solved all of the problems of decoding
incoming request components, e.g.

1. SCRIPT_NAME
2. PATH_INFO
3. QUERY_STRING
4. Etc

from miscellaneous unknown character sets into unicode, without any
mistakes, under all possible WSGI environments, e.g.

1. Mod_wsgi
2. Modjy (java servlets)
3. IIS
4. CGI
5. FCGI
6. Etc

So why not just adopt one of those mechanisms, e.g. Django, and make
it the de-facto standard? Since they all deliver unicode, python 3 is
no longer a problem, since it permits only unicode strings.

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Armin]
 No, they know the character sets.

Hmmm, define "know" ;-)

[Armin]
 You tell them what character set you
 want to use.  For example you can specify utf-8, and they will
 decode/encode from/to utf-8.  But there is no way for the application to
 send information to the server before they are invoked to tell the
 server what encoding they want to use.

I see this as being the same as Graham's suggested approach of a
per-server configurable charset, which is then stored in the WSGI
dictionary, so that applications that have problems, i.e. that detect
mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo
the faulty decoding by the server.
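
A sketch of what that "undo" step might look like on the application side.
The environ key "wsgi.uri_encoding" and the helper recover_path are
assumptions made for illustration; the thread does not settle on a key name,
and the recovery only works when the server's decoding was lossless (as
latin-1 is).

```python
import codecs

# Hypothetical environ key naming the charset the server used for URLs.
URI_ENCODING_KEY = "wsgi.uri_encoding"

def recover_path(environ, wanted="utf-8"):
    """Return PATH_INFO re-decoded as `wanted`, or unchanged if that fails."""
    path = environ["PATH_INFO"]
    used = environ.get(URI_ENCODING_KEY, "latin-1")
    if codecs.lookup(used).name == codecs.lookup(wanted).name:
        return path  # the server already decoded with the wanted charset
    try:
        return path.encode(used).decode(wanted)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return path  # cannot undo the decoding; keep the server's value

# UTF-8 bytes for "/caf\u00e9" that the server decoded as latin-1.
environ = {"PATH_INFO": "/caf\u00c3\u00a9", URI_ENCODING_KEY: "latin-1"}
recovered = recover_path(environ)
```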

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread René Dudfield
On Tue, Sep 22, 2009 at 10:06 AM, Alan Kennedy a...@xhaus.com wrote:
 [Alan]
 Is there a real need out there?

 [Armin]
 In python 3, yes.  Because the stdlib no longer works with bytes and the
 bytes object has few string semantics left.

 Why can't we just do the same as the java servlet spec? I.E.

 1. Ignore the encoding issues being discussed
 2. Give the programmer (possibly mojibake) unicode strings in the WSGI
 environ anyway
 3. And let them solve their problems themselves, using server
 configuration or bespoke middleware

 [Alan]
 Java programmers just tolerate this, although they may curse the
 developers of the servlet spec for not having solved their specific
 problem for them.

 [Armin]
 Many Java apps are also still using latin1 only or have all kinds of
 problems with charsets.

 My point exactly.

 Many web developers simply never have to deal with these issues,
 perhaps a majority.

 The ones that do have to sort it out for themselves.

 To do so, the publishers of the various containers give them
 (non-standard) options to control the decoding of the incoming request
 and all of its component parts: you cited the Tomcat approach above.
 Other containers do it differently. Which means that i18n knowledge is
 not portable between containers.

 It would be nice if we could avoid such a situation with i18n and WSGI.

 But I suppose I'm a little dubious that this group can out-do the
 enormous java community, and the enormous financial resources that
 Sun, IBM, Oracle, etc, etc, plough into it. And still failed to solve
 this complex problem satisfactorily.

 Alan.


I think it's worth discussing and working something out that's good
(good in various ways).

As this is a python group, I think most of us think python does a
whole bunch of things better than java (maybe wrongly... but still) ;-)

cu,


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Armin Ronacher
Hi,

Alan Kennedy schrieb:
 Hmmm, define "know" ;-)
The charset of incoming data, the charset of URLs, the charset of
outgoing data, the charset of whatever the application uses, is what the
application decides it to be.  Most new applications go with utf-8 for
everything these days.

 I see this as being the same as Graham's suggested approach of a
 per-server configurable charset, which is then stored in the WSGI
 dictionary.
SCRIPT_NAME and PATH_INFO are different because URLs as entered by the
user will always be utf-8 in modern browsers.  Even if the application
decides to have latin1 URLs.

Of course a server configuration variable would be a solution for many
of these problems, but I don't like the idea of changing application
behavior based on server configuration.  At that point we will finally
have successfully killed the idea of nested WSGI applications, because
those could depend on different charsets.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Alan Kennedy
[Armin]
 Of course a server configuration variable would be a solution for many
 of these problems, but I don't like the idea of changing application
 behavior based on server configuration.

So you don't like the way that Django, Werkzeug, WebOb, etc, do it
now, even though they appear to be mostly successful, and you're happy
to cite them as such?

From the application's point of view, a framework-level configuration
variable is the same as a server-level configuration variable.

 At that point we will finally
 have successfully killed the idea of nested WSGI applications, because
 those could depend on different charsets.

Wouldn't well-written applications depend on unicode?

The server configured charset is simply an explicit statement of the
character set from which incoming requests are to be decoded, into
unicode, and no other character set.

Alan.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread René Dudfield
On Tue, Sep 22, 2009 at 12:12 PM, Armin Ronacher
armin.ronac...@active-4.com wrote:
 Hi,

 Alan Kennedy schrieb:
 So you don't like the way that Django, Werkzeug, WebOb, etc, do it
 now, even though they appear to be mostly successful, and you're happy
 to cite them as such?
 Server != Application.

 From the application's point of view, a framework-level configuration
 variable is the same as a server-level configuration variable.
 It is not.  I can configure my framework from within Python code, but I
 cannot change the webserver configuration from there.

 Wouldn't well-written applications depend on unicode?
 Only internally.  There is no such thing as Unicode in HTTP.


hi,

other points I agree with...

However, remember that there is unicode in HTTP these days.  As per
previous conversation on RFCs stating so... and real world use of
unicode in HTTP.

cheers,


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Massimo Di Pierro

Thank you Armin, this makes things clear to me (a newbie here).

On Sep 22, 2009, at 3:29 AM, Armin Ronacher wrote:

- my initial plan was going bytes everywhere.  Turns out, on Python 3
 this is nearly impossible to do because the majority of the standard
 library went an unicode path, even where bytes would be more
 appropriate (like cgi.FieldStorage, urllib.parse etc.)


I would have taken the same stand.


- Graham, Robert (and now me as well) try to get charset guessing for
 URLs going, decide on latin1 for the HTTP headers.  latin1 could be
 re-decoded by the application if it really thinks it wanted utf-8
 for instance.  (Like cookie headers, only I guess only there)


If wsgi guesses the charset before will the application always be able  
to derive the original strings?
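
A small sketch bearing on that question, using only built-in codecs. It
suggests the answer depends on the charset: a latin-1 decode can always be
reversed, while a UTF-8 decode that replaces invalid bytes cannot.

```python
# Every possible byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
latin1_roundtrip = all_bytes.decode("latin-1").encode("latin-1")

# A lone 0xE9 byte (latin-1 for e-acute) is not valid UTF-8. Decoding with
# 'replace' turns it into U+FFFD, so the original byte is lost.
lossy_roundtrip = b"\xe9".decode("utf-8", "replace").encode("utf-8")
```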



- One idea is enforcing unicode for all Python versions

- One idea is going unicode for Python 3 and bytestrings for Python 2


For what it matters I prefer the latter option.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 04:44 PM 9/22/2009 +1000, Graham Dumpleton wrote:

2009/9/22 Mark Nottingham m...@mnot.net:
 That blog entry is eleven printed pages. Given that PEP 333 also prints as
 eleven pages from my browser, I suspect there's some extraneous information
 in there.

 Could you please summarise? Requiring all comers to read such a voluminous
 entry is a considerable (and somewhat arbitrary) bar to entry for the
 discussion.

If you aren't willing to read the PEP to understand WSGI why are you
even wanting to participate in the discussion in the first place? This
is a quite detailed discussion about the future of the WSGI
specification and not an IRC channel manned by ticket monkeys. :-(


Um, Graham, Mark was a major contributor to the original PEP.  See:

   http://www.python.org/dev/peps/pep-0333/#acknowledgements

I assure you, he's read the PEP quite thoroughly.  ;-)



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 09:23 AM 9/22/2009 +0100, Alan Kennedy wrote:

[P.J. Eby]
 Actually, latin-1 bytes encoding is the *simplest* thing that could
 possibly work, since it works already in e.g. Jython, and is actually
 in the spec already...  and any framework that wants unicode URIs
 already has to decode them, so the code is already written.

[Armin]
 Except that nobody implements that

So, if nobody implements that, then why are we trying to standardise it?

Is there a real need out there?

Or are all these discussions solely driven by the need/desire to have
only unicode strings in the WSGI dictionary under python 3?

Which is a worthy goal, IMHO. Java has been there since the very
start, since java strings have always been unicode. Take a look at the
java docs for HttpServlet: no methods return bytes/bytearrays.

http://java.sun.com/products/servlet/2.5/docs/servlet-2_5-mr2/javax/servlet/http/HttpServletRequest.html

But the java servlet spec still ignores *all* of the encoding concerns
being discussed here. Which means that mistakes/mojibake must happen
all the time. And it's up to the author of the individual java web
application to solve those problems, using a mechanism appropriate for
their needs and local environment.


Right, and we're not going to be able to solve all the problems 
either.  What we want -- or at least what *I* want, is to ensure that 
the design doesn't generate NEW opportunities for f***ing it up.




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 11:28 AM 9/22/2009 +0200, Armin Ronacher wrote:

Hi,

Alan Kennedy schrieb:
 2. Give the programmer (possibly mojibake) unicode strings in the WSGI
 environ anyway
 3. And let them solve their problems themselves, using server
 configuration or bespoke middleware
Because that problem was solved long ago in the applications themselves.
Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating
on unicode.  And the way they do that is straightforward.

Now currently what we have to do on Python 3 is to encode the data again
and decode it with the target charset.  Unnecessary roundtrips that just
slow the whole thing down.  What for?


What roundtrips?  If they're operating on unicode, either they're in 
violation of the spec (in which case, f*** them), or they're already 
running a decode every time they pull something out of the 
environ...  and using latin-1 or surrogates is only one encoding call 
different from what they're doing now.


So if anybody really cares about that one extra encode(), write a C 
function to do the transcode in a single step and add it to the 
stdlib for 2.x and 3, as well as publishing a standalone 
version.  Voila.  We're done and outta here.
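
A minimal sketch of the "one encoding call" under discussion, assuming a Python 3 server that has latin-1 decoded the request path into a native str. The helper name recode and the sample path are illustrative only and not part of any spec:

```python
def recode(value, charset="utf-8"):
    """Undo the server's latin-1 decoding, then decode with the real charset."""
    # latin-1 maps code points 0-255 one-to-one onto bytes, so this
    # encode step recovers the original request bytes exactly.
    raw = value.encode("latin-1")
    return raw.decode(charset)


# Simulate what such a server would place in the environ for a
# request path whose bytes were UTF-8 encoded.
environ = {"PATH_INFO": "/caf\u00e9".encode("utf-8").decode("latin-1")}

path = recode(environ["PATH_INFO"])
```

The extra encode() is the step a single-pass transcode function would fold away.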




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:

I see this as being the same as Graham's suggested approach of a
per-server configurable charset, which is then stored in the WSGI
dictionary, so that applications that have problems, i.e. that detect
mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo
the faulty decoding by the server.


This puts the burden on the wrong end of the pipe: there are more 
apps than servers and they would *all* have to check this in order to be sane.




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 03:22 PM 9/22/2009 +0100, René Dudfield wrote:

On Tue, Sep 22, 2009 at 3:07 PM, P.J. Eby p...@telecommunity.com wrote:
 At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:

 I see this as being the same as Graham's suggested approach of a
 per-server configurable charset, which is then stored in the WSGI
 dictionary, so that applications that have problems, i.e. that detect
 mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo
 the faulty decoding by the server.

 This puts the burden on the wrong end of the pipe: there are more apps than
 servers and they would *all* have to check this in order to be sane.


Except most everyone is using unicode in their apps already through 
frameworks.


Great, so only the frameworks need to change, and if we use utf8 
surrogateescape, only the applications which need non-utf8 encoding 
will need to do anything differently.


That's one factor weighing towards PEP 383, vs. continuing with 
latin-1 or going to bytes.


(Frankly, though, I'm getting tired of this handwaving about these 
frameworks that use unicode.  If they are putting objects of type 
'unicode' under WSGI-defined environ keys on Python 2.x, they are 
*not WSGI compliant*.  And conversely, if they are doing some kind of 
conversion already, it's not gonna kill them to do a slightly 
different conversion to support the new version of WSGI.)
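
For reference, a small sketch of how the PEP 383 surrogateescape handler behaves, assuming Python 3.1 or later. The byte strings are made-up sample paths:

```python
# Valid UTF-8 decodes normally, so UTF-8 applications need no extra step.
utf8_path = b"/caf\xc3\xa9".decode("utf-8", "surrogateescape")

# A latin-1 encoded path is not valid UTF-8: the stray byte 0xE9 is
# smuggled through as the lone surrogate U+DCE9 instead of raising.
raw = b"/caf\xe9"
escaped = raw.decode("utf-8", "surrogateescape")

# Only an application that wants a non-UTF-8 charset has to do more:
# re-encode to get the original bytes back, then decode as it sees fit.
recovered = escaped.encode("utf-8", "surrogateescape")
latin1_path = recovered.decode("latin-1")
```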




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Robert Brewer
P.J. Eby [mailto:p...@telecommunity.com]
 At 07:40 PM 9/21/2009 -0700, Robert Brewer wrote:
  Yes; you have to transcode to the correct encoding. Once.
  Then every other WSGI application interface below that one
  doesn't have to care.
 
 You can only do that if you *break encapsulation*, which as I said
 earlier is voiding the entire point of having a modular interface.

Requiring one component to run before another to achieve a correct
result does not void modularity. Unix pipes employ a modular interface,
but cat /etc/fstab | wc | head produces a very different result than
cat /etc/fstab | head | wc. In such a system, encapsulation requires
that the components not share state, but rather trust that they are
composed correctly (yes, by some invisible hand) and that the given
input is the intended one, even if that means a previous component
transformed it.

If, on the other hand, only utf-8-decoded strings can be passed as input
to each WSGI component, then each WSGI component must be prepared to
re-decode its inputs; in that case, each must be configured identically
with the same logic to determine the correct decoding, since the correct
decoding does not differ from one component to the next. That repeated
configuration of the correct decoding is shared state, and breaks
encapsulation; one-time transformation of inputs is not and does not.
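
A rough sketch of the one-time transformation described above, assuming a Python 3 server that hands over latin-1 decoded native strings. TranscodePath and show_path are hypothetical names, and rewriting SCRIPT_NAME / PATH_INFO in place like this is an illustration of the proposal, not something any spec endorses:

```python
class TranscodePath:
    """Hypothetical middleware: re-decode SCRIPT_NAME / PATH_INFO once."""

    def __init__(self, app, charset="utf-8"):
        self.app = app
        self.charset = charset

    def __call__(self, environ, start_response):
        for key in ("SCRIPT_NAME", "PATH_INFO"):
            value = environ.get(key, "")
            # Recover the original bytes, then decode with the real charset.
            environ[key] = value.encode("latin-1").decode(self.charset)
        return self.app(environ, start_response)


def show_path(environ, start_response):
    """Downstream component: trusts that its input is already transcoded."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [environ["PATH_INFO"].encode("utf-8")]


app = TranscodePath(show_path)
```

Stacking TranscodePath twice would decode twice and fail or garble non-ASCII paths, which is the ordering dependency being argued about here.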

 Having a configurable encoding just means that *every* WSGI
 application *must* verify the encoding in order to be safe.

No, each can trust its inputs and do its intended job instead, if your
idempotency requirement is relaxed.

 I'm all
 in favor of making everyone suffer equally, but all else being equal,
 I'd prefer them to suffer idempotently rather than conditionally.  ;-)

I know you do, but I don't see the community following your lead in that
preference. Any middleware that alters the environ breaks idempotency.
Any middleware that alters the output breaks idempotency. Most routing
middleware breaks idempotency. There's a lot of all of those already in
the wild.

CherryPy doesn't care, because we marginalized WSGI middleware into near
obscurity. We did that in large part because of the idempotency
requirements of WSGI 1.0. We may have the only routing middleware that
you could mistakenly put in your stack twice and get the same result! So
I'm not fighting for myself/my framework on this; surrogateescape would
work just fine for us since we ship very little middleware.

But I don't think it would work fine for Paste, Pylons, Turbogears,
Repoze, etcetera etcetera who have lots of WSGI middleware to port and
more they want to build, and have been chafing for years now against
this requirement. I believe they want full unicode SCRIPT_NAME and
PATH_INFO, and would prefer a single, new, modular WSGI component be
inserted in their component graphs than to build that logic into every
WSGI component. They already have to deal with correct ordering in their
WSGI component graphs, because they've already abandoned strict
idempotency. Ben, Ian, Mark, Chris, et al, please confirm or deny that;
I could be way off base.


Robert Brewer
fuman...@aminus.org



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread And Clover

Alan Kennedy wrote:


Why can't we just do the same as the java servlet spec?


Because Servlet is a walking, stinking demonstration of how *not* to 
handle encodings.


Every servlet container has its own different method of selecting input 
character sets, and the default encoding is almost never right. Most 
deployed JSP applications out there are using the wrong charset and do 
the wrong thing with any non-ASCII character. This is not something to 
aim for.


Pushing the choice of encodings out to a 'deployment issue' where the 
application doesn't get to decide is a Wrong Thing. I hate dealing with 
this nonsense in Java and I do not want the same approach to become 
common in Python.


 I see this as being the same as Graham's suggested approach of a
 per-server configurable charset

This is absolutely the opposite of what I want as an application author. 
I want to hand out my WSGI application that uses UTF-8 and know that 
wherever it is deployed the non-ASCII characters will go through without 
getting mangled.


The application (perhaps via a framework it is using) is the party that 
is in the best place to know what character encoding it wants to deal 
with. Give the application a consistent way to handle that encoding 
itself, because the poor sod deploying it isn't going to know any better.


 Those frameworks obviously have solved all of the problems of decoding
 incoming request components from miscellaneous unknown character sets
 into unicode, without any mistakes

Er, no. That's the point. It cannot currently be done in all deployment 
environments. When they're not running via their own built-in servers, 
the frameworks have to do the same as the rest of us: guess.


That guess may not be as troublesome as it is in Java (mainly because 
for us it doesn't affect QUERY_STRING parameters), but it's still not 
reliable, which is why today you can't have a WSGI application with 
pretty non-ASCII URLs that will deploy consistently. I want this fixed.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Armin Ronacher
Hi,

And Clover schrieb:
 This is absolutely the opposite of what I want as an application author. 
 I want to hand out my WSGI application that uses UTF-8 and know that 
 wherever it is deployed the non-ASCII characters will go through without 
 getting mangled.
I could not agree more.

Probably the best way is indeed to use native strings for each Python
version. Where native strings are unicode, the server should latin1-decode
them, and SCRIPT_NAME / PATH_INFO will be called
wsgi.raw_script_name and wsgi.raw_path_info and be properly quoted.
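
A minimal sketch of what an application could do with such a key, assuming a server that supplies the proposed wsgi.raw_path_info as a still percent-quoted native string. The key is only the proposal from this thread, and decoded_path is a hypothetical helper:

```python
from urllib.parse import unquote


def decoded_path(environ, charset="utf-8"):
    """Unquote the raw path with the charset the application expects."""
    raw = environ.get("wsgi.raw_path_info", "")
    # The quoted form is pure ASCII, so the application alone decides
    # how the percent-encoded bytes are interpreted.
    return unquote(raw, encoding=charset, errors="strict")


utf8_environ = {"wsgi.raw_path_info": "/caf%C3%A9"}
latin1_environ = {"wsgi.raw_path_info": "/caf%E9"}

utf8_result = decoded_path(utf8_environ)
latin1_result = decoded_path(latin1_environ, charset="latin-1")
```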


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Massimo Di Pierro

+1

On Sep 22, 2009, at 10:45 AM, Armin Ronacher wrote:


Hi,

And Clover schrieb:
This is absolutely the opposite of what I want as an application author.
I want to hand out my WSGI application that uses UTF-8 and know that
wherever it is deployed the non-ASCII characters will go through without
getting mangled.

I could not agree more.

Probably the best way is indeed using native strings for each Python
version, where native strings are unicode the server should latin1
decode it and SCRIPT_NAME / PATH_INFO will be called
wsgi.raw_script_name and wsgi.raw_path_info and be properly quoted.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread And Clover

Graham wrote:

 Armin is fast asleep now, so it's my shift.

Heh. It's a multiple-man job keeping up with this monster thread!


The URLs don't break.


Not in themselves. Just the language of the PEP implies that to fix them 
up would contravene the spec:


 The application MUST use [the encoding guess for PATH_INFO] to decode
 the ``'QUERY_STRING'`` as well.

This isn't appropriate even as a SHOULD: the guessed encoding for 
PATH_INFO is very likely to be wrong, in particular for cases where the 
path was purely ASCII.


The application (or a library/framework acting on its behalf) should be 
allowed to decode QUERY_STRING using whatever encoding it is expecting. 
Disallowing using anything other than utf-8 (and iso-8859-1 in a very 
unreliable way) makes it impossible to have queries in any other 
encoding at all and still comply with the spec, which is undesirable.
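
As an illustration of decoding QUERY_STRING with whatever encoding the application expects, here is a sketch using urllib.parse from the Python 3 standard library. The Shift_JIS form and the parameter name q are made up for the example:

```python
from urllib.parse import parse_qs, quote

# Build the query string a browser might send for a Shift_JIS form.
query_string = "q=" + quote("日本", encoding="shift_jis")

# The application, not the server, picks the charset for decoding.
params = parse_qs(query_string, encoding="shift_jis", errors="strict")
```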


If this sentence is removed, and `wsgi.uri_encoding` is guaranteed to be 
one of:


  a. definitive and reliable, or
  b. missing/None

I'm pretty much happy. What I don't want is that half the future-WSGI 
servers/gateways decide they have to provide *some* value for 
`wsgi.uri_encoding` even if they're not quite sure if it's the right 
one. Then we're back to square one.



if it is known that an application or some subset of
URLs will always be receiving a request as non UTF-8, then it should
employ code in those cases to always transcode it to the required
encoding.


Yep, agreed. I think the PEP should clarify that; at the moment it is 
saying that a transcode is something you should only do for the 
iso-8859-1 case, but if you actually followed that advice you'd get 
highly inconsistent results. Perhaps we're at cross-purposes as to what 
exactly constitutes 'middleware'...



The other fallback is that a specific WSGI server could elect to
provide an option to not use 'UTF-8' as the first choice for decoding


I really, *really* hope this does not happen. That just brings us more 
deployment heartaches.



Whether surrogateescape gives a better solution I have no idea at this
point


Yeah... I'm highly suspicious of surrogateescape in a web context and 
personally my code will be deliberately filtering all such characters 
out. I can see it being a possible way to smuggle unwanted sequences 
(such as overlongs) through filters, potentially causing endless 
security problems. But we'll see...
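
A sketch of the kind of filtering meant here, assuming values were decoded with the utf-8 surrogateescape handler (which maps undecodable bytes to U+DC80 through U+DCFF). The function names are hypothetical:

```python
def has_escaped_bytes(text):
    """True if text holds lone surrogates produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def reject_escaped(text):
    """Refuse any value that carries smuggled, undecodable bytes."""
    if has_escaped_bytes(text):
        raise ValueError("undecodable bytes in request value")
    return text


# An overlong encoding of "/" is invalid UTF-8, so its bytes survive
# decoding as surrogates and would reappear intact on re-encoding.
overlong_slash = b"\xc0\xaf"
smuggled = overlong_slash.decode("utf-8", "surrogateescape")
```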


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread René Dudfield
On Tue, Sep 22, 2009 at 3:07 PM, P.J. Eby p...@telecommunity.com wrote:
 At 11:30 AM 9/22/2009 +0100, Alan Kennedy wrote:

 I see this as being the same as Graham's suggested approach of a
 per-server configurable charset, which is then stored in the WSGI
 dictionary, so that applications that have problems, i.e. that detect
 mojibake in the unicode SCRIPT_NAME or PATH_INFO, can attempt to undo
 the faulty decoding by the server.

 This puts the burden on the wrong end of the pipe: there are more apps than
 servers and they would *all* have to check this in order to be sane.


Except most everyone is using unicode in their apps already through frameworks.

If web clients, the HTTP RFCs (and most other internet protocols),
Python, Python frameworks, and other languages' frameworks are all
moving towards unicode, why should WSGI not?


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread Philip Jenvey


On Sep 22, 2009, at 2:28 AM, Armin Ronacher wrote:


Hi,

Alan Kennedy schrieb:
2. Give the programmer (possibly mojibake) unicode strings in the WSGI
environ anyway
3. And let them solve their problems themselves, using server
configuration or bespoke middleware

Because that problem was solved long ago in the applications themselves.
Webob, Werkzeug, Paste, Pylons, Django, you name it, all are operating
on unicode.  And the way they do that is straightforward.


Werkzeug/WebOb/Paste all seem to have standardized on: return unicode,  
lazily decoded via a default encoding which can be overridden by the  
app via some API.


The Java servlet spec actually defines a  
ServletRequest.setCharacterEncoding(String enc) method, which lets the  
app override the encoding of the body/params (though not the URL on  
some containers), as long as it's done before the body is read. Pretty  
much what said Python wrappers are doing.


Now currently what we have to do on Python 3 is to encode the data again
and decode it with the target charset.  Unnecessary roundtrips that just
slow the whole thing down.  What for?


Because our request container is a plain, pre-fabricated dict that  
doesn't permit the lazy behavior.


--
Philip Jenvey


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread P.J. Eby

At 05:12 PM 9/22/2009 -0700, Philip Jenvey wrote:

Because our request container is a plain, pre-fabricated dict that
doesn't permit the lazy behavior.


Not quite true; you can always write a library function, 
get_foo(environ) that does the lazy caching in a private environ key, 
at the cost of also caching the original value and doing a consistency check.
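
A sketch of such a library function, assuming latin-1 decoded native strings in the environ. The name get_path and the private key myapp.decoded_path are hypothetical:

```python
_CACHE_KEY = "myapp.decoded_path"  # hypothetical private environ key


def get_path(environ, charset="utf-8"):
    """Lazily decode PATH_INFO, caching the result in the environ."""
    original = environ.get("PATH_INFO", "")
    cached = environ.get(_CACHE_KEY)
    # Consistency check: the cache is only valid while PATH_INFO is
    # unchanged (middleware may have rewritten it since the last call).
    if cached is not None and cached[0] == original:
        return cached[1]
    decoded = original.encode("latin-1").decode(charset)
    # Store the original alongside the decoded value for the next check.
    environ[_CACHE_KEY] = (original, decoded)
    return decoded
```

The decode cost is paid only on first access, or again if something upstream changes PATH_INFO.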




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Armin Ronacher
Hi,

Robert Brewer schrieb:
 urllib.unquote, for one. We had to make a version which accepts bytes
 (and outputs bytes). But it's only 8 lines of code.
Here a patch for urllib.parse that restores Python 2.x behavior.
Because it also changes behavior for Python 3.x I have not yet submitted
it for discussions: http://paste.pocoo.org/show/140739/

This adds byte support for all unquoting functions and URL parsing and
joining.  It also changes the quoting functions to return bytes when
passed bytes.  The latter is something that most likely does not survive
a review on python-dev.
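
For comparison, Python 3's urllib.parse already has two bytes-aware helpers, unquote_to_bytes and quote_from_bytes. This sketch shows only those stdlib calls, not the patch linked above, and the sample path is made up:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Unquote without committing to any charset: the result is raw bytes.
raw = unquote_to_bytes("/caf%E9")

# Quote bytes back into an ASCII-only str (note: str, not bytes,
# unlike the behavior the patch proposes for bytes input).
quoted = quote_from_bytes(raw)
```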

Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread James Bennett
On Sun, Sep 20, 2009 at 11:25 PM, Chris McDonough chr...@plope.com wrote:
 WSGI is a fairly low-level protocol aimed at folks who need to interface a
 server to the outside world.  The outside world (by its nature) talks bytes.
  I fear that any implied conversion of environment values and iterable
 return values to Unicode will actually eventually make things harder than
 they are now.  I realize that it would make middleware implementors' lives
 harder to need to deal in bytes.  However, at this point, I also believe
 that middleware kinda should be hard.  We have way too much middleware that
 shouldn't be middleware these days (some written by myself).

Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an
interface to HTTP should deal in bytes as well.

The problem, really, is that despite being a very low-level interface,
WSGI has a tendency to leak up into much higher-level code, and (IMO)
authors of that high-level code really shouldn't have to waste their
time dealing with details of the underlying low-level gateway.

You've said you don't want to hear Python 3 as the reason, but it
provides some useful examples: in high-level code you'll commonly want
to be doing things like, say, comparing parts of the requested URL
path to known strings or patterns. And that high-level code will
almost certainly use strings, while WSGI, in theory, will be using
bytes. That's just a recipe for disaster; if WSGI mandates bytes, then
bytes will have to start infecting much higher-level code (since
Python 3 -- rightly -- doesn't let you be nearly as promiscuous about
mixing bytes and strings).

Once I'm at a point where I can use Python 3, I know I'll personally
be looking for some library which will normalize everything for me
before I interact with it, precisely to avoid this sort of leakage; if
WSGI itself would at least *allow* that normalization to happen at the
low level (mandating it is another discussion entirely) I'd feel much
happier about it going forward.
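
A small illustration of that leakage on Python 3, assuming default interpreter flags. The path values are made up:

```python
path_bytes = b"/blog/2009"   # what a bytes-only gateway would hand over
path_text = "/blog/2009"     # what high-level code naturally writes

# Comparing bytes with str is never equal; it just quietly fails to match.
same = (path_bytes == path_text)

# Mixing the two types in string operations raises instead.
try:
    path_bytes.startswith("/blog")
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The high-level code only works once something has decoded the value.
decoded_match = path_bytes.decode("ascii").startswith("/blog")
```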


-- 
Bureaucrat Conrad, you are technically correct -- the best kind of correct.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Armin Ronacher
Hi,

James Bennett schrieb:
 Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an
 interface to HTTP should deal in bytes as well.
If it was just that I would be happy to stay with bytes.  But unless the
standard library changes in the way it works on Python 3 there is not
much but unicode we can use.  bytes no longer behave like strings, it's
not very comfortable to work with them.

Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread James Bennett
On Mon, Sep 21, 2009 at 1:28 AM, Armin Ronacher
armin.ronac...@active-4.com wrote:
 If it was just that I would be happy to stay with bytes.  But unless the
 standard library changes in the way it works on Python 3 there is not
 much but unicode we can use.  bytes no longer behave like strings, it's
 not very comfortable to work with them.

Indeed. Hence my comments about WSGI leaking up into other code. Now
that bytes and strings are incompatible, a lot of code which relied on
(arguably) a wart in Python will break.


-- 
Bureaucrat Conrad, you are technically correct -- the best kind of correct.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Chris McDonough

OK, after some consideration, I think I'm sold.

Answering my own original question about why unicode seems to make sense as 
values in the WSGI environment even without consideration for Python 3 
compatibility:  *something* needs to do this translation.  Currently I 
personally rely on WebOb to do a lot of this translation.  I can't think of a 
good reason that implementations at the level of WebOb would each need to do 
this translation work; pushing the job into WSGI itself seems to make sense 
here.  This is particularly true for PATH_INFO and QUERY_STRING; these days 
it's foolish to assume these values will be entirely composed of low order 
characters, and thus being able to access them as bytes natively isn't very useful.


OTOH, I suspect the Python 3 stdlib is still broken if it requires native 
strings in various places (and prohibits the use of bytes).


James Bennett wrote:

On Sun, Sep 20, 2009 at 11:25 PM, Chris McDonough chr...@plope.com wrote:

WSGI is a fairly low-level protocol aimed at folks who need to interface a
server to the outside world.  The outside world (by its nature) talks bytes.
 I fear that any implied conversion of environment values and iterable
return values to Unicode will actually eventually make things harder than
they are now.  I realize that it would make middleware implementors' lives
harder to need to deal in bytes.  However, at this point, I also believe
that middleware kinda should be hard.  We have way too much middleware that
shouldn't be middleware these days (some written by myself).


Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an
interface to HTTP should deal in bytes as well.

The problem, really, is that despite being a very low-level interface,
WSGI has a tendency to leak up into much higher-level code, and (IMO)
authors of that high-level code really shouldn't have to waste their
time dealing with details of the underlying low-level gateway.

You've said you don't want to hear Python 3 as the reason, but it
provides some useful examples: in high-level code you'll commonly want
to be doing things like, say, comparing parts of the requested URL
path to known strings or patterns. And that high-level code will
almost certainly use strings, while WSGI, in theory, will be using
bytes. That's just a recipe for disaster; if WSGI mandates bytes, then
bytes will have to start infecting much higher-level code (since
Python 3 -- rightly -- doesn't let you be nearly as promiscuous about
mixing bytes and strings).

Once I'm at a point where I can use Python 3, I know I'll personally
be looking for some library which will normalize everything for me
before I interact with it, precisely to avoid this sort of leakage; if
WSGI itself would at least *allow* that normalization to happen at the
low level (mandating it is another discussion entirely) I'd feel much
happier about it going forward.






Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 7:28 AM, Armin Ronacher
armin.ronac...@active-4.com wrote:
 Hi,

 James Bennett schrieb:
 Well, ordinarily I'd be inclined to agree: HTTP deals in bytes, so an
 interface to HTTP should deal in bytes as well.
 If it was just that I would be happy to stay with bytes.  But unless the
 standard library changes in the way it works on Python 3 there is not
 much but unicode we can use.  bytes no longer behave like strings, it's
 not very comfortable to work with them.


I think http traffic is increasingly utf-8 these days.  Also, most
upper-level frameworks use unicode natively.

So it makes sense to use utf-8 natively, as an option.


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough chr...@plope.com wrote:

 OTOH, I suspect the Python 3 stdlib is still broken if it requires native
 strings in various places (and prohibits the use of bytes).

Yes, the python3 stdlib should support 'str' (the old unicode), 'buffer'
and 'bytes' for web-related stuff.  Buffer is important because it's a
type also used for sockets(along with bytes) and it allows less memory
allocation (because you can reuse buffers).

cheers,


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Georg Brandl
René Dudfield schrieb:
 On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough 
 chrism-ccarnewbnkgavxtiumw...@public.gmane.org wrote:

 OTOH, I suspect the Python 3 stdlib is still broken if it requires native
 strings in various places (and prohibits the use of bytes).
 
 yes, python3 stdlib should support 'str'(the old unicode), 'buffer'
 and 'bytes' for web using stuff.  Buffer is important because it's a
 type also used for sockets(along with bytes) and it allows less memory
 allocation (because you can reuse buffers).

Please don't confuse readers and use the correct name, i.e. 'bytearray'
instead of 'buffer'.

Georg



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 9:46 AM, Georg Brandl g.bra...@gmx.net wrote:
 René Dudfield schrieb:
 On Mon, Sep 21, 2009 at 8:10 AM, Chris McDonough 
 chrism-ccarnewbnkgavxtiumw...@public.gmane.org wrote:

 OTOH, I suspect the Python 3 stdlib is still broken if it requires native
 strings in various places (and prohibits the use of bytes).

 yes, python3 stdlib should support 'str'(the old unicode), 'buffer'
 and 'bytes' for web using stuff.  Buffer is important because it's a
 type also used for sockets(along with bytes) and it allows less memory
 allocation (because you can reuse buffers).

 Please don't confuse readers and use the correct name, i.e. 'bytearray'
 instead of 'buffer'.

 Georg


Let me try and reduce the confusion...

There are two different python types the py3k socket module uses:
'bytes' and 'buffer'.  'bytes' is kind of like str in python3... but
with reduced functionality (no formatting, fewer methods etc).  buffer
is a Py_buffer from the C API.

buffer, and bytes in socket:
http://docs.python.org/3.1/library/socket.html#socket.socket.recvfrom_into
bytearray: http://docs.python.org/3.1/library/functions.html#bytearray
bytes: http://docs.python.org/3.1/library/functions.html#bytes
buffer: http://docs.python.org/3.1/c-api/buffer.html

This is separate, but related to the point of bytes vs unicode.  It is
really (bytes and buffer) vs unicode - since bytes and buffer can be
used with socket.  socket never uses a python2 'unicode', or a python3
'str' type.
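
A short sketch of the buffer-reuse point, assuming a platform where socket.socketpair() is available. A bytearray stands in for the writable buffer, and the request line is just sample data:

```python
import socket

left, right = socket.socketpair()
try:
    # Sockets accept bytes (or other buffer objects), never str.
    left.sendall(b"GET / HTTP/1.1\r\n")

    # Receive into a preallocated, reusable buffer instead of
    # allocating a fresh bytes object for every read.
    buf = bytearray(64)
    received = right.recv_into(buf)
    line = bytes(buf[:received])
finally:
    left.close()
    right.close()
```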


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote:
 At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote:

 Anyway, for us slower (and maybe wrongly fearful) folks, could someone
 summarize the benefits of having a WSGI specification that requires Unicode.
 Bonus points for an explanation that does not boil down to it will be
 compatible with Python 3.

 +1.  I'd really rather not have the spec dictated by the need to work around
 problems in the stdlib or language definition.  Better to fix them ASAP.


hi,

here is a summary:
Apart from python3 compatibility (which should be a good enough
reason), utf-8 is what's used in http a lot these days.  Most things
layered on top of wsgi are using utf-8 (django etc), and lots of web
clients are using utf-8 (firefox etc).

Why not move to unicode?


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 4:27 PM, James Bennett ubernost...@gmail.com wrote:
 On Mon, Sep 21, 2009 at 10:19 AM, P.J. Eby p...@telecommunity.com wrote:
 +1.  I'd really rather not have the spec dictated by the need to work around
 problems in the stdlib or language definition.  Better to fix them ASAP.

 This is a *Python* web server gateway interface, yes? Fixing stdlib
 bugs is fine, but asking for the language to change just to make
 gateway interfaces a bit easier to write seems a bit much; I'd hope we
 can take Python the language as granted, and work from there.



Hi,

I mostly agree...  However, python3.x changes are still up for
grabs... so if there's a good enough reason, now is the time to ask
for changes.  I don't see them changing the way unicode, strings and
bytes work too much though.

cheers,


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 04:30 PM 9/21/2009 +0100, René Dudfield wrote:

On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote:
 At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote:

 Anyway, for us slower (and maybe wrongly fearful) folks, could someone
 summarize the benefits of having a WSGI specification that requires
 Unicode.

 Bonus points for an explanation that does not boil down to "it will be
 compatible with Python 3".

 +1.  I'd really rather not have the spec dictated by the need to work
 around problems in the stdlib or language definition.  Better to fix
 them ASAP.


hi,

here is a summary:
Apart from Python 3 compatibility (which should be a good enough
reason), UTF-8 is what's used in HTTP a lot these days.  Most things
layered on top of WSGI are using UTF-8 (Django etc.), and lots of web
clients are using UTF-8 (Firefox etc.).


Since WSGI is based on HTTP, please cite RFCs, not applications.  Thanks.



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 4:42 PM, P.J. Eby p...@telecommunity.com wrote:
 At 04:30 PM 9/21/2009 +0100, René Dudfield wrote:

 On Mon, Sep 21, 2009 at 4:19 PM, P.J. Eby p...@telecommunity.com wrote:
  At 12:25 AM 9/21/2009 -0400, Chris McDonough wrote:
 
  Anyway, for us slower (and maybe wrongly fearful) folks, could someone
  summarize the benefits of having a WSGI specification that requires
  Unicode.
  Bonus points for an explanation that does not boil down to "it will be
  compatible with Python 3".
 
  +1.  I'd really rather not have the spec dictated by the need to work
  around problems in the stdlib or language definition.  Better to fix
  them ASAP.
 

 hi,

 here is a summary:
 Apart from Python 3 compatibility (which should be a good enough
 reason), UTF-8 is what's used in HTTP a lot these days.  Most things
 layered on top of WSGI are using UTF-8 (Django etc.), and lots of web
 clients are using UTF-8 (Firefox etc.).

 Since WSGI is based on HTTP, please cite RFCs, not applications.  Thanks.



Hi,

That seems a strange thing to say.  HTTP use is based on not only RFCs
but real applications.  Web Server Gateway Interface is not just about
HTTP obviously, and talks about python and web server issues... it
hardly restricts itself to HTTP.

See IRIs:  http://www.w3.org/International/O-URL-and-ident.html
Which links to a number of things including rfc2718, which specifies
utf-8 for URIs: http://www.ietf.org/rfc/rfc2718.txt

Character encoding section:
Unless there is some compelling reason for a particular scheme to
do otherwise, translating character sequences into UTF-8 (RFC 2279)
[3] and then subsequently using the %HH encoding for unsafe octets is
recommended.

Which seems sensible.
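
As a small illustration of the recommendation quoted above: Python 3's
standard urllib.parse.quote does exactly this by default, encoding
characters to UTF-8 and then writing the unsafe octets as %HH.  The
sample path is made up for the example.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"        # "/café", a made-up example path

# Characters -> UTF-8 octets -> %HH escapes; "/" is left alone by default.
encoded = quote(path)
print(encoded)             # /caf%C3%A9

# unquote() reverses both steps, again assuming UTF-8.
decoded = unquote(encoded)
print(decoded == path)     # True
```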

Having fallback to the raw bytes available also seems sensible.  For
the reasons discussed in previous posts.



cheers,


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Ian Bicking
On Sun, Sep 20, 2009 at 8:06 AM, Armin Ronacher
armin.ronac...@active-4.com wrote:
 Thanks to Graham Dumpleton and Robert Brewer there is some serious
 progress on WSGI currently.  I proposed a roadmap with some PEP changes
 now that need some input.

 Summary:

  WSGI 1.0       stays the same as PEP 0333 currently is
  WSGI 1.1       becomes what Ian and I added to PEP 0333
  WSGI 2.0       becomes a unicode powered version of WSGI 1.1
  WSGI 3.0       becomes WSGI 2.0 just without start_response

  WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python
  3 because of changes in the standard library that no longer work with
  a byte-only approach.

1.1 I think of as an errata on 1.0, so... simple enough.

I was skeptical about a unicode version of WSGI, but I think I'm okay
with it now.  For people who use UTF-8-only it should be fairly simple
and easy; for people who want to deal with other encodings, backward
compatible URLs, or other weirdness I think surrogateescape can
resolve the small handful of problems.  Maybe an option to use latin1
(at the server level) would do the same for Python 2, as a deployment
option for people who are dealing with these tricky issues.  Which is
kind of lame, but it means everything is still *possible*, and the use
cases are somewhat obscure.  Especially because QUERY_STRING and
wsgi.input remain bytes.  (Well, I guess the other case would be
someone reading a cookie set by an application they do not control,
and set in a crazy way... but anyway, there's a handful of use cases
where things get tricky, but we can kind of punt, or try to implement
the necessary transcoding routines before the spec is final.)  I'm
very much opposed to a second raw version of the request, as I do
not like redundancy.

With respect to 3.0/start_response, I'd rather we just do both at
once, so there's not so many versions of WSGI to worry about.  Also it
doesn't feel like a very difficult change to make.

The only other major issue is wsgi.input, which is a quite awkward
interface to the request body.  But I think resolving that is harder
than start_response, in particular because there's no clear solution.
Maybe at least switching to a file interface would be better.

-- 
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Robert Brewer
P.J. Eby wrote:
 At 07:57 AM 9/21/2009 +0200, Armin Ronacher wrote:
 Chris McDonough schrieb:
   Personally, I find it a bit hard to get excited about Python 3 as a
   web application deployment platform.
  Everybody feels that way currently.  But if we don't fix WSGI that
  will never change.
 
 This is only compounding the errors introduced by the "make the tests
 pass" philosophy of porting the stdlib.  We should not make them
 worse.
 
 At the moment (AFAIK) nobody has gone through the web bits of the
 stdlib and asked, "Should this work on strings, bytes, or both, and
 if both, how should that API be expressed?"

Perhaps not, but I wrote unquote_bytes at PyCon 2009, after discussing
urllib in the python-dev room and being told no bytes-compatible version
was desired in the stdlib. So *some* thought has gone into it.


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Henry Precheur
On Mon, Sep 21, 2009 at 11:09:24AM -0500, Ian Bicking wrote:
 I think surrogateescape can resolve the small handful of problems.

+1

surrogateescape would be a great alternative to the "try utf-8 then
latin-1" approach. It would simplify the gateway and the application. No
need to check some 'encoding' variable and transcode later. We just
encode everything to UTF-8, no special case.

surrogateescape isn't implemented (yet?) for Python 2. That's not an
issue if the 'new' WSGI sticks to native strings.

-- 
  Henry Prêcheur


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread And Clover

Armin Ronacher wrote:


The middleware can never know.


It's much more likely to know than the server, though!

 WSGI will demand UTF-8 URLs and only
 provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8
URLs break as soon as they coincidentally happen to be UTF-8 byte
sequences. I'm as much an advocate of "UTF-8 for everything everywhere!"
as anyone else, but unfortunately today there are still dark places
where you need non-UTF-8 URLs.


Incidentally, if wsgi.uri_encoding is going to be the way to signal that 
the server has decoded bytes to characters using a known encoding, it 
should be stressed that this should only be set when that encoding is 
certain.


That is, wsgi.uri_encoding should be omitted (or None?) in cases where 
another party has already decoded (and maybe mangled) the bytes using an 
unknown encoding. In particular, CGI.


(In the case of Windows CGI the server will have decoded URI bytes into 
Unicode characters, using a charset which it is impossible to find out. 
In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid 
UTF sequence, otherwise it's the system codepage. This problem affects 
the non-CGI implementation isapi_wsgi, too. Then the variables are read 
as environment variables, which for Python 2 means another encode/decode 
step on Windows using the system codepage, mangling non-codepage 
characters. Python 3 has the opposite problem reading byte envvars using 
UTF-8, which won't be how Apache put them there.)


If wsgi.uri_encoding is obligatory then in reality it will often be wrong,
leaving us in the same pathetic predicament as with WSGI 1.0, where
non-ASCII URIs don't work reliably at all.
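
A minimal sketch of what an optional key would look like from the
application side, assuming the proposed wsgi.uri_encoding key is simply
absent (or None) when the gateway cannot vouch for the decoding.  The
helper name path_info_bytes is hypothetical.

```python
def path_info_bytes(environ):
    """Return the original PATH_INFO bytes, or None if they are unknowable."""
    encoding = environ.get('wsgi.uri_encoding')   # proposed key, may be missing
    if encoding is None:
        # Someone upstream (e.g. a CGI host) decoded with an unknown
        # charset, so there is no reliable way back to the bytes.
        return None
    return environ['PATH_INFO'].encode(encoding)
```

An application that gets None back can only work with the characters it
was handed; it cannot safely re-decode them.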


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Robert Brewer
René Dudfield wrote:
 On Mon, Sep 21, 2009 at 6:05 PM, Robert Brewer fuman...@aminus.org
 wrote:
  Armin Ronacher wrote:
  WSGI will demand UTF-8 URLs and only
  provide iso-XXX support for backwards compatibility.
 
  WSGI cannot demand that; a recommendation for utf-8 in a few draft
  specifications is at least a decade removed from ubiquitous
  implementation. We can default to utf-8 at best. I discussed this at
  length in
  http://mail.python.org/pipermail/web-sig/2009-August/003948.html
 
 
 that post does have good arguments why a single encoding is not
 acceptable.  utf-8 seems the most common at this point to be the
 default... but we do need a way to specify encoding.
 
 Is that what you're saying Robert?  Do you have a suggestion for
 specifying encodings?

CherryPy 3.2 does this (pseudocode):

try:
    decode_uri(userdefault or 'utf-8')
except UnicodeDecodeError:
    decode_uri('iso-8859-1')
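
A runnable sketch of the same fallback, for illustration only.  This is
not CherryPy's actual code: the signature of decode_uri, the
user_default parameter and the returned tuple are assumptions made for
the example.

```python
def decode_uri(raw, user_default=None):
    """Decode raw URI bytes; return (text, encoding_that_worked)."""
    encoding = user_default or 'utf-8'
    try:
        return raw.decode(encoding), encoding
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode('iso-8859-1'), 'iso-8859-1'
```

The second element of the tuple is what a server would publish as
'wsgi.uri_encoding'.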

 I think surrogateescape will handle the issues with allowing bytes to
 be stored in utf-8.
 http://www.python.org/dev/peps/pep-0383/
 
 However, I think that is only implemented in Python 3.1... but maybe
 there is some way to have it work on other Pythons too?

As Henry Prêcheur says, that's not an issue if the 'new' WSGI sticks to native 
strings. Which I'd be happy with.

 How about...
 
 Being able to request which encoding you want has the benefit of only
 having to store one representation before 'baking' the result into the
 environ.  So if someone only ever wants utf-8 they can get it...
 however if they choose to 'bake' the environ then they can request
 something else.  This is similar to a per server setting, but I think
 should work with middleware too?

As noted above, it *is* a per-server setting in CherryPy 3.2. And any 
middleware can certainly be configured as its authors see fit; I don't see a 
need for a generic mechanism to specify what encodings middleware should try. 
However, we still need a generic mechanism declaring which encoding was 
successfully used; this is 'wsgi.uri_encoding'.

 As multiple things should be
 available, and if baked middleware (if it wants to modify things, will
 need to change each version of things).
 
 These 'baking' methods could live in wsgi to simplify modifying the
 environs multiple versions of things. It would just have some get/set
 functions to put correct handling of encodings in one place.  Of
 course middleware is still free to change things as it wants.

I still don't see why the environ should have multiple versions of anything. 
It's not as if the HTTP request gives us multiple Request-URI's. There's a 
single processing step that has to happen somewhere: decoding the bytes of the 
Request-URI to unicode. For the vast majority of apps, it should only happen 
once. Twice is acceptable to me for some apps. As I pointed out in the linked 
email, doing that as soon as possible (i.e. in the WSGI origin server) allows 
URI's to be compared as character strings more easily. If you deploy a piece of 
middleware that transcodes (based on more information than servers want to deal 
with), it had better be nearly first in the stack so routing works reliably.


Robert Brewer
fuman...@aminus.org




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
I still don't see why the environ should have multiple versions of 
anything. It's not as if the HTTP request gives us multiple 
Request-URI's. There's a single processing step that has to happen 
somewhere: decoding the bytes of the Request-URI to unicode. For the 
vast majority of apps, it should only happen once. Twice is 
acceptable to me for some apps. As I pointed out in the linked 
email, doing that as soon as possible (i.e. in the WSGI origin 
server) allows URI's to be compared as character strings more 
easily. If you deploy a piece of middleware that transcodes (based 
on more information than servers want to deal with), it had better 
be nearly first in the stack so routing works reliably.


The problem with this whole approach is that it's not 
composable.  You can't stick in an application under a router that 
uses a different method for grokking its subtree of the URI space, 
unless it knows what's been done to the URI and can un-do it.


Maybe I'm missing something here, but the only way I see to preserve 
composability here is to use latin-1 or bytes.


The fundamental problem is that, like it or not, HTTP headers are 
actually byte strings.  The *only* reason we ever supported unicode 
in WSGI was to handle platforms where there's no such thing as a 
non-unicode string, and there we made it explicit that it's just a 
way of manipulating *bytes*, not unicode.


ISTM that very few (if any) of the proposals floating around for 
modifying WSGI are taking this concept into account.  Most of them 
sound to me like people saying, "yeah, but this particular hack will
work for *my* apps...  so everybody else must be doing something stupid."


But WSGI was built on the principle of *equally inconveniencing 
everyone*, specifically to avoid an impossible attempt at consensus 
between incompatible ways of doing things.  (E.g., nine million 
request/response APIs.)


So, if the only problem we're going to cause by using bytes 
everywhere is to make everyone need to change their routing code on 
Python 3, I vote +1000.  ;-)




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread René Dudfield
On Mon, Sep 21, 2009 at 8:31 PM, P.J. Eby p...@telecommunity.com wrote:
 At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:

 I still don't see why the environ should have multiple versions of
 anything. It's not as if the HTTP request gives us multiple Request-URI's.
 There's a single processing step that has to happen somewhere: decoding the
 bytes of the Request-URI to unicode. For the vast majority of apps, it
 should only happen once. Twice is acceptable to me for some apps. As I
 pointed out in the linked email, doing that as soon as possible (i.e. in the
 WSGI origin server) allows URI's to be compared as character strings more
 easily. If you deploy a piece of middleware that transcodes (based on more
 information than servers want to deal with), it had better be nearly first
 in the stack so routing works reliably.

 The problem with this whole approach is that it's not composable.  You can't
 stick in an application under a router that uses a different method for
 grokking its subtree of the URI space, unless it knows what's been done to
 the URI and can un-do it.


It seems latin-1 has the same problem.  If middleware makes an
arbitrary change, how can later things know what it's done?


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Robert Brewer
P.J. Eby wrote:
 At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
 I still don't see why the environ should have multiple versions of
 anything. It's not as if the HTTP request gives us multiple
 Request-URI's. There's a single processing step that has to happen
 somewhere: decoding the bytes of the Request-URI to unicode. For the
 vast majority of apps, it should only happen once. Twice is
 acceptable to me for some apps. As I pointed out in the linked
 email, doing that as soon as possible (i.e. in the WSGI origin
 server) allows URI's to be compared as character strings more
 easily. If you deploy a piece of middleware that transcodes (based
 on more information than servers want to deal with), it had better
 be nearly first in the stack so routing works reliably.
 
 The problem with this whole approach is that it's not
 composable.  You can't stick in an application under a router that
 uses a different method for grokking its subtree of the URI space,
 unless it knows what's been done to the URI and can un-do it.

I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are unicode, the only
answer to "what's been done to the URI?" can be wsgi.uri_encoding, which
allows someone to un-do it. What more do you want?

1. bytes arrive. server decodes with utf8, sets 'wsgi.uri_encoding' to 'utf-8'.
2. middleware says "oops, that's wrong", encodes back to bytes using 'utf-8',
and re-decodes with koi-8, changing wsgi.uri_encoding to 'koi-8'
3. further middlewares and app use the unicode value, and don't really care
what encoding was used.
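
The three steps can be sketched as a piece of middleware.  This assumes
the proposed wsgi.uri_encoding key and str-valued URI variables; the
class name, the target parameter and the choice of 'koi8-r' (a codec
name Python ships) are illustrative only.

```python
class TranscodeURI:
    """Hypothetical middleware: re-decode URI parts with another charset."""

    KEYS = ('SCRIPT_NAME', 'PATH_INFO')

    def __init__(self, app, target='koi8-r'):
        self.app = app
        self.target = target

    def __call__(self, environ, start_response):
        current = environ.get('wsgi.uri_encoding')
        if current and current != self.target:
            try:
                # Step 2: undo the server's decoding, then decode again.
                fixed = {k: environ[k].encode(current).decode(self.target)
                         for k in self.KEYS if k in environ}
            except UnicodeError:
                fixed = None   # round trip failed: keep the server's view
            if fixed is not None:
                environ.update(fixed)
                environ['wsgi.uri_encoding'] = self.target
        # Step 3: everything downstream just sees str values.
        return self.app(environ, start_response)
```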

 Maybe I'm missing something here, but the only way I see to preserve
 composability here is to use latin-1 or bytes.
 
 The fundamental problem is that, like it or not, HTTP headers are
 actually byte strings.  The *only* reason we ever supported unicode
 in WSGI was to handle platforms where there's no such thing as a
 non-unicode string, and there we made it explicit that it's just a
 way of manipulating *bytes*, not unicode.
 
 ISTM that very few (if any) of the proposals floating around for
 modifying WSGI are taking this concept into account.  Most of them
 sound to me like people saying, "yeah, but this particular hack will
 work for *my* apps...  so everybody else must be doing something
 stupid."
 
 But WSGI was built on the principle of *equally inconveniencing
 everyone*, specifically to avoid an impossible attempt at consensus
 between incompatible ways of doing things.  (E.g., nine million
 request/response APIs.)
 
 So, if the only problem we're going to cause by using bytes
 everywhere is to make everyone need to change their routing code on
 Python 3, I vote +1000.  ;-)

That's not the only problem. Using native strings wherever possible makes web
programming in Python easier, regardless of version. In Python 3, that happens
to be unicode, for good reasons.

For HTTP, there's a more specific reason: URI's should be compared for 
equivalence character by character, not byte by byte. See 
http://tools.ietf.org/html/rfc3986#section-6.2.1. That includes routing 
middleware.


Robert Brewer
fuman...@aminus.org



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 01:15 PM 9/21/2009 -0700, Robert Brewer wrote:
I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are
unicode, the only answer to "what's been done to the URI?" can be
wsgi.uri_encoding, which allows someone to un-do it. What more do you want?


To be sure that there's no possible way for all the broken middleware 
out there to mess this up.


Let me put it this way: out of all the times I've seen people post 
example WSGI 1 middleware code, I don't remember *any* where the 
middleware was actually complying with the spec correctly...  and 
that includes examples I wrote myself.  So I'm not real impressed 
with any solution that requires middleware to get it right.


That having been said, I'm beginning to think that PEP 383 
(surrogateescape) is actually the way to go, now that I've looked 
over the PEP, docs, and Ian's posts here about it.


First, it's compatible with CGI (os.environ) right off the bat, as 
well as being the standard way to handle this sort of issue in Python 3.


Second, it's redundancy-free: you don't need a separate environ key 
to know what's going on.


Third, it's unconditional: if you want bytes or a non-UTF-8 encoding 
you perform the same steps every time.


Up until now, I've not paid much attention because so many people
kept saying "you can't get surrogateescape on Python 2."  However,
that's only an issue for code that *needs the original byte string*, 
as the old codec error handler API is sufficient for doing 
decoding.  (Meaning you could register a handler for it on older Pythons.)


I think this approach would let us have our cake and eat it too, for 
the most part.  WSGI on Python 2.x uses byte strings for these, and 
then 3.x works transparently.  It's a bit of a stretch to call it a 
clarification of WSGI 1.0, but since for all intents and purposes 
WSGI doesn't really *run* on Python 3, it might be the way to go.


To be clear, I'm talking about simply allowing (on Python 3 and in 
WSGI versions1.0) for all environ values to be utf-8-decoded, 
surrogate-escaped unicode values, in the native string case.  (This 
would further imply that a CGI gateway would have to check whether 
the system encoding is UTF-8, and if not, transcode accordingly.)
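
A minimal sketch of that gateway-side transcoding step on Python 3,
assuming the incoming string was decoded with the system encoding and
the surrogateescape handler (which is how os.environ behaves on POSIX).
The function name and parameters are hypothetical.

```python
import codecs


def to_wsgi_str(value, system_encoding):
    """Re-express an environment string as UTF-8-decoded, surrogate-escaped text."""
    if codecs.lookup(system_encoding).name == 'utf-8':
        return value                      # already in the proposed form
    # Recover the original bytes, then decode them the way the proposal wants.
    raw = value.encode(system_encoding, 'surrogateescape')
    return raw.decode('utf-8', 'surrogateescape')
```

In a real CGI gateway the second argument would presumably come from
sys.getfilesystemencoding(); it is passed explicitly here so the
behaviour is easy to check.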




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Henry Precheur
On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
 So the same standard should have different behavior on different Python
 versions?  That would make framework code a lot more complicated.

I don't understand why it would be 'a lot more' complicated.

(The following code snippets are Python 3 only, and assume we're using
'native strings' everywhere)

In the gateway, environ would be populated this way:

  environ['some_key'] = some_value.decode('utf8', 'surrogateescape')

Compare that to the utf-8-then-latin-1 alternative:

  try:
      environ['some_key'] = some_value.decode('utf-8')
      environ['some_key.encoding'] = 'utf-8'
  except UnicodeError:
      environ['some_key'] = some_value.decode('latin-1')
      environ['some_key.encoding'] = 'latin-1'


What you would have in the application to get the original value:

  environ['some_key'].encode('utf8', 'surrogateescape')

With utf8-then-latin1:

  environ['some_key'].encode(environ['some_key.encoding'])


The 'surrogateescape' way is clearly simpler. The 'equivalent' Python 2
code is even simpler:

  environ['some_key'] = some_value

And:

  environ['some_key']
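
A quick check of the property the surrogateescape variant relies on,
runnable on Python 3.1 or later: bytes that are not valid UTF-8 survive
the decode/encode round trip unchanged.  The sample bytes are made up.

```python
raw = b'/caf\xe9/\xff'                      # made-up bytes, not valid UTF-8

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))                          # '/caf\udce9/\udcff'

# Encoding with the same handler restores the original bytes.
restored = text.encode('utf-8', 'surrogateescape')
print(restored == raw)                      # True
```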


-- 
  Henry Prêcheur


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Robert Brewer
Henry Precheur wrote:
 On Mon, Sep 21, 2009 at 09:14:13PM +0200, Armin Ronacher wrote:
  So the same standard should have different behavior on different
  Python versions?  That would make framework code a lot more
  complicated.
 
 I don't understand why it would be 'a lot more' complicated.
 
 (The following code snippets are Python 3 only, and assume we're using
 'native strings' everywhere)
 
 In the gateway, environ would be populated this way:
 
   environ['some_key'] = some_value.decode('utf8', 'surrogateescape')
 
 Compare that to the utf-8-then-latin-1 alternative:
 
   try:
       environ['some_key'] = some_value.decode('utf-8')
       environ['some_key.encoding'] = 'utf-8'
   except UnicodeError:
       environ['some_key'] = some_value.decode('latin-1')
       environ['some_key.encoding'] = 'latin-1'
 
 
 What you would have in the application to get the original value:
 
   environ['some_key'].encode('utf8', 'surrogateescape')
 
 With utf8-then-latin1:
 
   environ['some_key'].encode(environ['some_key.encoding'])
 
 
 The 'surrogateescape' way is clearly simpler.

It looks simpler until you have a site that is not primarily utf-8. In
that case, you multiply your (1 line * number of middlewares in the WSGI
stack * each request). With wsgi.uri_encoding you get either (1 line * 1
middleware designed to transcode * each request), or even 0 if your
whole site uses just one charset.


Robert Brewer
fuman...@aminus.org



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Henry Precheur
On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
 It looks simpler until you have a site that is not primarily utf-8. In
 that case, you multiply your (1 line * number of middlewares in the WSGI
 stack * each request).
 With wsgi.uri_encoding you get either (1 line * 1
 middleware designed to transcode * each request), or even 0 if your
 whole site uses just one charset.

I am not sure I understand your point.

The 0 lines hold true if the whole site is using latin-1 or utf-8 and
you write your applications/middlewares only for this site. But if it's
using any other encoding you still have to transcode.

def middleware(environ, start_response):
    value = environ['some_key'].\
        encode('utf8', 'surrogateescape').\
        decode(SITE_ENCODING)
    ...

With wsgi.uri_encoding you would still have to do the following:

def middleware(environ, start_response):
    value = environ['some_key'].\
        encode(environ['some_key.encoding']).\
        decode(SITE_ENCODING)
    ...

Of course you can directly use `environ['some_key']` if you know you'll
get the 'right' encoding all the time. But when the encoding changes,
you'll have to fix all your middlewares.


Am I missing something?

-- 
  Henry Prêcheur


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
2009/9/22 Henry Precheur he...@precheur.org:
 On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
 It looks simpler until you have a site that is not primarily utf-8. In
 that case, you multiply your (1 line * number of middlewares in the WSGI
 stack * each request).
 With wsgi.uri_encoding you get either (1 line * 1
 middleware designed to transcode * each request), or even 0 if your
 whole site uses just one charset.

 I am not sure I understand your point.

 The 0 lines hold true if the whole site is using latin-1 or utf-8 and
 you write your applications/middlewares only for this site. But if it's
 using any other encoding you still have to transcode.

 def middleware(environ, start_response):
     value = environ['some_key'].\
         encode('utf8', 'surrogateescape').\
         decode(SITE_ENCODING)
     ...

 With wsgi.uri_encoding you would still have to do the following:

 def middleware(environ, start_response):
     value = environ['some_key'].\
         encode(environ['some_key.encoding']).\
         decode(SITE_ENCODING)
     ...

 Of course you can directly use `environ['some_key']` if you know you'll
 get the 'right' encoding all the time. But when the encoding changes,
 you'll have to fix all your middlewares.


 Am I missing something?

For one, we aren't talking about arbitrary keys needing this treatment.

We are only talking about SCRIPT_NAME and PATH_INFO.

Everything else from CGI will be passed as ISO-8859-1, and it is up to WSGI
components/applications to explicitly worry about those if they need to
deal with them in special ways, e.g. REQUEST_URI, QUERY_STRING,
HTTP_COOKIE, HTTP_REFERER.

Thus, your use of 'some_key' all the time is a bit confusing when just
trying to scan the emails quickly.
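
To make the split concrete, here is a rough sketch of a gateway that
treats only SCRIPT_NAME and PATH_INFO specially and decodes everything
else as ISO-8859-1.  The function name, the bytes-valued input dict and
the utf-8-then-iso-8859-1 fallback for the two path keys are
assumptions for the example, not settled behaviour.

```python
URI_KEYS = ('SCRIPT_NAME', 'PATH_INFO')


def decode_cgi_vars(raw_vars, uri_encoding='utf-8'):
    """raw_vars: dict of str name -> bytes value (hypothetical gateway input)."""
    # ISO-8859-1 is lossless: value.encode('iso-8859-1') gives the bytes back.
    environ = {k: v.decode('iso-8859-1') for k, v in raw_vars.items()}
    try:
        decoded = {k: raw_vars[k].decode(uri_encoding)
                   for k in URI_KEYS if k in raw_vars}
    except UnicodeDecodeError:
        # Keep the ISO-8859-1 values already stored for the path keys.
        environ['wsgi.uri_encoding'] = 'iso-8859-1'
    else:
        environ.update(decoded)
        environ['wsgi.uri_encoding'] = uri_encoding
    return environ
```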

Graham


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 03:26 PM 9/21/2009 -0700, Robert Brewer wrote:

It looks simpler until you have a site that is not primarily utf-8. In
that case, you multiply your (1 line * number of middlewares in the WSGI
stack * each request). With wsgi.uri_encoding you get either (1 line * 1
middleware designed to transcode * each request), or even 0 if your
whole site uses just one charset.


Unless I'm misunderstanding something, you end up adding an extra
"if" statement *everywhere*, to check whether wsgi.uri_encoding is
what you want it to be or not.


(Btw, this whole notion of talking about WSGI "sites" also doesn't
make sense, since WSGI doesn't have "sites", it has
recursively-composable application objects.  Sure, if you're using a
monolithic framework, you can think of applications as unified
entities, but that's not true of WSGI as a whole.)




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Mark Nottingham
HTTP headers *are* ASCII; RFC2616 defined them to be ISO-8859-1, but  
HTTPbis currently takes the stance that they're ASCII, as in practice  
Latin-1 isn't used and may introduce interop problems.


http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-07#section-4.2

   Historically, HTTP has allowed field-content with text in the
   ISO-8859-1 [ISO-8859-1] character encoding (allowing other character
   sets through use of [RFC2047] encoding).  In practice, most HTTP
   header field-values use only a subset of the US-ASCII charset
   [USASCII].  Newly defined header fields SHOULD constrain their
   field-values to US-ASCII characters.  Recipients SHOULD treat other
   (obs-text) octets in field-content as opaque data.


What does it mean to support non-ASCII headers? As per above, the  
only sane thing to do is treat them as opaque data, because you can't  
be certain of their encoding unless you have knowledge of the header.
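
One way to read "opaque data" in code, as a sketch only: hand back text
when the field value is pure ASCII and leave anything else as
undecoded bytes.  The helper name header_value is hypothetical.

```python
def header_value(raw):
    """Return str for pure-ASCII field values, else the bytes unchanged."""
    try:
        return raw.decode('ascii')
    except UnicodeDecodeError:
        # obs-text: the encoding is unknown, so keep the octets opaque.
        return raw
```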




On 21/09/2009, at 12:50 AM, Armin Ronacher wrote:

Also (something I haven't yet filed as a bug because I guess there will
be more changes involved) the HTTP server in Python 3.1 does not support
non-ASCII headers.



--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Mark Nottingham
+1. There is no one answer for these issues (e.g., URI-IRI conversion  
can lose information), so low-level infrastructure like WSGI shouldn't  
be making choices for people.



On 22/09/2009, at 5:31 AM, P.J. Eby wrote:


At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
I still don't see why the environ should have multiple versions of  
anything. It's not as if the HTTP request gives us multiple Request- 
URI's. There's a single processing step that has to happen  
somewhere: decoding the bytes of the Request-URI to unicode. For  
the vast majority of apps, it should only happen once. Twice is  
acceptable to me for some apps. As I pointed out in the linked  
email, doing that as soon as possible (i.e. in the WSGI origin  
server) allows URI's to be compared as character strings more  
easily. If you deploy a piece of middleware that transcodes (based  
on more information than servers want to deal with), it had better  
be nearly first in the stack so routing works reliably.


The problem with this whole approach is that it's not composable.   
You can't stick in an application under a router that uses a  
different method for grokking its subtree of the URI space, unless  
it knows what's been done to the URI and can un-do it.


Maybe I'm missing something here, but the only way I see to preserve  
composability here is to use latin-1 or bytes.


The fundamental problem is that, like it or not, HTTP headers are  
actually byte strings.  The *only* reason we ever supported unicode  
in WSGI was to handle platforms where there's no such thing as a non- 
unicode string, and there we made it explicit that it's just a way  
of manipulating *bytes*, not unicode.


ISTM that very few (if any) of the proposals floating around for
modifying WSGI are taking this concept into account.  Most of them
sound to me like people saying, "yeah, but this particular hack will
work for *my* apps...  so everybody else must be doing something
stupid."


But WSGI was built on the principle of *equally inconveniencing  
everyone*, specifically to avoid an impossible attempt at consensus  
between incompatible ways of doing things.  (E.g., nine million  
request/response APIs.)


So, if the only problem we're going to cause by using bytes  
everywhere is to make everyone need to change their routing code on  
Python 3, I vote +1000.  ;-)





--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Mark Nottingham
"Most things" is not the Web. How will you handle serving images through
WSGI? Compressed content?  PDFs?



On 22/09/2009, at 1:30 AM, René Dudfield wrote:


here is a summary:
   Apart from python3 compatibility(which should be good enough
reason), utf-8 is what's used in http a lot these days.  Most things
layered on top of wsgi are using utf-8 (django etc), and lots of web
clients are using utf-8 (firefox etc).

Why not move to unicode?



--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Massimo Di Pierro

+1

On Sep 21, 2009, at 8:28 PM, Mark Nottingham wrote:


+1. There is no one answer for these issues (e.g., URI-IRI conversion
can lose information), so low-level infrastructure like WSGI shouldn't
be making choices for people.


On 22/09/2009, at 5:31 AM, P.J. Eby wrote:


At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:

I still don't see why the environ should have multiple versions of
anything. It's not as if the HTTP request gives us multiple Request-
URI's. There's a single processing step that has to happen
somewhere: decoding the bytes of the Request-URI to unicode. For
the vast majority of apps, it should only happen once. Twice is
acceptable to me for some apps. As I pointed out in the linked
email, doing that as soon as possible (i.e. in the WSGI origin
server) allows URI's to be compared as character strings more
easily. If you deploy a piece of middleware that transcodes (based
on more information than servers want to deal with), it had better
be nearly first in the stack so routing works reliably.


The problem with this whole approach is that it's not composable.
You can't stick in an application under a router that uses a
different method for grokking its subtree of the URI space, unless
it knows what's been done to the URI and can un-do it.

Maybe I'm missing something here, but the only way I see to preserve
composability here is to use latin-1 or bytes.

The fundamental problem is that, like it or not, HTTP headers are
actually byte strings.  The *only* reason we ever supported unicode
in WSGI was to handle platforms where there's no such thing as a non-
unicode string, and there we made it explicit that it's just a way
of manipulating *bytes*, not unicode.

ISTM that very few (if any) of the proposals floating around for
modifying WSGI are taking this concept into account.  Most of them
sound to me like people saying, yeah, but this particular hack will
work for *my* apps...  so everybody else must be doing something
stupid.

But WSGI was built on the principle of *equally inconveniencing
everyone*, specifically to avoid an impossible attempt at consensus
between incompatible ways of doing things.  (E.g., nine million
request/response APIs.)

So, if the only problem we're going to cause by using bytes
everywhere is to make everyone need to change their routing code on
Python 3, I vote +1000.  ;-)




--
Mark Nottingham http://www.mnot.net/





Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 Most things is not the Web. How will you handle serving images through WSGI?
 Compressed content?  PDFs?

You are perhaps misunderstanding something. A WSGI application still
should return bytes.

The whole concept of any sort of fallback to allow unicode data to be
returned for response content was purely so the canonical hello world
application as per Python 2.X could still be used on Python 3.X.

So, we aren't saying that the only thing WSGI applications can return
is unicode strings for response content.

Have you read my original blog post that triggered all this discussion
this time around?

Graham

 On 22/09/2009, at 1:30 AM, René Dudfield wrote:

 here is a summary:
   Apart from python3 compatibility(which should be good enough
 reason), utf-8 is what's used in http a lot these days.  Most things
 layered on top of wsgi are using utf-8 (django etc), and lots of web
 clients are using utf-8 (firefox etc).

 Why not move to unicode?


 --
 Mark Nottingham     http://www.mnot.net/




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Mark Nottingham

Reference?


On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:


2009/9/22 Mark Nottingham m...@mnot.net:
Most things is not the Web. How will you handle serving images through
WSGI?  Compressed content?  PDFs?


You are perhaps misunderstanding something. A WSGI application still
should return bytes.

The whole concept of any sort of fallback to allow unicode data to be
returned for response content was purely so the canonical hello world
application as per Python 2.X could still be used on Python 3.X.

So, we aren't saying that the only thing WSGI applications can return
is unicode strings for response content.

Have you read my original blog post that triggered all this discussion
this time around?

Graham


On 22/09/2009, at 1:30 AM, René Dudfield wrote:


here is a summary:
  Apart from python3 compatibility(which should be good enough
reason), utf-8 is what's used in http a lot these days.  Most things
layered on top of wsgi are using utf-8 (django etc), and lots of web
clients are using utf-8 (firefox etc).

Why not move to unicode?



--
Mark Nottingham http://www.mnot.net/





--
Mark Nottingham http://www.mnot.net/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
2009/9/22 Mark Nottingham m...@mnot.net:
 Reference?

See:

  http://blog.dscpl.com.au/2009/09/roadmap-for-python-wsgi-specification.html

Anyone else jumping in on this conversation with their own opinions
and who has not read it, should perhaps at least read that. Also read
some of the earlier posts in the numerous discussions this spawned at:

  http://groups.google.com/group/python-web-sig?lnk=

as the current thinking isn't exactly what I blogged about and has
shifted a bit as the discussion has progressed.

Graham

 On 22/09/2009, at 12:07 PM, Graham Dumpleton wrote:

 2009/9/22 Mark Nottingham m...@mnot.net:

 Most things is not the Web. How will you handle serving images through
 WSGI?
 Compressed content?  PDFs?

 You are perhaps misunderstanding something. A WSGI application still
 should return bytes.

 The whole concept of any sort of fallback to allow unicode data to be
 returned for response content was purely so the canonical hello world
 application as per Python 2.X could still be used on Python 3.X.

 So, we aren't saying that the only thing WSGI applications can return
 is unicode strings for response content.

 Have you read my original blog post that triggered all this discussion
 this time around?

 Graham

 On 22/09/2009, at 1:30 AM, René Dudfield wrote:

 here is a summary:
  Apart from python3 compatibility(which should be good enough
 reason), utf-8 is what's used in http a lot these days.  Most things
 layered on top of wsgi are using utf-8 (django etc), and lots of web
 clients are using utf-8 (firefox etc).

 Why not move to unicode?


 --
 Mark Nottingham     http://www.mnot.net/




 --
 Mark Nottingham     http://www.mnot.net/




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
Armin is fast asleep now, so my shift. :-)

He did point me to this specific email for closer attention,
indicating issues with QUERY_STRING and wsgi.uri_encoding due to
something mentioned here. I didn't quite get what he was talking
about, but then I believe he has some wrong statements in his PEP-XXX
about QUERY_STRING. I'll make a few of my own comments about this
email, and then maybe those who are still awake can help in
understanding issues raised here.

2009/9/22 And Clover and...@doxdesk.com:
 Armin Ronacher wrote:

 The middleware can never know.

 It's much more likely to know than the server though!

 WSGI will demand UTF-8 URLs and only
 provide iso-XXX support for backwards compatibility.

 It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs
 break as soon as they coincidentally happen to be UTF-8 byte sequences. I'm
 as much an advocate of UTF-8 for everything everywhere! as anyone else,
 but unfortunately today there are still dark places where you need non-UTF-8
 URLs.

The URLs don't break. As mentioned elsewhere, though perhaps not clearly
enough, if it is known that an application or some subset of URLs will
always receive requests in a non-UTF-8 encoding, then it should employ
code in those cases to always transcode to the required encoding. Thus
something like:

import codecs
iso_8859_7 = codecs.lookup('iso-8859-7')

def redecode(string, encoding):
    return string.encode(encoding).decode('iso-8859-7')

if codecs.lookup(environ['wsgi.uri_encoding']) != iso_8859_7:
    environ['PATH_INFO'] = redecode(environ['PATH_INFO'],
                                    environ['wsgi.uri_encoding'])
    environ['SCRIPT_NAME'] = redecode(environ['SCRIPT_NAME'],
                                      environ['wsgi.uri_encoding'])
    environ['wsgi.uri_encoding'] = 'iso-8859-7'

This could be a part of the actual application if needing to be
selective based on URLs, or as a WSGI middleware that can adjust it
and which wraps the WSGI application.
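
As a rough sketch of the middleware variant, assuming the proposed `wsgi.uri_encoding` key (not part of the existing WSGI spec) and a hypothetical wrapper name `force_uri_encoding`, the same transcoding could be packaged like this:

```python
import codecs

def force_uri_encoding(app, wanted="iso-8859-7"):
    """Wrap `app` so SCRIPT_NAME/PATH_INFO are always decoded as `wanted`."""
    wanted_name = codecs.lookup(wanted).name

    def middleware(environ, start_response):
        used = environ.get("wsgi.uri_encoding")  # proposed key from this thread
        if used is not None and codecs.lookup(used).name != wanted_name:
            for key in ("SCRIPT_NAME", "PATH_INFO"):
                value = environ.get(key, "")
                # Undo the server's decoding, then decode with the wanted encoding.
                environ[key] = value.encode(used).decode(wanted)
            environ["wsgi.uri_encoding"] = wanted
        return app(environ, start_response)

    return middleware
```

Wrapping only the components that need the other encoding keeps the rest of the stack untouched; what to do when the key is missing or None is left open here.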

The other fallback is that a specific WSGI server could elect to
provide an option to not use 'UTF-8' as the first choice for decoding
and instead use a user-supplied value from the WSGI server's
configuration. Robert already showed in pseudo code what the WSGI
server would do:

   try:
       decode_uri(userdefault or 'utf-8')
   except UnicodeDecodeError:
       decode_uri('iso-8859-1')
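
A runnable reading of that pseudo code might look like the sketch below. The function signature and the returned (text, encoding) pair are assumptions made for illustration; the thread does not define them.

```python
def decode_uri(raw, user_default=None):
    """Decode raw URI bytes; return (text, name of the encoding actually used)."""
    first_choice = user_default or "utf-8"
    try:
        return raw.decode(first_choice), first_choice
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

The second element of the result is what a server would publish as `wsgi.uri_encoding` under this proposal.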

For a pure Python WSGI server, which effectively only supports
mounting at root of site, then this may apply to whole site. In
Apache/mod_wsgi however, where using Location directive in Apache one
can easily apply configuration to a sub set of URLs, one could be more
selective. It gets more complicated when one talks about composition
of disparate WSGI components as part of an application stack.

Now, although having the configuration be done outside of the WSGI
application and in the web server will not appeal to some, it still
may be a useful fallback for where people don't want to have to fiddle
with using WSGI middleware wrappers around their whole application or
around individual components to do it.

Anyway, there are multiple options here.

 Incidentally, if wsgi.uri_encoding is going to be the way to signal that the
 server has decoded bytes to characters using a known encoding, it should be
 stressed that this should only be set when that encoding is certain.

 That is, wsgi.uri_encoding should be omitted (or None?) in cases where
 another party has already decoded (and maybe mangled) the bytes using an
 unknown encoding. In particular, CGI.

Yes, it is known that CGI and Python 3.X will be a problem. There has
been a number of discussions which raised the CGI issues in the past.
This time around we were possibly ignoring it for time being so that
CGI script compatibility wasn't going to exclusively override us
trying to make something that would work sanely for more up to date
hosting methods.

So, yes, setting wsgi.uri_encoding to None where the encoding cannot be
determined would be sensible. It may be the case that in such situations
the only thing people can portably rely on is being able to use ASCII.
If they know for sure what is used, they could set wsgi.uri_encoding
themselves in a WSGI middleware wrapper around their application, or the
CGI/WSGI adapter could provide an option that lets the user set it, so
that the adapter reports the user's value but otherwise leaves the
variables as they were.

 (In the case of Windows CGI the server will have decoded URI bytes into
 Unicode characters, using a charset which it is impossible to find out. In
 Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF
 sequence, otherwise it's the system codepage. This problem affects the
 non-CGI implementation isapi_wsgi, too. Then the variables are read as
 environment variables, which for Python 2 means another encode/decode step
 on Windows using the system codepage, mangling non-codepage characters.
 Python 3 has the opposite problem reading byte envvars using 

Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Robert Brewer
Henry Precheur wrote:
 On Mon, Sep 21, 2009 at 03:26:35PM -0700, Robert Brewer wrote:
  It looks simpler until you have a site that is not primarily utf-8.
  In that case, you multiply your (1 line * number of middlewares in the
  WSGI stack * each request).
  With wsgi.uri_encoding you get either (1 line * 1
  middleware designed to transcode * each request), or even 0 if your
  whole site uses just one charset.
 
 I am not sure I understand your point.
 
 The 0 lines hold true if the whole site is using latin-1 or utf-8 and
 you write your applications/middlewares only for this site. But if it's
 using any other encoding you still have to transcode.
 
 def middleware(environ, start_response):
     value = environ['some_key'].\
         encode('utf8', 'surrogateescape').\
         decode(SITE_ENCODING)
     ...

Yes; you have to transcode to the correct encoding. Once. Then every
other WSGI application interface below that one doesn't have to care.

 With wsgi.uri_encoding you would still have to do the following:
 
 def middleware(environ, start_response):
     value = environ['some_key'].\
         encode(environ['some_key.encoding']).\
         decode(SITE_ENCODING)
     ...
 
 Of course you can directly use `environ['some_key']` if you know you'll
 get the 'right' encoding all the time. But when the encoding changes,
 you'll have to fix all your middlewares.

The decoding doesn't change spontaneously. You either get the correct
one or you get an incorrect one. If it's incorrect, you fix it, one
time, via a WSGI component which you've configured to determine the
correct decoding. Then every other WSGI component below that one can
go back to trusting the decoding was correct. In fact, if you do that
transcoding right away, no other WSGI components need to be rewritten to
take advantage of unicode. You just have to deploy a single transcoder,
that's 6 lines of code max. I know PJE will chime in here and say you
can't deploy a website that works differently if you happen to forget to
turn on a given piece of middleware, but I also know the rest of you
will drown him out from personal experience because you've *never* done
that. ;)

With utf8+surrogateescape, you don't transcode once, you transcode in
every WSGI component in your stack that needs to correct the decoding.
You have to do it more than once because, each time you
encode/re-decode, you use the result and then throw it away. Any
subsequent WSGI components have to encode/re-decode--you cannot store
the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
utf8+surrogateescape scheme says...well, it's always utf8-decoded. In
addition, *every* component that needs to compare URI's then has to be
configured with the same logic, however convoluted, to perform the
correct decoding again. It's not just routing middleware: caches need
to reliably compare decoded URI's; so do sessions; so does auth
(especially!); so do static files. And Heaven forfend you actually
decode differently in two different components!


Robert Brewer
fuman...@aminus.org



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Henry Precheur
On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:
 The decoding doesn't change spontaneously.
 You either get the correct one or you get an incorrect one. If it's
 incorrect, you fix it, one time, via a WSGI component which you've
 configured to determine the correct decoding. Then every other WSGI
 component below that one can go back to trusting the decoding was
 correct. In fact, if you do that transcoding right away, no other WSGI
 components need to be rewritten to take advantage of unicode. You just
 have to deploy a single transcoder, that's 6 lines of code max.

And you can do that with utf8+surrogateescape too. Except that you don't
have to determine what encoding the gateway sent you, it's always
utf8+surrogateescape.

 With utf8+surrogateescape, you don't transcode once, you transcode in
 every WSGI component in your stack that needs to correct the
 decoding. You have to do it more than once because, each time you
 encode/re-decode, you use the result and then throw it away. Any
 subsequent WSGI components have to encode/re-decode--you cannot store
 the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
 utf8+surrogateescape scheme says...well, it's always utf8-decoded.

You are missing something REALLY important about surrogateescape: you can
ALWAYS get the original bytes back.

>>> b = b'fran\xe7cois'
>>> s = b.decode('utf8', 'surrogateescape')
>>> s
'fran\udce7cois'
>>> s.encode('utf8', 'surrogateescape')
b'fran\xe7cois'

See? I got my latin-1 byte '\xe7' back! That works because '\udce7' is
not a normal character: surrogateescape maps each undecodable byte to a
lone surrogate code point (U+DC80 to U+DCFF), which never appears in
valid UTF-8.

The only thing you have to do is to pass 'surrogateescape' each time you
call encode/decode.

 In addition, *every* component that needs to compare URI's then has to
 be configured with the same logic, however convoluted, to perform the
 correct decoding again. It's not just routing middleware: caches
 need to reliably compare decoded URI's; so do sessions; so does auth
 (especially!); so do static files. And Heaven forfend you actually
 decode differently in two different components!

I don't understand why I would need to throw away the decoded string.

This works perfectly well as far as I know:

environ['PATH_INFO'] = environ['PATH_INFO'].\
  encode('utf8', 'surrogateescape').\
  decode(SITE_ENCODING)

utf8+surrogateescape provides the same possibilities as
wsgi.uri_encoding. You can transcode without losing information when you
know what the correct encoding is. But utf8+surrogateescape is simpler
because there's no need to pass around the name of the encoding in an
additional variable.
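
The equivalence claimed above can be checked with a small sketch. SITE_ENCODING and the sample bytes are assumptions chosen for the example, and the `wsgi.uri_encoding` key is the one proposed in this thread; it needs Python 3.1 or later for surrogateescape.

```python
SITE_ENCODING = "iso-8859-1"  # assumption for this example

raw = b"/fran\xe7ois"  # bytes as they arrived on the wire (not valid UTF-8)

# utf8+surrogateescape: the gateway always decodes the same way.
from_gateway = raw.decode("utf-8", "surrogateescape")
via_surrogates = (from_gateway
                  .encode("utf-8", "surrogateescape")
                  .decode(SITE_ENCODING))

# wsgi.uri_encoding: the gateway falls back to iso-8859-1 and says so.
environ = {"PATH_INFO": raw.decode("iso-8859-1"),
           "wsgi.uri_encoding": "iso-8859-1"}
via_key = (environ["PATH_INFO"]
           .encode(environ["wsgi.uri_encoding"])
           .decode(SITE_ENCODING))

# Both routes recover the same text from the same original bytes.
assert via_surrogates == via_key
```

The difference between the two schemes is therefore not what can be recovered, but whether the application has to look up how the gateway decoded the value.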

-- 
  Henry Prêcheur


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 07:40 PM 9/21/2009 -0700, Robert Brewer wrote:

Yes; you have to transcode to the correct encoding. Once. Then every
other WSGI application interface below that one doesn't have to care.


You can only do that if you *break encapsulation*, which as I said 
earlier is voiding the entire point of having a modular interface.


Having a configurable encoding just means that *every* WSGI 
application *must* verify the encoding in order to be safe.  I'm all 
in favor of making everyone suffer equally, but all else being equal, 
I'd prefer them to suffer idempotently rather than conditionally.  ;-)




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 07:21 PM 9/21/2009 -0700, Robert Brewer wrote:
I've never proposed that WSGI make choices for people. I'm simply 
saying that a configurable server, with a sane, perfectly-reversible 
default, is the simplest thing that could possibly work.


Actually, latin-1 bytes encoding is the *simplest* thing that could 
possibly work, since it works already in e.g. Jython, and is actually 
in the spec already...  and any framework that wants unicode URIs 
already has to decode them, so the code is already written. 
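
A short sketch of why latin-1 is byte-preserving, and of the decoding step a framework would already perform. The helper name `path_info_unicode` and its default charset are assumptions for illustration only.

```python
# latin-1 maps each of the 256 byte values to the code point of the same
# number, so decoding and re-encoding never loses information.
every_byte = bytes(range(256))
as_text = every_byte.decode("latin-1")
assert as_text.encode("latin-1") == every_byte

def path_info_unicode(environ, charset="utf-8"):
    """Recover the original bytes from a latin-1 str, then decode them properly."""
    return environ["PATH_INFO"].encode("latin-1").decode(charset)
```

For example, a server that received the UTF-8 bytes for "/françois" would hand the application a latin-1 str, and `path_info_unicode` would turn that back into the intended text.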




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Ian Bicking
On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton 
graham.dumple...@gmail.com wrote:

  Of course you can directly use `environ['some_key']` if you know you'll
  get the 'right' encoding all the time. But when the encoding changes,
  you'll have to fix all your middlewares.
 
 
  Am I missing something?

 For one, we aren't talking about arbitrary keys needing this treatment.

 We are only talking about SCRIPT_NAME and PATH_INFO.


OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and
introduce two equivalent variables that hold the NOT url-decoded values.  So
if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'.

This will be quite disruptive, as these are variables that are frequently
accessed directly (libraries that expose them as attributes can just turn
them into properties that do URL decoding, using UTF8).  But it's an easy
fix at least.  I would actually want to specify that if we added this key,
we should disallow the old keys -- terrible confusion could ensue from both
in the environ.  This also fixes the problem with not being able to
distinguish %2F from /, which isn't a big problem but is annoying, and is
hiding meaningful information.  (I believe the relevant spec does
distinguish between these two values -- i.e., ideally decoding should happen
on path segments, each segment separated by a real /.)
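
A minimal sketch of per-segment decoding, assuming a still-quoted path such as the proposed PATH_INFO_RAW value. It uses Python 3's `urllib.parse.unquote`, and `path_segments` is a hypothetical helper name.

```python
from urllib.parse import unquote

def path_segments(raw_path, encoding="utf-8"):
    """Split a still-quoted path on real slashes, then unquote each segment."""
    return [unquote(segment, encoding=encoding)
            for segment in raw_path.split("/") if segment]
```

Because the split happens before unquoting, `/a%2Fb/c` yields the two segments `a/b` and `c`, which is the distinction that is lost once the server has already URL-decoded PATH_INFO.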

If we do that, then the only really tricky thing left is HTTP_COOKIE, and
since the Cookie header is a mess then HTTP_COOKIE will be a mess and we
just have to figure out a hacky way to deal with that.  Maybe
surrogateescape, but probably just Latin1 would be fine (and easy to do in
Python 2).

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
2009/9/22 Ian Bicking i...@colorstudy.com:
 On Mon, Sep 21, 2009 at 6:16 PM, Graham Dumpleton
 graham.dumple...@gmail.com wrote:

  Of course you can directly use `environ['some_key']` if you know you'll
  get the 'right' encoding all the time. But when the encoding changes,
  you'll have to fix all your middlewares.
 
 
  Am I missing something?

 For one, we aren't talking about arbitrary keys needing this treatment.

 We are only talking about SCRIPT_NAME and PATH_INFO.

 OK, another proposal entirely: we kill SCRIPT_NAME and PATH_INFO, and
 introduce two equivalent variables that hold the NOT url-decoded values.  So
 if you request /fran%e7cois then environ['PATH_INFO_RAW'] is '/fran%e7cois'.
 This will be quite disruptive, as these are variables that are frequently
 accessed directly (libraries that expose them as attributes can just turn
 them into properties that do URL decoding, using UTF8).  But it's an easy
 fix at least.  I would actually want to specify that if we added this key,
 we should disallow the old keys -- terrible confusion could ensue from both
 in the environ.  This also fixes the problem with not being able to
 distinguish %2F from /, which isn't a big problem but is annoying, and is
 hiding meaningful information.  (I believe the relevant spec does
 distinguish between these two values -- i.e., ideally decoding should happen
 on path segments, each segment separated by a real /.)
 If we do that, then the only really tricky thing left is HTTP_COOKIE, and
 since the Cookie header is a mess then HTTP_COOKIE will be a mess and we
 just have to figure out a hacky way to deal with that.  Maybe
 surrogateescape, but probably just Latin1 would be fine (and easy to do in
 Python 2).

That may be fine for pure Python web servers where you control the
split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place,
but you don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc., as
that is done by the web server. Also, as pointed out in my blog,
because of rewrites in the web server, it may be difficult to map
SCRIPT_NAME and PATH_INFO back onto the REQUEST_URI provided in order to
reclaim the original characters. There is also the problem that FASTCGI
often totally stuffs up the SCRIPT_NAME/PATH_INFO split anyway, so
manual overrides are needed to tweak them.

Graham


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Ian Bicking
On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton 
graham.dumple...@gmail.com wrote:

 That may be fine for pure Python web servers where you control the
 split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place
 but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as
 that is done by the web server. Also, as pointed out in my blog,
 because of rewrites in web server, it may be difficult to try and map
 SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and
 reclaim original characters. There is also the problem that often
 FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and
 manual overrides needed to tweak them.


When things get messed up I recommend people use a middleware
(paste.deploy.config.PrefixMiddleware, though I don't really care what they
use) to fix up the request to be correct.  Pulling it from REQUEST_URI would
be fine.
Also, at worst, you can do environ['SCRIPT_NAME_RAW'] =
urllib.quote(environ.pop('SCRIPT_NAME')).  It sucks, but if that's all the
information you have, then that's all the information you have.  Or try to
get the information from REQUEST_URI the hard way, once at the gateway
level.
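
The worst-case fallback above is written with Python 2's `urllib.quote`. A Python 3 sketch of the same idea follows; it assumes the native string holds latin-1-decoded bytes, as discussed elsewhere in this thread, and SCRIPT_NAME_RAW is only the placeholder name used in this message.

```python
from urllib.parse import quote

def add_raw_script_name(environ):
    """Replace SCRIPT_NAME with a re-quoted SCRIPT_NAME_RAW (lossy fallback)."""
    script_name = environ.pop("SCRIPT_NAME", "")
    # Assumption: the str holds latin-1-decoded bytes, so encoding it back
    # with latin-1 recovers the bytes the server saw after URL-decoding.
    environ["SCRIPT_NAME_RAW"] = quote(script_name.encode("latin-1"))
    return environ
```

This cannot tell an original %2F from a literal slash, since that information was already discarded when the server decoded the path.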

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Graham Dumpleton
2009/9/22 Ian Bicking i...@colorstudy.com:
 On Tue, Sep 22, 2009 at 12:21 AM, Graham Dumpleton
 graham.dumple...@gmail.com wrote:

 That may be fine for pure Python web servers where you control the
 split of REQUEST_URI into SCRIPT_NAME and PATH_INFO in the first place
 but don't have that luxury in Apache or via FASTCGI/SCGI/CGI etc as
 that is done by the web server. Also, as pointed out in my blog,
 because of rewrites in web server, it may be difficult to try and map
 SCRIPT_NAME and PATH_INFO back into REQUEST_URI provided to try and
 reclaim original characters. There is also the problem that often
 FASTCGI totally stuffs up SCRIPT_NAME/PATH_INFO split anyway and
 manual overrides needed to tweak them.

 When things get messed up I recommend people use a middleware
 (paste.deploy.config.PrefixMiddleware, though I don't really care what they
 use) to fix up the request to be correct.  Pulling it from REQUEST_URI would
 be fine.
 Also, at worst, you can do environ['SCRIPT_NAME_RAW'] =
 urllib.quote(environ.pop('SCRIPT_NAME')).  It sucks, but if that's all the
 information you have, then that's all the information you have.  Or try to
 get the information from REQUEST_URI the hard way, once at the gateway
 level.

Probably doable to just reverse it using underlying raw bytes. At
least in mod_wsgi the SCRIPT_NAME/PATH_INFO split is always correct,
unless people really screw it up by using WSGIScriptAliasMatch or
AliasMatch wrongly.

If doing something like you suggest, would prefer them as 'wsgi.'
prefixed variables and not put in all upper case namespace to be
confused with CGI variables etc.

Graham



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread Ian Bicking
On Tue, Sep 22, 2009 at 12:38 AM, Graham Dumpleton 
graham.dumple...@gmail.com wrote:

 If doing something like you suggest, would prefer them as 'wsgi.'
 prefixed variables and not put in all upper case namespace to be
 confused with CGI variables etc.


I just had to make up a name, but I agree with your suggestion for wsgi.X
(we already have wsgi.url_scheme, after all).

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread P.J. Eby

At 02:30 PM 9/22/2009 +1000, Graham Dumpleton wrote:

Someone did say something about being able to half make it work on
Python 2.X. Can someone properly provide example code for Python 2.X.


The issue is that error handlers on encode are only allowed to 
provide substitute unicode characters, not substitute bytes.  That's 
why it can only half work on 2.x.
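
For comparison, here is a sketch of the round trip that Python 3.1+ allows with the surrogateescape error handler (PEP 383), which is the half that 2.x cannot do on encode. The sample bytes are made up for illustration.

```python
# A path mixing a stray latin-1 byte (0xE9) with valid UTF-8 (0xC3 0xA9).
raw = b'/caf\xe9/\xc3\xa9'

# Decoding: the undecodable byte becomes a lone surrogate instead of raising.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding: the handler maps that surrogate back to the original byte,
# so the exact bytes received from the client are recoverable.
round_tripped = text.encode('utf-8', 'surrogateescape')
```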




If we want uniformity in how the interface works on Python 2.X and 3.X,
then we have to be able to use the same method without tricks. This is why
wsgi.uri_encoding at the moment seems better, as it is not reliant on a
feature only in Python 3.1+.


If we want uniformity in the interface, then we should continue using 
latin-1, which already works today.  Yes, it sucks, but it sucks *uniformly*.


There really isn't going to be a solution that satisfies *all* of the 
criteria we're batting around, for *all* the users.  What's happening 
is that the principals are focused on different scenarios, where all 
their criteria can be met at the expense of others'.


I'm tending to flip-flop a bit myself, because my goal is that 
*nobody* wins, in the sense of having an advantaged framework, 
server, programming paradigm, etc. relative to others.  And that 
means there are more ways of doing it that would be acceptable to 
me.  For example, all bytes, all latin-1, all surrogateescape...  I 
don't care all that terribly much between them, I just want it to be 
uniform for everybody using/implementing the spec.  (And that also 
means I want it uniform across all keys, not just the URI ones; I 
don't want to have to remember which ones are special cases.)


If some people need to do more code because of their particular codec 
requirements, that's okay by me, as long as it's *unconditional* code 
that doesn't depend on some sort of configuration rigamarole.  That 
makes the spec brittle, because nobody's going to test their edge 
cases, and then the consumers of the code are gonna be the ones 
getting screwed over.


Frankly, 90% of WSGI code written will never even check the 
wsgi.version number, so why would we think anybody's going to 
actually check wsgi.uri_encoding?  That's just building in the suck 
from day one.  No offense intended to the proposer of it; it's a fine 
solution for a single project's API, but it's just not going to scale.


We already know this, because most WSGI code written is not to 
spec.  The ones of us here in the room talking about this are *not* 
good examples of average WSGI programmers, because (hopefully) we've 
all at least studied the spec and endeavored to fully grok and 
conform to it.  (Hell, an unfortunately large number of people think 
you're supposed to use write() or yield to send *individual lines* of text.)


So you better believe that everybody else is going to copy the worst 
available examples of other people's WSGI code and ignore any 
documentation associated with it...  and then they will expect it to 
work on your server.  ;-)


Thus, our target audience is people who will rotely copy...  which 
means we need an API they can either copy by rote, or know is wrong 
when they get an error message.  Conditionals and error handling are 
too much to ask of them, as is remembering different rules for 
different environ keys that all kind of look alike.  (There's a 
reason we required ALL_CAPS keys to be the same type in the first spec.)




[Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Armin Ronacher
Hello everybody,

Thanks to Graham Dumpleton and Robert Brewer there is some serious
progress on WSGI currently.  I proposed a roadmap with some PEP changes
now that need some input.

Summary:

  WSGI 1.0   stays the same as PEP 0333 currently is
  WSGI 1.1   becomes what Ian and I added to PEP 0333
  WSGI 2.0   becomes a unicode powered version of WSGI 1.1
  WSGI 3.0   becomes WSGI 2.0 just without start_response

  WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python
  3 because of changes in the standard library that no longer work with
  a byte-only approach.


The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/
Neither the wording nor the changes in there are anywhere near final.


Graham wrote down two questions he wants every major framework developer
to answer.  These should guide the way to new WSGI standards:

1. Do we keep bytes everywhere forever in Python 2.X, or try to
   introduce unicode there at all to at least mirror what changes might
   be made to make WSGI workable in Python 3.X?

2. Do we skip WSGI 1.X completely for Python 3.X and go straight to
   WSGI 2.0 for Python 3.X?

I added a new question I think should be asked too:

3. Do we skip WSGI 2.0 as specified in the PEP and go straight to
   WSGI 3.0 and drop start_response?
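
To make the start_response question concrete, here is a sketch contrasting the PEP 333 calling convention with a return-value style. The second form is an assumption for illustration only: the draft PEPs are not final, and the exact shape (here a (status, headers, body) tuple) as well as the adapter name are hypothetical.

```python
def app_with_start_response(environ, start_response):
    """PEP 333 style: status and headers go through a callback."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello\n']


def app_returning_tuple(environ):
    """Hypothetical style without start_response: everything is returned."""
    return '200 OK', [('Content-Type', 'text/plain')], [b'Hello\n']


def adapt_to_start_response(app):
    """Hypothetical adapter: run a tuple-returning app under a PEP 333 server."""
    def wrapper(environ, start_response):
        status, headers, body = app(environ)
        start_response(status, headers)
        return body
    return wrapper
```

An adapter in this direction is short; going the other way (wrapping a start_response app, including write() support) takes more care.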


The following things became pretty clear when playing around with
various specifications on Python 3:

-  Python 3 no longer implicitly converts between unicode and byte
   strings.  This covers comparisons, the regular expression engine,
   all string functions and many modules in the stdlib.

-  The Python 3 stdlib radically moved to unicode for non unicode things
   as well (the http servers, http clients, url handling etc.)

-  A byte only version of WSGI appears unrealistic on Python 3 because
   it would require server and middleware implementors to reimplement
   parts of the standard library to work on bytes again.

-  unicode support can be added for WSGI on both Python 2.x and Python
   3.x without removing functionality.  Browsers are already doing
   a similar encoding trick as proposed by Graham Dumpleton to handle
   URLs.

-  Python 2.x already accepts unicode strings for many things such as
   URL handling thanks to the fact that unicode and byte strings are
   surprisingly interchangeable.

-  cgi.FieldStorage and some other parts are now totally broken on
   Python 3 and should no longer be used in 3.0 and 3.1 because it
   reads the request body into memory.  This currently affects
   WebOb, Pylons and TurboGears.
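
The first point in the list above can be seen in a few lines of Python 3. This is only an illustrative sketch; the helper name is made up.

```python
import re


def raises_type_error(operation):
    """Return True if calling the operation raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


# Comparing str with bytes never matches, even for pure ASCII content.
same = ('abc' == b'abc')

# Mixing the two types in common operations fails outright.
concat_fails = raises_type_error(lambda: 'abc' + b'def')
regex_fails = raises_type_error(lambda: re.match(b'a', 'abc'))
split_fails = raises_type_error(lambda: 'a,b'.split(b','))

# The conversion has to be spelled out explicitly.
converted = b'abc'.decode('ascii')
```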


I sent this mail to every major framework / WSGI implementor so that we
get input even if you're missing the discussion on web-sig.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread P.J. Eby

At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote:

Hello everybody,

Thanks to Graham Dumpleton and Robert Brewer there is some serious
progress on WSGI currently.  I proposed a roadmap with some PEP changes
now that need some input.

Summary:

  WSGI 1.0   stays the same as PEP 0333 currently is
  WSGI 1.1   becomes what Ian and I added to PEP 0333
  WSGI 2.0   becomes a unicode powered version of WSGI 1.1
  WSGI 3.0   becomes WSGI 2.0 just without start_response


Since there's already a well-established notion of WSGI 2.0 being the 
new calling convention, I would suggest (to avoid confusion) renaming 
your 2.0 to 1.2 or 1.5 or something instead.




  WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python
  3 because of changes in the standard library that no longer work with
  a byte-only approach.


This is unfortunate, but it should probably be considered a 
bellwether for Python 3 porting in general, alas.  The Python 3 
stdlib *should* work with bytes, and the fact that it does not should 
be treated as a bug in the stdlib rather than something to be worked 
around in WSGI.




Graham wrote down two questions he wants every major framework developer
to answer.  These should guide the way to new WSGI standards:

1. Do we keep bytes everywhere forever in Python 2.X, or try to
   introduce unicode there at all to at least mirror what changes might
   be made to make WSGI workable in Python 3.X?


Technically, we are not using bytes but native strings, i.e. type 
'str'.  What benefit would introducing unicode produce?




2. Do we skip WSGI 1.X completely for Python 3.X and go straight to
   WSGI 2.0 for Python 3.X?


This discussion has been going on for so long that I've already 
forgotten what the problem was with just using the original 1.0 spec 
for 3.X, i.e., using native strings for everything, using latin-1 
encoding.  The only things I can recall off the top of my head are 
that the input stream would still be bytes, and that the environment 
might've used a different encoding.


I don't know if such an approach should actually be *recommended*, 
but having a migration path for WSGI 1.0 -> Python 3.X sounds like a 
good idea, if it can be done strictly as errata/clarification of the 
existing spec.  Otherwise, might as well forget the whole thing and 
go straight to the latest and greatest (i.e. what has previously been 
called 2.0 and you're calling 3.0.)




I added a new question I think should be asked too:

3. Do we skip WSGI 2.0 as specified in the PEP and go straight to
   WSGI 3.0 and drop start_response?


I suggest skipping straight to the latest and greatest with no 
in-betweens at all, other than errata/clarifications on 1.0.  Having 
lots of variations of a standard is a bug, not a feature!





The following things became pretty clear when playing around with
various specifications on Python 3:

-  Python 3 no longer implicitly converts between unicode and byte
   strings.  This covers comparisons, the regular expression engine,
   all string functions and many modules in the stdlib.
-  The Python 3 stdlib radically moved to unicode for non unicode things
   as well (the http servers, http clients, url handling etc.)

-  A byte only version of WSGI appears unrealistic on Python 3 because
   it would require server and middleware implementors to reimplement
   parts of the standard library to work on bytes again.


IMO, this strongly suggests that it's the stdlib or Python 3 that's 
broken here.  How much of the stdlib are we talking about needing to 
reimplement, aside from cgi.FieldStorage?




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Armin Ronacher
Hi,

P.J. Eby schrieb:
 This discussion has been going on for so long that I've already 
 forgotten what the problem was with just using the original 1.0 spec 
 for 3.X, i.e., using native strings for everything, using latin-1 
 encoding.  The only things I can recall off the top of my head are 
 that the input stream would still be bytes, and that the environment 
 might've used a different encoding.
Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and
many more technologies are based on unicode, even in Python 2.x.  They
are currently doing decoding of byte data internally.

In Python 2.x if we stick to native strings for WSGI 2.0 / 1.5 whatever
we suddenly have different code paths for Python 3 and Python 2.
Because in Python 3 we suddenly already have unicode data.

You're assuming a situation where the application in Python 2.x was byte
based, but in the majority of cases this is never the situation.

 IMO, this strongly suggests that it's the stdlib or Python 3 that's 
 broken here.  How much of the stdlib are we talking about needing to 
 reimplement, aside from cgi.FieldStorage?
I'm already creating a patch for urllib which currently requires
unicode.  I'm not sure about what to do with cgi.FieldStorage, in
general I would not recommend using the cgi module for WSGI applications
at all!  If we would go with bytes for the WSGI 1.0 spec on Python 3 a
WSGI server also has to decode that data from the Server again.

Also (something I haven't yet filed as a bug because I guess there will
be more changes involved) the HTTP server in Python 3.1 does not support
non-ASCII headers.


Regards,
Armin


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread P.J. Eby

At 04:50 PM 9/20/2009 +0200, Armin Ronacher wrote:

Django, Pylons, SQLAlchemy, Mako, Jinja2, Genshi, Werkzeug, WebOb and
many more technologies are based on unicode, even in Python 2.x.  They
are currently doing decoding of byte data internally.

In Python 2.x if we stick to native strings for WSGI 2.0 / 1.5 whatever
we suddenly have different code paths for Python 3 and Python 2.
Because in Python 3 we suddenly already have unicode data.


No, you'd have bytes stored in a latin-1 string, which is not quite 
the same thing as already [having] unicode data.  You have to 
.encode('latin1').decode(targetencoding) if you want genuine unicode data.
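
A short sketch of that round trip on Python 3. The helper name and the sample value are hypothetical, and UTF-8 is only an assumed target encoding.

```python
def recode(native, target_encoding='utf-8'):
    """Turn a latin-1 'bytes in a str' value into genuine unicode text."""
    return native.encode('latin-1').decode(target_encoding)


# What the client sent on the wire: UTF-8 bytes.
wire_bytes = 'caf\u00e9'.encode('utf-8')

# What a latin-1 based environ would hold: one character per byte.
native = wire_bytes.decode('latin-1')

# What the application actually wants.
text = recode(native)
```

The latin-1 step never fails, because every byte value maps to a code point; only the final decode can raise, and that happens in application code, which knows what encoding it expects.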


If you're saying that people's code would have to change when they go 
to Python 3 (i.e., adding the extra .encode()), I think that's 
already a given for *any* non-trivial code, not just WSGI.




 IMO, this strongly suggests that it's the stdlib or Python 3 that's
 broken here.  How much of the stdlib are we talking about needing to
 reimplement, aside from cgi.FieldStorage?
I'm already creating a patch for urllib which currently requires
unicode.  I'm not sure about what to do with cgi.FieldStorage, in
general I would not recommend using the cgi module for WSGI applications
at all!


But people do, in fact, use it for WSGI on 2.x, so if having 
different code paths is a problem, certainly dropping the cgi 
module is at least as big of a problem, if not considerably more so.


I think one of the reasons that the current (and ongoing) PEP 
discussions have been foundering is that there isn't a clear 
delineation of goals at the high level, and rather just a bunch of 
tradeoff discussions, absent any criteria by which to make the tradeoffs.


To me, I'd rather see people port to a new WSGI spec (with a new 
calling convention) on Python 2, and only *then* transition to Python 
3.  If we do that well, then the intermediate pain disappears -- as 
does the pain and complexity of trying to make a bastardized 
in-between specification.  ;-)


Truth be told, we can probably do that new spec *faster* if we don't 
have to worry too much about backward compatibility, and just design 
it for the way things are now, instead of worrying about the 
past.  Even if we have to do some odd things inside a 2-to-1 
converter, there should ideally only have to be a handful of such 
converters ever written.




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Georg Brandl
P.J. Eby schrieb:

-  Python 3 no longer implicitly converts between unicode and byte
strings.  This covers comparisons, the regular expression engine,
all string functions and many modules in the stdlib.
-  The Python 3 stdlib radically moved to unicode for non unicode things
as well (the http servers, http clients, url handling etc.)

-  A byte only version of WSGI appears unrealistic on Python 3 because
it would require server and middleware implementors to reimplement
parts of the standard library to work on bytes again.
 
 IMO, this strongly suggests that it's the stdlib or Python 3 that's 
 broken here.  How much of the stdlib are we talking about needing to 
 reimplement, aside from cgi.FieldStorage?

FWIW, it's very much possible that the py3k stdlib is broken there.  Many
modules were ported with the aim of getting the tests running again, without
too much thought about bytes/unicode issues.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Graham Dumpleton
2009/9/21 Armin Ronacher armin.ronac...@active-4.com:
 IMO, this strongly suggests that it's the stdlib or Python 3 that's
 broken here.  How much of the stdlib are we talking about needing to
 reimplement, aside from cgi.FieldStorage?
 I'm already creating a patch for urllib which currently requires
 unicode.  I'm not sure about what to do with cgi.FieldStorage, in
 general I would not recommend using the cgi module for WSGI applications
 at all!  If we would go with bytes for the WSGI 1.0 spec on Python 3 a
 WSGI server also has to decode that data from the Server again.

 Also (something I haven't yet filed as a bug because I guess there will
 be more changes involved) the HTTP server in Python 3.1 does not support
 non-ASCII headers.

Read the following first:

   http://bugs.python.org/issue4953
   http://bugs.python.org/issue4661

They're the ones I know about that affect cgi.FieldStorage.

Graham


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Robert Brewer
Armin Ronacher wrote:
 Thanks to Graham Dumpleton and Robert Brewer there is some serious
 progress on WSGI currently.  I proposed a roadmap with some PEP changes
 now that need some input.
 
 Summary:
 
   WSGI 1.0   stays the same as PEP 0333 currently is
   WSGI 1.1   becomes what Ian and I added to PEP 0333
   WSGI 2.0   becomes a unicode powered version of WSGI 1.1
   WSGI 3.0   becomes WSGI 2.0 just without start_response
 
   WSGI 1.0 and 1.1 are byte based and nearly impossible to use on Python
   3 because of changes in the standard library that no longer work with
   a byte-only approach.
 
 
 The PEPs themselves are here: http://bitbucket.org/ianb/wsgi-peps/
 Neither the wording nor the changes in there are anywhere near final.
 
 
 Graham wrote down two questions he wants every major framework developer
 to answer.  These should guide the way to new WSGI standards:
 
 1. Do we keep bytes everywhere forever in Python 2.X, or try to
    introduce unicode there at all to at least mirror what changes might
    be made to make WSGI workable in Python 3.X?

I'm happy either way, since CherryPy abstracts it all away. Decide
already and I'll implement it.

 2. Do we skip WSGI 1.X completely for Python 3.X and go straight to
WSGI 2.0 for Python 3.X?

+1 for skipping straight to unicode in Python 3. But call it 1.1 not
2.0.

 I added a new question I think should be asked too:
 
 3. Do we skip WSGI 2.0 as specified in the PEP and go straight to
WSGI 3.0 and drop start_response?

No. We need more time to discuss and try to implement the large
architectural changes in that. I need to ship CP 3.2 soon and would like
it to have a better Python 3 story than the bytes-everywhere (or
unicode pretending to be bytes) of WSGI 1.0. We have working code,
which uses unicode in Python 3. Maybe I'll call it wsgi.version = (1,
'cp32') and let the spec come later if we can't see the trees for the
forest.


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Robert Brewer
P.J. Eby wrote:
 At 03:06 PM 9/20/2009 +0200, Armin Ronacher wrote: 
 The following things became pretty clear when playing around with
 various specifications on Python 3:
 
 -  Python 3 no longer implicitly converts between unicode and byte
    strings.  This covers comparisons, the regular expression engine,
    all string functions and many modules in the stdlib.
 -  The Python 3 stdlib radically moved to unicode for non unicode things
    as well (the http servers, http clients, url handling etc.)
 
 -  A byte only version of WSGI appears unrealistic on Python 3 because
    it would require server and middleware implementors to reimplement
    parts of the standard library to work on bytes again.
 
 IMO, this strongly suggests that it's the stdlib or Python 3 that's
 broken here.  How much of the stdlib are we talking about needing to
 reimplement, aside from cgi.FieldStorage?

urllib.unquote, for one. We had to make a version which accepts bytes
(and outputs bytes). But it's only 8 lines of code.
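
For a sense of scale, a bytes-in, bytes-out unquote can be sketched as below. This is not the CherryPy code, only an illustration under my own assumptions; the function name is made up. Current Python 3 releases also ship urllib.parse.unquote_to_bytes, which covers the same ground.

```python
import re

# One percent sign followed by exactly two hex digits.
_ESCAPE = re.compile(br'%([0-9A-Fa-f]{2})')


def unquote_bytes(value):
    """Replace %XX escapes in a bytes object with the byte they encode.

    Malformed escapes such as b'%zz' or a trailing b'%' are left untouched.
    """
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), value)
```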


Robert Brewer
fuman...@aminus.org



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Chris McDonough

I'll try to digest some of this, currently I'm pretty clueless.

Personally, I find it a bit hard to get excited about Python 3 as a web 
application deployment platform.  This is of course a personal judgment (I 
don't mean to slight Python 3) but at this point, I think I'll probably be 
writing software that targets 2.X exclusively for at least the next five years.


Given this point of view, it would be extremely helpful if someone could 
explain to people with the same outlook why we should want to deal with Unicode 
strings in any WSGI specification.


WSGI is a fairly low-level protocol aimed at folks who need to interface a 
server to the outside world.  The outside world (by its nature) talks bytes.  I 
fear that any implied conversion of environment values and iterable return 
values to Unicode will actually eventually make things harder than they are 
now.  I realize that it would make middleware implementors lives harder to need 
to deal in bytes.  However, at this point, I also believe that middleware kinda 
should be hard.  We have way too much middleware that shouldn't be middleware 
these days (some written by myself).


Anyway, for us slower (and maybe wrongly fearful) folks, could someone 
summarize the benefits of having a WSGI specification that requires Unicode? 
Bonus points for an explanation that does not boil down to "it will be 
compatible with Python 3".


- C




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Massimo Di Pierro

+1

On Sep 20, 2009, at 11:25 PM, Chris McDonough wrote:


I'll try to digest some of this, currently I'm pretty clueless.

Personally, I find it a bit hard to get excited about Python 3 as a web
application deployment platform.  This is of course a personal judgment (I
don't mean to slight Python 3) but at this point, I think I'll probably be
writing software that targets 2.X exclusively for at least the next five years.

Given this point of view, it would be extremely helpful if someone could
explain to people with the same outlook why we should want to deal with Unicode
strings in any WSGI specification.

WSGI is a fairly low-level protocol aimed at folks who need to interface a
server to the outside world.  The outside world (by its nature) talks bytes.  I
fear that any implied conversion of environment values and iterable return
values to Unicode will actually eventually make things harder than they are
now.  I realize that it would make middleware implementors lives harder to need
to deal in bytes.  However, at this point, I also believe that middleware kinda
should be hard.  We have way too much middleware that shouldn't be middleware
these days (some written by myself).

Anyway, for us slower (and maybe wrongly fearful) folks, could someone
summarize the benefits of having a WSGI specification that requires Unicode?
Bonus points for an explanation that does not boil down to "it will be
compatible with Python 3".

- C




Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-20 Thread Armin Ronacher
Hi,

Chris McDonough schrieb:
 Personally, I find it a bit hard to get excited about Python 3 as a web 
 application deployment platform.
Everybody feels that way currently.  But if we don't fix WSGI that will
never change.

 Given this point of view, it would be extremely helpful if someone could 
 explain to people with the same outlook why we should want to deal with 
 Unicode 
 strings in any WSGI specification.
I summarized the reasons in my mail.  Also have a look at the
discussions on this mailing list that led to that.


Regards,
Armin