[Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Graham Dumpleton
Has anyone had any thoughts about how WSGI is going to be made to work
with Python 3?

From what I understand about changes in Python 3, the main issue seems
to be the removal of the string type in its current form.

This is an issue as the WSGI specification currently states that status,
header names/values and the items returned by the iterable must all be
string instances. This ensures that the application, which is where the
encoding is known, has already done any conversion from Unicode before
the data is passed to the WSGI adapter.

In Python 3 the default for string type objects will effectively be
Unicode. Is WSGI going to be made to somehow cope with that, or will
applications be required to return byte string objects instead?

We can never seem to get enough momentum going for WSGI 2.0, but with
Python 3 coming along we may not have a choice but to come up with a
revised version of the specification if we want WSGI to continue through
to Python 3.

Comments?

Graham
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Phillip J. Eby
At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
>Has anyone had any thoughts about how WSGI is going to be made to work
>with Python 3?
>
>From what I understand about changes in Python 3, the main issue seems
>to be the removal of the string type in its current form.
>
>This is an issue as WSGI specification currently states that status,
>header names/values and the items returned by the iterable must all be
>string instances. This is done to ensure that the application has done
>any conversions from Unicode, where knowledge about encoding would be
>known, before being passed to WSGI adapter.
>
>In Python 3 the default for string type objects will effectively be
>Unicode. Is WSGI going to be made to somehow cope with that, or will
>application instead be required to return byte string objects instead?

WSGI already copes, actually.  Note that Jython and IronPython have 
this issue today, and see:

http://www.python.org/dev/peps/pep-0333/#unicode-issues

"""On Python platforms where the str or StringType type is in fact 
Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all 
"strings" referred to in this specification must contain only code 
points representable in ISO-8859-1 encoding (\u0000 through \u00FF, 
inclusive). It is a fatal error for an application to supply strings 
containing any other Unicode character or code point. Similarly, 
servers and gateways must not supply strings to an application 
containing any other Unicode characters."""
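
A rough sketch of what that restriction amounts to on a Unicode-based
str type (Python 3 syntax; is_wsgi_safe is a made-up helper name for
illustration, not something the PEP defines):

```python
def is_wsgi_safe(text):
    """Return True if every code point in text lies in U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)


print(is_wsgi_safe("caf\u00e9"))  # True: U+00E9 is representable in ISO-8859-1
print(is_wsgi_safe("\u1234"))     # False: U+1234 is outside the allowed range
```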




Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Guido van Rossum
On Dec 6, 2007 4:15 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
> >Has anyone had any thoughts about how WSGI is going to be made to work
> >with Python 3?
> >
> >From what I understand about changes in Python 3, the main issue seems
> >to be the removal of the string type in its current form.
> >
> >This is an issue as WSGI specification currently states that status,
> >header names/values and the items returned by the iterable must all be
> >string instances. This is done to ensure that the application has done
> >any conversions from Unicode, where knowledge about encoding would be
> >known, before being passed to WSGI adapter.
> >
> >In Python 3 the default for string type objects will effectively be
> >Unicode. Is WSGI going to be made to somehow cope with that, or will
> >application instead be required to return byte string objects instead?
>
> WSGI already copes, actually.  Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
> """On Python platforms where the str or StringType type is in fact
> Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
> "strings" referred to in this specification must contain only code
> points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
> inclusive). It is a fatal error for an application to supply strings
> containing any other Unicode character or code point. Similarly,
> servers and gateways must not supply strings to an application
> containing any other Unicode characters."""

That may work for IronPython/Jython, where encoded data is represented
by the str type, but it won't be sufficient for Py3k, where encoded
data is represented using the bytes type. IOW, in IronPython/Jython,
u"\u1234".encode('utf-8') returns a str instance: '\xe1\x88\xb4'; but
in Py3k, it returns a bytes instance: b'\xe1\x88\xb4'.
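
The Py3k half of that comparison can be checked directly. A minimal
sketch in Python 3 syntax (the variable name is only for illustration):

```python
# In Python 3, encoding a str always produces a bytes object.
encoded = "\u1234".encode("utf-8")

print(type(encoded).__name__)    # bytes
print(encoded)                   # b'\xe1\x88\xb4'
print(isinstance(encoded, str))  # False: an API that insists on str rejects it
```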

The issue applies to input as well as output -- data read from a
socket is also represented as bytes, unless you're using makefile()
with a text mode and an encoding.
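
As a hedged illustration of the input side, the sketch below sends the
same UTF-8 payload over a local socket pair twice and reads it back
once in binary mode and once in text mode. The read_back helper and
the payload are assumptions made up for this example:

```python
import socket

PAYLOAD = "caf\u00e9\n".encode("utf-8")


def read_back(mode, **kwargs):
    """Send PAYLOAD over a socket pair and read it back via makefile()."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(PAYLOAD)
        sender.close()  # closing the sending end signals EOF to the reader
        with receiver.makefile(mode, **kwargs) as stream:
            return stream.read()
    finally:
        sender.close()
        receiver.close()


raw = read_back("rb")                     # bytes, exactly as sent
text = read_back("r", encoding="utf-8")   # str, decoded by the file wrapper
print(type(raw).__name__, type(text).__name__)
```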

You might want to look at how the unittests for wsgiref manage to pass
in Py3k though. ;-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread James Y Knight

On Dec 6, 2007, at 7:15 PM, Phillip J. Eby wrote:
> WSGI already copes, actually.  Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
> """On Python platforms where the str or StringType type is in fact
> Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
> "strings" referred to in this specification must contain only code
> points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
> inclusive). It is a fatal error for an application to supply strings
> containing any other Unicode character or code point. Similarly,
> servers and gateways must not supply strings to an application
> containing any other Unicode characters."""

It would seem very odd, however, for WSGI/python3 to use
strings-restricted-to-0xFF for network I/O while everywhere else in python3 is
going to use bytes for the same purpose. You'd have to modify your app
to call write(unicodetext.encode('utf-8').decode('latin-1')) or so.
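
Spelled out in Python 3 syntax, that round trip would look roughly like
this (variable names are illustrative only; the final encode stands in
for what a server would presumably do before writing to the socket):

```python
text = "\u1234"

# Application side: encode to UTF-8, then disguise the bytes as a str
# whose code points are all <= 0xFF.
smuggled = text.encode("utf-8").decode("latin-1")
print(len(smuggled))                           # 3: one code point per UTF-8 byte
print(all(ord(ch) <= 0xFF for ch in smuggled))  # True

# Server side (assumed): turn the restricted str back into bytes.
wire = smuggled.encode("latin-1")
print(wire == text.encode("utf-8"))            # True: the UTF-8 bytes survive
```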

James


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Phillip J. Eby
At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:

>On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > In Python 3 the default for string type objects will effectively be
> > Unicode. Is WSGI going to be made to somehow cope with that, or will
> > application instead be required to return byte string objects instead?
>
>I'd say it would be best to only accept `bytes` objects; anything else
>would require some guesswork. Maybe, at most, it could try to encode
>returned Unicode objects as ISO-8859-1, and have it be an error if
>that's not possible.

Actually, I'd prefer to look at it the other way around: a Python 3 
WSGI server or middleware *may* accept bytes objects instead of str.

This is relatively easy for the response side of things, but the 
request side is rather more difficult, since wsgi.input may need to 
be binary rather than text mode.  (I think we can reasonably assume 
that wsgi.errors is a text mode stream, and should support a 
reasonable encoding.)
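
For the response side, one possible shape of such a server is sketched
below. The to_wire name is hypothetical, and treating str as ISO-8859-1
is an assumption carried over from the PEP text quoted earlier in the
thread, not a settled design:

```python
def to_wire(chunk):
    """Convert one response chunk to bytes for the socket.

    bytes pass through untouched; str is accepted only if it fits in
    ISO-8859-1, otherwise encode() raises UnicodeEncodeError.
    """
    if isinstance(chunk, bytes):
        return chunk
    if isinstance(chunk, str):
        return chunk.encode("latin-1")
    raise TypeError("response chunks must be bytes or str, not %s"
                    % type(chunk).__name__)


print(to_wire(b"already bytes"))
print(to_wire("caf\u00e9"))  # b'caf\xe9'
```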



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Phillip J. Eby
At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
>You might want to look at how the unittests for wsgiref manage to pass
>in Py3k though. ;-)

Unless they've been changed, I'd assume it's because they work with 
strings exclusively, and never do any encoding or decoding (which is 
outside WSGI's scope, at least in the current version).



Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Adam Atlas

On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> In Python 3 the default for string type objects will effectively be
> Unicode. Is WSGI going to be made to somehow cope with that, or will
> application instead be required to return byte string objects instead?

I'd say it would be best to only accept `bytes` objects; anything else  
would require some guesswork. Maybe, at most, it could try to encode  
returned Unicode objects as ISO-8859-1, and have it be an error if  
that's not possible.

I was going to say that the gateway could accept Unicode objects if  
the user-agent sent a comprehensible Accept-Charset header, and  
thereby encode application output to the client's preferred character  
set on the fly (or to ISO-8859-1 if no Accept-Charset is provided),  
but that would complicate things for people writing gateways (and  
would be too implicit). It could be useful, but it would make more  
sense as a simple decorator for (almost-)WSGI applications. Perhaps it  
could go in wsgiref.
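
A very rough sketch of what such a decorator might look like. Nothing
like this exists in wsgiref; pick_charset and encode_output are made-up
names, q-values are ignored, and the Content-Type header is left
untouched, so treat it as an outline of the idea only:

```python
import codecs


def pick_charset(accept_charset, default="iso-8859-1"):
    """Return the first charset in the header that Python has a codec for."""
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip()  # drop any ";q=..." parameter
        if not name or name == "*":
            continue
        try:
            codecs.lookup(name)
        except LookupError:
            continue
        return name.lower()
    return default


def encode_output(app):
    """Wrap an (almost-)WSGI app that yields str so that it yields bytes."""
    def wrapper(environ, start_response):
        charset = pick_charset(environ.get("HTTP_ACCEPT_CHARSET", ""))
        for chunk in app(environ, start_response):
            # str chunks are encoded; bytes chunks pass through unchanged.
            yield chunk.encode(charset) if isinstance(chunk, str) else chunk
    return wrapper


@encode_output
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return ["\u00e9"]


def fake_start_response(status, headers):
    pass


print(list(demo_app({"HTTP_ACCEPT_CHARSET": "utf-8"}, fake_start_response)))
print(list(demo_app({}, fake_start_response)))
```

A real version would also have to decide what to do when the chosen
charset cannot represent a character, which is part of why this feels
too implicit for a gateway.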


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread James Bennett
On Dec 6, 2007 6:15 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> WSGI already copes, actually.  Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues

I'm glad you brought that up, because it's been bugging me lately.

That section is somewhat ambiguous as-is, because in one sentence
applications are permitted to return strings encoded in a charset
other than ISO-8859-1, but in another they are unequivocally forbidden
to do so (with the "must not" in bold, even). And that's problematic
not only because of the ambiguity, but because the increasing
popularity of "AJAX" and web-based APIs is making it much more common
for WSGI applications to generate responses of types which do not
default to ISO-8859-1 -- e.g., XML and JSON, both of which default to
UTF-8.

Depending on how draconian one wishes to be when reading the relevant
section of WSGI, it's possible to conclude that XML and JSON must
always be transcoded/escaped to ISO-8859-1 -- with all the headaches
that entails -- before being passed to a WSGI-compliant piece of
software.
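
To illustrate the two options for JSON, here is a small sketch using
the standard json module (the payload is invented for the example).
Escaping keeps the text within ASCII, and therefore within ISO-8859-1,
while the UTF-8 form yields raw bytes:

```python
import json

payload = {"name": "\u1234"}

# Option 1: escape everything non-ASCII (json.dumps does this by default).
escaped = json.dumps(payload)
print(escaped)  # {"name": "\u1234"}
print(all(ord(ch) < 128 for ch in escaped))  # True

# Option 2: keep the character and encode the document as UTF-8 bytes.
utf8_body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
print(utf8_body)  # b'{"name": "\xe1\x88\xb4"}'

# Both forms decode back to the same data.
print(json.loads(escaped) == json.loads(utf8_body.decode("utf-8")))  # True
```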

And the slightly less strict reading of the spec -- that such
gymnastics are required only when the string type of the Python
implementation is Unicode-based -- will grow increasingly troublesome
as/when Py3K enters production use.

So as long as we're talking about this, could the proscriptions with
respect to encoding perhaps be revisited and (hopefully)
clarified/revised?

-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Guido van Rossum
On Dec 6, 2007 5:45 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
> >You might want to look at how the unittests for wsgiref manage to pass
> >in Py3k though. ;-)
>
> Unless they've been changed, I'd assume it's because they work with
> strings exclusively, and never do any encoding or decoding (which is
> outside WSGI's scope, at least in the current version).

Indeed, that seems mostly to be the case. But this means that any
application that wants to emit characters outside Latin-1 cannot just
encode() those characters, since the encode() output will be bytes and
those will not be accepted by the WSGI API. OTOH sending non-Latin-1
characters without encoding would violate the standard. So something
needs to give...

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Ian Bicking
Phillip J. Eby wrote:
> At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> 
>> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
>>> In Python 3 the default for string type objects will effectively be
>>> Unicode. Is WSGI going to be made to somehow cope with that, or will
>>> application instead be required to return byte string objects instead?
>> I'd say it would be best to only accept `bytes` objects; anything else
>> would require some guesswork. Maybe, at most, it could try to encode
>> returned Unicode objects as ISO-8859-1, and have it be an error if
>> that's not possible.
> 
> Actually, I'd prefer to look at it the other way around: a Python 3 
> WSGI server or middleware *may* accept bytes objects instead of str.
> 
> This is relatively easy for the response side of things, but the 
> request side is rather more difficult, since wsgi.input may need to 
> be binary rather than text mode.  (I think we can reasonably assume 
> that wsgi.errors is a text mode stream, and should support a 
> reasonable encoding.)

wsgi.input definitely seems like it should be bytes to me.  Unless we 
want to put the encoding process into the server.  Not entirely 
infeasible, but a bit of a strain.  And the request body might very well 
be binary, e.g., on a PUT.
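
A minimal sketch of what a bytes-based wsgi.input would look like from
the application's side. The environ dict is hand-built for the example,
io.BytesIO stands in for whatever stream a real server would supply,
and read_body is a hypothetical helper:

```python
import io


def read_body(environ):
    """Read the request body as bytes, honouring CONTENT_LENGTH."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    return environ["wsgi.input"].read(length)


# A PUT whose body is binary data, not text in any encoding.
body = b"\x89PNG\r\n\x1a\n"
environ = {
    "REQUEST_METHOD": "PUT",
    "CONTENT_LENGTH": str(len(body)),
    "wsgi.input": io.BytesIO(body),
}

data = read_body(environ)
print(type(data).__name__, len(data))  # bytes 8
```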

The CGI keys in the environment don't feel at all like bytes to me, but 
then they aren't unicode either.  They can be unicode, again given a bit 
of work on the server side.  Though unfortunately browsers are very poor 
at indicating their encoding for requests, and it ends up being policy 
and configuration as much as anything that determines the encoding of 
stuff like wsgi.input.  I believe all request paths are UTF8 (?), but 
I'm not sure about QUERY_STRING.  I'm a little fuzzy on some of the 
details there.
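
The policy question can be seen with a percent-encoded path. In this
sketch the UTF-8 reading is only an assumption about what the client
meant; the same bytes decode without error under Latin-1 and give a
different string:

```python
from urllib.parse import unquote

raw_path = "/caf%C3%A9"

as_utf8 = unquote(raw_path, encoding="utf-8")
as_latin1 = unquote(raw_path, encoding="latin-1")

print(len(as_utf8))          # 5: "/caf" plus U+00E9
print(len(as_latin1))        # 6: "/caf" plus U+00C3 and U+00A9
print(as_utf8 == as_latin1)  # False
```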

The actual response body should also be bytes.  Unless again we want to 
introduce upstream encoding.

This does make everything feel more complicated.

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-06 Thread Guido van Rossum
On Dec 6, 2007 8:00 PM, Ian Bicking <[EMAIL PROTECTED]> wrote:
> Phillip J. Eby wrote:
> > At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> >
> >> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> >>> In Python 3 the default for string type objects will effectively be
> >>> Unicode. Is WSGI going to be made to somehow cope with that, or will
> >>> application instead be required to return byte string objects instead?
> >> I'd say it would be best to only accept `bytes` objects; anything else
> >> would require some guesswork. Maybe, at most, it could try to encode
> >> returned Unicode objects as ISO-8859-1, and have it be an error if
> >> that's not possible.
> >
> > Actually, I'd prefer to look at it the other way around: a Python 3
> > WSGI server or middleware *may* accept bytes objects instead of str.
> >
> > This is relatively easy for the response side of things, but the
> > request side is rather more difficult, since wsgi.input may need to
> > be binary rather than text mode.  (I think we can reasonably assume
> > that wsgi.errors is a text mode stream, and should support a
> > reasonable encoding.)
>
> wsgi.input definitely seems like it should be bytes to me.  Unless we
> want to put the encoding process into the server.  Not entirely
> infeasible, but a bit of a strain.  And the request body might very well
> be binary, e.g., on a PUT.
>
> The CGI keys in the environment don't feel at all like bytes to me, but
> then they aren't unicode either.  They can be unicode, again given a bit
> of work on the server side.  Though unfortunately browsers are very poor
> at indicating their encoding for requests, and it ends up being policy
> and configuration as much as anything that determines the encoding of
> stuff like wsgi.input.  I believe all request paths are UTF8 (?), but
> I'm not sure about QUERY_STRING.  I'm a little fuzzy on some of the
> details there.
>
> The actual response body should also be bytes.  Unless again we want to
> introduce upstream encoding.
>
> This does make everything feel more complicated.

It's the same level of complexity you run into as soon as you want to
handle Unicode with WSGI in 2.x though, as it is caused by something
outside our control (HTTP and browsers).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)