[Web-SIG] WSGI, Python 3 and Unicode
Has anyone had any thoughts about how WSGI is going to be made to work with Python 3?

From what I understand about changes in Python 3, the main issue seems to be the removal of the string type in its current form.

This is an issue as the WSGI specification currently states that status, header names/values and the items returned by the iterable must all be string instances. This is done to ensure that the application has done any conversions from Unicode, where knowledge about encoding would be known, before being passed to the WSGI adapter.

In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will applications be required to return byte string objects instead?

We can never seem to get enough momentum going for WSGI 2.0, but with Python 3 coming along we may not have a choice but to come up with a revised version of the specification if we want WSGI to continue through to Python 3.

Comments?

Graham
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] WSGI, Python 3 and Unicode
At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
>Has anyone had any thoughts about how WSGI is going to be made to work
>with Python 3?
>
>From what I understand about changes in Python 3, the main issue seems
>to be the removal of string type in its current form.
>
>This is an issue as WSGI specification currently states that status,
>header names/values and the items returned by the iterable must all be
>string instances. This is done to ensure that the application has done
>any conversions from Unicode, where knowledge about encoding would be
>known, before being passed to WSGI adapter.
>
>In Python 3 the default for string type objects will effectively be
>Unicode. Is WSGI going to be made to somehow cope with that, or will
>application instead be required to return byte string objects instead?

WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:

http://www.python.org/dev/peps/pep-0333/#unicode-issues

"""On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters."""
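[Illustration] The restriction quoted above can be sketched as a small check on a Unicode-based str type. This is only a sketch assuming Python 3 semantics; the helper name is_wsgi_native_string is hypothetical and is not part of PEP 333 or wsgiref.

```python
def is_wsgi_native_string(value):
    """Return True if value is a str whose code points all fit in ISO-8859-1.

    Hypothetical helper for illustration; not defined by PEP 333.
    """
    if not isinstance(value, str):
        return False
    try:
        # Encoding succeeds only for code points U+0000 through U+00FF.
        value.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


print(is_wsgi_native_string("text/html; charset=utf-8"))  # plain ASCII header value
print(is_wsgi_native_string("caf\u00e9"))                 # U+00E9 is inside the range
print(is_wsgi_native_string("\u1234"))                    # U+1234 is outside the range
```

A server or validating middleware could apply a check like this to the status, header names/values and body items before writing them out.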
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 4:15 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
> >Has anyone had any thoughts about how WSGI is going to be made to work
> >with Python 3?
> >
> >From what I understand about changes in Python 3, the main issue seems
> >to be the removal of string type in its current form.
> >
> >This is an issue as WSGI specification currently states that status,
> >header names/values and the items returned by the iterable must all be
> >string instances. This is done to ensure that the application has done
> >any conversions from Unicode, where knowledge about encoding would be
> >known, before being passed to WSGI adapter.
> >
> >In Python 3 the default for string type objects will effectively be
> >Unicode. Is WSGI going to be made to somehow cope with that, or will
> >application instead be required to return byte string objects instead?
>
> WSGI already copes, actually. Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
> """On Python platforms where the str or StringType type is in fact
> Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
> "strings" referred to in this specification must contain only code
> points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
> inclusive). It is a fatal error for an application to supply strings
> containing any other Unicode character or code point. Similarly,
> servers and gateways must not supply strings to an application
> containing any other Unicode characters."""

That may work for IronPython/Jython, where encoded data is represented by the str type, but it won't be sufficient for Py3k, where encoded data is represented using the bytes type. IOW, in IronPython/Jython, u"\u1234".encode('utf-8') returns a str instance: '\xe1\x88\xb4'; but in Py3k, it returns a bytes instance: b'\xe1\x88\xb4'.

The issue applies to input as well as output -- data read from a socket is also represented as bytes, unless you're using makefile() with a text mode and an encoding.

You might want to look at how the unittests for wsgiref manage to pass in Py3k though. ;-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
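[Illustration] The Py3k behaviour described above can be reproduced in a few lines. This is a sketch assuming a released Python 3 interpreter; the local socket pair merely stands in for a network connection.

```python
import socket

# Encoding a str in Python 3 yields bytes, not str.
encoded = "\u1234".encode("utf-8")
print(type(encoded).__name__, encoded)

# Data read back from a socket is bytes as well.
left, right = socket.socketpair()
try:
    left.sendall(encoded)
    raw = right.recv(16)
finally:
    left.close()
    right.close()

print(type(raw).__name__)
# Turning the received bytes into text needs an explicit decode step.
print(raw.decode("utf-8") == "\u1234")
```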
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007, at 7:15 PM, Phillip J. Eby wrote:
> WSGI already copes, actually. Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
> """On Python platforms where the str or StringType type is in fact
> Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
> "strings" referred to in this specification must contain only code
> points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
> inclusive). It is a fatal error for an application to supply strings
> containing any other Unicode character or code point. Similarly,
> servers and gateways must not supply strings to an application
> containing any other Unicode characters."""

It would seem very odd, however, for WSGI/python3 to use strings-restricted-to-0xFF for network I/O while everywhere else in python3 is going to use bytes for the same purpose. You'd have to modify your app to call write(unicodetext.encode('utf-8').decode('latin-1')) or so.

James
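[Illustration] The encode/decode round trip mentioned above can be checked directly. A sketch assuming Python 3; the variable names are made up for the example.

```python
text = "\u1234 caf\u00e9"

# Step 1: produce the real wire bytes in the application's chosen charset.
utf8_bytes = text.encode("utf-8")

# Step 2: reinterpret each byte as one Latin-1 character so the result is a
# str that satisfies the "code points up to 0xFF only" rule.
smuggled = utf8_bytes.decode("latin-1")
print(all(ord(ch) <= 0xFF for ch in smuggled))

# A server encoding that str as Latin-1 recovers the original UTF-8 bytes.
print(smuggled.encode("latin-1") == utf8_bytes)
```

The str in the middle is not meaningful text; it is only a carrier for bytes, which is the oddity being pointed out.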
Re: [Web-SIG] WSGI, Python 3 and Unicode
At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
>On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > In Python 3 the default for string type objects will effectively be
> > Unicode. Is WSGI going to be made to somehow cope with that, or will
> > application instead be required to return byte string objects instead?
>
>I'd say it would be best to only accept `bytes` objects; anything else
>would require some guesswork. Maybe, at most, it could try to encode
>returned Unicode objects as ISO-8859-1, and have it be an error if
>that's not possible.

Actually, I'd prefer to look at it the other way around: a Python 3 WSGI server or middleware *may* accept bytes objects instead of str.

This is relatively easy for the response side of things, but the request side is rather more difficult, since wsgi.input may need to be binary rather than text mode. (I think we can reasonably assume that wsgi.errors is a text mode stream, and should support a reasonable encoding.)
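[Illustration] On the response side, a server that accepts either type might normalise each body item as sketched below. This assumes Python 3; to_wire is a hypothetical helper name, not an existing wsgiref function.

```python
def to_wire(chunk):
    """Convert one response body item to bytes for the socket.

    Hypothetical helper: bytes pass through untouched, str is treated as a
    PEP 333 "string" and must fit in ISO-8859-1.
    """
    if isinstance(chunk, bytes):
        return chunk
    if isinstance(chunk, str):
        # Raises UnicodeEncodeError for code points above U+00FF.
        return chunk.encode("iso-8859-1")
    raise TypeError("body items must be str or bytes, got %r" % type(chunk))


print(to_wire(b"\xe1\x88\xb4"))
print(to_wire("caf\u00e9"))
```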
Re: [Web-SIG] WSGI, Python 3 and Unicode
At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
>You might want to look at how the unittests for wsgiref manage to pass
>in Py3k though. ;-)

Unless they've been changed, I'd assume it's because they work with strings exclusively, and never do any encoding or decoding (which is outside WSGI's scope, at least in the current version).
Re: [Web-SIG] WSGI, Python 3 and Unicode
On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> In Python 3 the default for string type objects will effectively be
> Unicode. Is WSGI going to be made to somehow cope with that, or will
> application instead be required to return byte string objects instead?

I'd say it would be best to only accept `bytes` objects; anything else would require some guesswork. Maybe, at most, it could try to encode returned Unicode objects as ISO-8859-1, and have it be an error if that's not possible.

I was going to say that the gateway could accept Unicode objects if the user-agent sent a comprehensible Accept-Charset header, and thereby encode application output to the client's preferred character set on the fly (or to ISO-8859-1 if no Accept-Charset is provided), but that would complicate things for people writing gateways (and would be too implicit). It could be useful, but it would make more sense as a simple decorator for (almost-)WSGI applications. Perhaps it could go in wsgiref.
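[Illustration] A decorator along the lines suggested above might look like the sketch below. It assumes Python 3; the names pick_charset and encode_output are hypothetical and nothing like them exists in wsgiref. The header parsing is deliberately simplified: it takes the first listed charset that Python knows and ignores q-values, and it does not update the Content-Type header.

```python
import codecs


def pick_charset(accept_charset, default="iso-8859-1"):
    """Return the first charset in the header that Python can encode to."""
    for item in accept_charset.split(","):
        name = item.split(";", 1)[0].strip()
        if not name or name == "*":
            continue
        try:
            codecs.lookup(name)
        except LookupError:
            continue
        return name
    return default


def encode_output(app):
    """Wrap an (almost-)WSGI app that yields str so that it yields bytes."""
    def wrapper(environ, start_response):
        charset = pick_charset(environ.get("HTTP_ACCEPT_CHARSET", ""))
        result = app(environ, start_response)
        try:
            for chunk in result:
                # bytes pass through; str is encoded for this client.
                yield chunk.encode(charset) if isinstance(chunk, str) else chunk
        finally:
            if hasattr(result, "close"):
                result.close()
    return wrapper
```

A real version would also need to honour q-values and decide what to do when the text cannot be represented in the chosen charset.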
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 6:15 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> WSGI already copes, actually. Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues

I'm glad you brought that up, because it's been bugging me lately.

That section is somewhat ambiguous as-is, because in one sentence applications are permitted to return strings encoded in a charset other than ISO-8859-1, but in another they are unequivocally forbidden to do so (with the "must not" in bold, even).

And that's problematic not only because of the ambiguity, but because the increasing popularity of "AJAX" and web-based APIs is making it much more common for WSGI applications to generate responses of types which do not default to ISO-8859-1 -- e.g., XML and JSON, both of which default to UTF-8.

Depending on how draconian one wishes to be when reading the relevant section of WSGI, it's possible to conclude that XML and JSON must always be transcoded/escaped to ISO-8859-1 -- with all the headaches that entails -- before being passed to a WSGI-compliant piece of software. And the slightly less strict reading of the spec -- that such gymnastics are required only when the string type of the Python implementation is Unicode-based -- will grow increasingly troublesome as/when Py3K enters production use.

So as long as we're talking about this, could the proscriptions with respect to encoding perhaps be revisited and (hopefully) clarified/revised?

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
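[Illustration] For JSON, the two readings discussed above can be compared with the standard json module. A sketch assuming Python 3; the payload is made up for the example.

```python
import json

payload = {"name": "\u1234"}

# Strict reading: escape everything so the text stays within ISO-8859-1.
# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
escaped = json.dumps(payload)
latin1_body = escaped.encode("iso-8859-1")  # succeeds, output is pure ASCII

# Natural form: keep the character and send UTF-8, JSON's default encoding.
raw = json.dumps(payload, ensure_ascii=False)
utf8_body = raw.encode("utf-8")

try:
    raw.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    # The unescaped text cannot be handed over as a Latin-1-only string.
    fits_latin1 = False

print(latin1_body, utf8_body, fits_latin1)
```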
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 5:45 PM, Phillip J. Eby <[EMAIL PROTECTED]> wrote:
> At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
> >You might want to look at how the unittests for wsgiref manage to pass
> >in Py3k though. ;-)
>
> Unless they've been changed, I'd assume it's because they work with
> strings exclusively, and never do any encoding or decoding (which is
> outside WSGI's scope, at least in the current version).

Indeed, that seems mostly to be the case. But this means that any application that wants to emit characters outside Latin-1 cannot just encode() those characters, since the encode() output will be bytes and those will not be accepted by the WSGI API. OTOH sending non-Latin-1 characters without encoding would violate the standard. So something needs to give...

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Web-SIG] WSGI, Python 3 and Unicode
Phillip J. Eby wrote:
> At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
>
>> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
>>> In Python 3 the default for string type objects will effectively be
>>> Unicode. Is WSGI going to be made to somehow cope with that, or will
>>> application instead be required to return byte string objects instead?
>> I'd say it would be best to only accept `bytes` objects; anything else
>> would require some guesswork. Maybe, at most, it could try to encode
>> returned Unicode objects as ISO-8859-1, and have it be an error if
>> that's not possible.
>
> Actually, I'd prefer to look at it the other way around: a Python 3
> WSGI server or middleware *may* accept bytes objects instead of str.
>
> This is relatively easy for the response side of things, but the
> request side is rather more difficult, since wsgi.input may need to
> be binary rather than text mode. (I think we can reasonably assume
> that wsgi.errors is a text mode stream, and should support a
> reasonable encoding.)

wsgi.input definitely seems like it should be bytes to me. Unless we want to put the encoding process into the server. Not entirely infeasible, but a bit of a strain. And the request body might very well be binary, e.g., on a PUT.

The CGI keys in the environment don't feel at all like bytes to me, but then they aren't unicode either. They can be unicode, again given a bit of work on the server side. Though unfortunately browsers are very poor at indicating their encoding for requests, and it ends up being policy and configuration as much as anything that determines the encoding of stuff like wsgi.input. I believe all request paths are UTF8 (?), but I'm not sure about QUERY_STRING. I'm a little fuzzy on some of the details there.

The actual response body should also be bytes. Unless again we want to introduce upstream encoding.

This does make everything feel more complicated.

--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
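[Illustration] If wsgi.input is a binary stream, decoding becomes an explicit step whose charset comes from policy or configuration, as noted above. A sketch assuming Python 3; read_body_text and its utf-8 default are hypothetical choices for the example, and io.BytesIO stands in for the stream a server would supply.

```python
import io


def read_body_text(environ, charset="utf-8"):
    """Read the request body as bytes, then decode it by configured policy."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length)  # bytes from a binary stream
    return raw.decode(charset)


# A fake request environment carrying a three-byte UTF-8 body.
environ = {
    "REQUEST_METHOD": "PUT",
    "CONTENT_LENGTH": "3",
    "wsgi.input": io.BytesIO(b"\xe1\x88\xb4"),
}
print(read_body_text(environ) == "\u1234")
```

For a genuinely binary upload, the application would skip the decode and keep the bytes as they are.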
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 8:00 PM, Ian Bicking <[EMAIL PROTECTED]> wrote:
> Phillip J. Eby wrote:
> > At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> >
> >> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> >>> In Python 3 the default for string type objects will effectively be
> >>> Unicode. Is WSGI going to be made to somehow cope with that, or will
> >>> application instead be required to return byte string objects instead?
> >> I'd say it would be best to only accept `bytes` objects; anything else
> >> would require some guesswork. Maybe, at most, it could try to encode
> >> returned Unicode objects as ISO-8859-1, and have it be an error if
> >> that's not possible.
> >
> > Actually, I'd prefer to look at it the other way around: a Python 3
> > WSGI server or middleware *may* accept bytes objects instead of str.
> >
> > This is relatively easy for the response side of things, but the
> > request side is rather more difficult, since wsgi.input may need to
> > be binary rather than text mode. (I think we can reasonably assume
> > that wsgi.errors is a text mode stream, and should support a
> > reasonable encoding.)
>
> wsgi.input definitely seems like it should be bytes to me. Unless we
> want to put the encoding process into the server. Not entirely
> infeasible, but a bit of a strain. And the request body might very well
> be binary, e.g., on a PUT.
>
> The CGI keys in the environment don't feel at all like bytes to me, but
> then they aren't unicode either. They can be unicode, again given a bit
> of work on the server side. Though unfortunately browsers are very poor
> at indicating their encoding for requests, and it ends up being policy
> and configuration as much as anything that determines the encoding of
> stuff like wsgi.input. I believe all request paths are UTF8 (?), but
> I'm not sure about QUERY_STRING. I'm a little fuzzy on some of the
> details there.
>
> The actual response body should also be bytes. Unless again we want to
> introduce upstream encoding.
>
> This does make everything feel more complicated.

It's the same level of complexity you run into as soon as you want to handle Unicode with WSGI in 2.x though, as it is caused by something outside our control (HTTP and browsers).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)