At 02:17 PM 8/27/2010 +1000, Graham Dumpleton wrote:
Since the major stumbling block to any sort of agreement, irrespective
of other changes, is still bytes vs unicode, and since we have a
reasonably clear definition of what the unicode suggestion is, can we
please, as a first step, get a definition of what bytes actually implies
so everyone knows what we are talking about? I specifically ask this
because it isn't clear: people don't explain in detail what they
mean when they say 'bytes'.
Going back to my definition #2 in my blog post from a year ago, I had:
1. The application is passed an instance of a Python dictionary
containing what is referred to as the WSGI environment. All keys in
this dictionary are native strings. For CGI variables, all names are
going to be ISO-8859-1 and so where native strings are unicode
strings, that encoding is used for the names of CGI variables.
FYI, one thing that's changed here is the existence of os.environb in
Python 3.2, at least on non-Windows OSes.
2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
environment, the value of the variable should be a native string.
Since any meaningful use of this value is going to end up needing to
be bytes again (e.g. Location headers), and for consistency's sake, I
lean towards saying this is bytes too.
3. For the CGI variables contained in the WSGI environment, the values
of the variables are byte strings.
4. The WSGI input stream 'wsgi.input' contained in the WSGI
environment and from which request content is read, should yield byte
strings.
5. The status line specified by the WSGI application must be a byte string.
6. The list of response headers specified by the WSGI application must
contain tuples consisting of two values, where each value is a byte
string.
7. The iterable returned by the application and from which response
content is derived, must yield byte strings.
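To make definition #2 concrete, here is a rough sketch of what an
application might look like under it. This is not PEP 333 as it stands:
the bytes status line and headers are taken from points 5 to 7 above,
and run_once is a hypothetical stand-in for a server, invented purely
for illustration.

```python
def app(environ, start_response):
    # Keys are native strings; CGI values are bytes under definition #2.
    path = environ['PATH_INFO']
    body = b'You asked for ' + path + b'\n'
    start_response(b'200 OK', [
        (b'Content-Type', b'text/plain'),
        (b'Content-Length', str(len(body)).encode('ascii')),
    ])
    return [body]


def run_once(application, environ):
    """Hypothetical driver: call the app once and collect what it produced."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(application(environ, start_response))
    return captured['status'], captured['headers'], body


status, headers, body = run_once(app, {'PATH_INFO': b'/caf\xc3\xa9'})
```

Note that every literal on the output side carries a b prefix, and that
the application never has to know which encoding the path was sent in.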
The points of disagreement I have seen about this are as follows.
For (1), the keys should also be bytes, including names of 'wsgi.'
special keys.
For (2), the value of 'wsgi.url_scheme' should be bytes.
So, do you really want bytes absolutely everywhere, or are keys still
going to be unicode taken as ISO-8859-1?
If we follow the example of os.environb, then the keys have to be bytes also.
However, I can already see that the big problem with all of this is
that WSGI code is going to be littered with a plague of "b"s hanging
off the front of every string literal, and that 2to3 is probably not
going to handle it correctly. Making the keys bytes as well just
multiplies the problem.
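For reference, a minimal sketch of what following os.environb would
mean: both keys and values come back as bytes. The helper name
environ_as_bytes is made up for illustration, and the os.fsencode
fallback is only an assumption about what a platform without
os.environb might reasonably do.

```python
import os


def environ_as_bytes():
    """Return the process environment as a {bytes: bytes} dictionary."""
    if os.supports_bytes_environ:
        # Non-Windows, Python 3.2+: os.environb exposes the raw bytes.
        return dict(os.environb)
    # Assumed fallback: re-encode the str view of the environment.
    return {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}


env = environ_as_bytes()
# Lookups now need bytes keys, e.g. env.get(b'PATH').
path = env.get(b'PATH')
```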
Note that we are not agreeing to the final solution here, just what
bytes means in contrast to the unicode option, so we know that we are
comparing only two options and not many options because people have
different interpretations of what bytes means.
As a contrast, what we generally mean by the unicode option is
definition #3 from my blog post. That being:
1. The application is passed an instance of a Python dictionary
containing what is referred to as the WSGI environment. All keys in
this dictionary are native strings. For CGI variables, all names are
going to be ISO-8859-1 and so where native strings are unicode
strings, that encoding is used for the names of CGI variables.
2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
environment, the value of the variable should be a native string.
3. For the CGI variables contained in the WSGI environment, the values
of the variables are native strings. Where native strings are unicode
strings, ISO-8859-1 encoding would be used such that the original
character data is preserved and as necessary the unicode string can be
converted back to bytes and thence decoded to unicode again using a
different encoding.
4. The WSGI input stream 'wsgi.input' contained in the WSGI
environment and from which request content is read, should yield byte
strings.
5. The status line specified by the WSGI application should be a byte
string. Where native strings are unicode strings, the native string
type can also be returned in which case it would be encoded as
ISO-8859-1.
6. The list of response headers specified by the WSGI application
should contain tuples consisting of two values, where each value is a
byte string. Where native strings are unicode strings, the native
string type can also be returned in which case it would be encoded as
ISO-8859-1.
7. The iterable returned by the application and from which response
content is derived, should yield byte strings. Where native strings
are unicode strings, the native string type can also be returned in
which case it would be encoded as ISO-8859-1.
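The round trip described in point 3, and the optional native-string
outputs in points 5 to 7, can be sketched as follows. to_wire is a
hypothetical helper name, and UTF-8 is assumed here as the encoding the
client actually used.

```python
# Bytes as received from the client (assumed to be UTF-8).
raw = b'/caf\xc3\xa9'

# What a server would place in environ where native strings are unicode.
native = raw.decode('iso-8859-1')

# The application recovers the original bytes, then decodes them properly.
path = native.encode('iso-8859-1').decode('utf-8')


def to_wire(value):
    """Hypothetical server-side helper: accept bytes or a native string."""
    if isinstance(value, bytes):
        return value
    # Raises UnicodeEncodeError for characters outside ISO-8859-1.
    return value.encode('iso-8859-1')
```

Decoding as ISO-8859-1 never loses data because each byte maps to
exactly one code point, which is why the round trip works. The failure
mode is on the output side, where a non-latin1 character is only
detected when the server tries to encode it.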
Even though we call it unicode, it actually has bytes in places as well.
The key issue over bytes vs unicode has been the values in the
dictionary, but as pointed out above, it is not clear whether, for the
bytes option, we are talking about bytes for keys as well, and for the
value of 'wsgi.url_scheme'.
The main issue I have with this option is that it seems to make it
trivially easy to write an app or piece of middleware that seems to
work correctly most of the time, unless placed in the right
combination with other apps or middleware.
More precisely, an updated wsgiref.validate module used to check the
"unicode option" would mark such apps and middleware as perfectly
spec-conformant, yet this spec-conformance would not be transitive -
i.e., you couldn't say that an assembly of spec-conformant middleware
and apps would be correct.
Hmmm... unless... I guess the only way to be really sure would be
if the validation process randomly switched the types of input and
output values between the forms allowed by the spec, and verified that
the results were still compliant. ;-)
(In practice, I expect that getting it to do that would be rather
difficult, though.)
Let me see if I can more precisely narrow down my concern.
Mostly, it boils down to the possibility of non-latin1 unicode
"escaping" into the output stream... so if #5, #6 and #7 above were
changed to bytes-only outputs, then an updated validator can enforce
those criteria, making spec-compliance verification
composable. (That is, if you combine two things that are verified
compliant, the combination is also known to be compliant.)
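A minimal sketch of what such a bytes-only output check might look like
as middleware. This is not the actual wsgiref.validate code;
require_bytes_output is a hypothetical name, and the sketch leaves out
details such as forwarding close() on the wrapped iterable.

```python
def require_bytes_output(app):
    """Hypothetical checker: raise TypeError if any output is not bytes."""
    def checked(environ, start_response):
        def checked_start_response(status, headers, exc_info=None):
            if not isinstance(status, bytes):
                raise TypeError('status line must be bytes')
            for name, value in headers:
                if not (isinstance(name, bytes) and isinstance(value, bytes)):
                    raise TypeError('header names and values must be bytes')
            if exc_info is None:
                return start_response(status, headers)
            return start_response(status, headers, exc_info)

        # The checks run lazily, as the server iterates over the response.
        for chunk in app(environ, checked_start_response):
            if not isinstance(chunk, bytes):
                raise TypeError('response body chunks must be bytes')
            yield chunk
    return checked
```

Because the check only ever accepts one type, wrapping an already
checked component in the checker again cannot change the verdict, which
is the composability property in question.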
So, I could actually support a format that was "unicode (latin1)
headers in, bytes headers out", and "bytes stream in, bytes stream out".
You can then concentrate all your encoding or decoding operations at
one place, or even write a decorator to take care of it for you.
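As a rough illustration of the decorator idea, assuming "unicode
(latin1) headers in, bytes headers out": the wrapper below re-decodes a
few environ values to the real charset on the way in and encodes the
status and headers to bytes on the way out. The name text_app, the
TEXT_KEYS list and the choice to also encode body chunks are all
assumptions made for this sketch, not part of any proposal.

```python
import functools

# Assumption: which environ values the application wants re-decoded.
TEXT_KEYS = ('SCRIPT_NAME', 'PATH_INFO', 'QUERY_STRING')


def text_app(charset='utf-8'):
    """Hypothetical decorator that keeps all transcoding in one place."""
    def decorator(app):
        @functools.wraps(app)
        def wrapper(environ, start_response):
            environ = dict(environ)
            for key in TEXT_KEYS:
                if key in environ:
                    # latin1 native string -> original bytes -> real text
                    original = environ[key].encode('iso-8859-1')
                    environ[key] = original.decode(charset)

            def bytes_start_response(status, headers, exc_info=None):
                status = status.encode('iso-8859-1')
                headers = [(name.encode('iso-8859-1'),
                            value.encode('iso-8859-1'))
                           for name, value in headers]
                if exc_info is None:
                    return start_response(status, headers)
                return start_response(status, headers, exc_info)

            # Body chunks are text inside the app and bytes outside it.
            return [chunk.encode(charset)
                    for chunk in app(environ, bytes_start_response)]
        return wrapper
    return decorator


@text_app('utf-8')
def hello(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return ['Hello ' + environ['PATH_INFO']]
```

Inside hello everything is ordinary text with no b prefixes, while
anything downstream of the wrapper only ever sees bytes.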
So, can we clarify this first? And if you are going to comment,
for that extra clarity, cut and paste my definition #2 above and make
the changes to it so we have the full definition, rather than just
referring to bits. That way people who come and read this don't have
to trawl through the whole email chain to derive the context.
Once we get that clarification, then we can perhaps discuss
exclusively any issues people have with that bytes definition. That is
before we even try to balance it against the unicode option or look at
other WSGI 2 changes such as dropping start_response and
wsgi.file_wrapper.
And I apologise in advance if I start getting cranky and people think
I am trying to hijack the conversation. I want a solution more so than
probably anyone else, as I can't fix up mod_wsgi until there is one, and
right now I am feeling pretty unmotivated towards doing anything with
mod_wsgi at all, even non Python 3.X enhancements, because of all this.
So, if we can keep focus and try going one step at a time, maybe I
will not go ballistic. ;-)
Thanks for hanging in there, and also for posting this summary!