Ian, I know you have seen this before, but I didn't realise you hadn't
cc'd the list. I have added a new response to part 4 of what you
originally sent, which wasn't in my first reply that went direct to you.

2009/8/4 Ian Bicking <i...@colorstudy.com>:
> On Mon, Aug 3, 2009 at 7:38 PM, Graham
> Dumpleton<graham.dumple...@gmail.com> wrote:
>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>> what I was going to implement.
>>
>> 1. When running under Python 3, applications SHOULD produce bytes
>> output, status line and headers.
>
> Sure.
>
>> This is effectively what we had before. The only difference is that
>> clarify that the 'status line' values should also be bytes. This
>> wasn't noted before. I had already updated the proposed WSGI 1.0
>> amendments page to mention this.
>>
>> 2. When running under Python 3, servers and gateways MUST accept
>> strings for output, status line and headers. Such strings must be
>> converted to bytes output using 'latin-1'. If string cannot be
>> converted then is treated as an error.
>>
>> This is again what we had before except that mention 'status line' value.
>
> Sure. ASCII for the status would be acceptable, as I believe that is
> an HTTP constraint.
>
>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>> binary (byte) input stream.
>>
>> No change here.
>
> Yep.
>
>> 4. When running under Python 3, servers MUST provide a text stream for
>> wsgi.errors. In converting this to a byte stream for writing to a
>> file, the default encoding would be applied.
>>
>> No real change here except to clarify that default encoding would
>> apply. Use of default encoding though could be problematic if
>> combining different WSGI components. This is because each WSGI
>> component may have been developed on system with different default
>> encoding and so one may expect to log characters that can't be written
>> on a different setup. Not sure how you could solve that except to say
>> people have default encoding be UTF-8 for portability.
>
> Sure. We might specify that the server should never give an encoding
> error; it should use 'replace' or something to make sure it won't
> fail.
> Maybe it should be specified what should happen when bytes are
> received. I generally believe that error handling code should try to
> be as robust as possible, so it shouldn't fail regardless of what it
> is given.

Not that it matters, but it looks like for Apache/mod_wsgi, wsgi.errors
should be an instance of io.TextIOWrapper wrapping an internal mod_wsgi
specific buffer object that provides an interface compatible with
io.BufferedIOBase.

If someone uses write() on the wrapper with bytes, it will fail:

  TypeError: write() argument 1 must be str, not bytes

If someone uses print() to output data, then bytes are converted okay.
That is:

  print(b'1234', file=environ['wsgi.errors'])

yields:

  b'1234'

If 'replace' is used for errors, you do end up with data loss. Use of
'backslashreplace' at least preserves values as numbers, although for
Apache at least, if you use the 'ascii' encoding, you get a bit of a
mess as the backslashes get escaped again:

  \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10

instead of the original:

  \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10

That is because the Apache logging functions escape anything which
isn't printable ASCII, and in turn escape the backslash that denotes an
escaped character.

If you use an encoding of utf-8 instead, then the byte values get passed
through and the Apache logging functions just escape the non printable
bytes instead, so all up it looks nicer:

  \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90

So for Apache/mod_wsgi at least, the best thing to do seems to be to use
'replace' and 'utf-8', due to the way that the Apache error logging
functions work.

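To make the difference between the error handlers concrete, here is a
rough sketch. It is not mod_wsgi code: io.BytesIO merely stands in for
the server's internal buffer object, and the helper name
encode_via_wrapper is made up for illustration. It only shows what bytes
an io.TextIOWrapper hands to the underlying buffer, before any escaping
that a logging layer such as Apache's might apply on top.

```python
import io

# Short Thai sample, the same kind of text as in the log output above.
SAMPLE = "\u0e40\u0e2d\u0e01"


def encode_via_wrapper(text, encoding, errors):
    """Write text through a TextIOWrapper and return the bytes it produced."""
    raw = io.BytesIO()  # stand-in for the server's internal buffer object
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    for encoding, errors in [
        ("ascii", "replace"),            # lossy: each character becomes '?'
        ("ascii", "backslashreplace"),   # keeps code points as \uXXXX escapes
        ("ascii", "xmlcharrefreplace"),  # keeps code points as &#NNNN; references
        ("utf-8", "replace"),            # everything encodable, raw UTF-8 bytes
    ]:
        print(encoding, errors, encode_via_wrapper(SAMPLE, encoding, errors))

    # 'strict' is the handler that makes logging itself blow up.
    try:
        encode_via_wrapper(SAMPLE, "ascii", "strict")
    except UnicodeEncodeError as exc:
        print("strict fails:", exc.reason)
```

With 'ascii' only the two escaping handlers keep the code points
recoverable, while with 'utf-8' nothing needs replacing for ordinary
text, which is why the choice depends on what the underlying logging
system does with the bytes it is given.
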
I guess the point from this is that the specification should possibly
say that wsgi.errors should be an instance of io.TextIOWrapper. A
specific implementation should not use 'strict', but use 'replace' or
'backslashreplace' as makes sense, dependent on what encoding it needs
to use and how any underlying logging system it overlays works. The
intent overall is to preserve as much of the raw information as
possible.

>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, but this is entirely
>> optional. Note that there is no requirement to deal with RFC 2047.
>
> Ugh. This is where I'm not happy with how WSGI 1 in Python 3 has been
> treated. I think it should be bytes, just like it is in Python 2.

I still don't understand what the practical, as opposed to theoretical,
use case is for that in Python 3.

In Python 2, byte strings work out okay because URL routing rules,
through whatever means, are generally also going to be defined in terms
of byte strings. In Python 3, however, routing is likely going to
default to being defined with strings, and as such any information like
SCRIPT_NAME, PATH_INFO and QUERY_STRING will almost immediately have to
be converted from bytes to strings to apply routing rules anyway.

Can you expand on what benefits come from bytes, and what practical use
case would predominate such that bytes would be the better option?

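As a minimal sketch of the routing point, assuming a made-up route table
and handler name (nothing here comes from a real framework): routes
written as Python 3 strings never match a bytes PATH_INFO, so a
bytes-based environ pushes a decode step into every application before
routing can happen.

```python
# Hypothetical route table, keyed by str as Python 3 code would naturally write it.
ROUTES = {"/articles/caf\u00e9": "show_article"}


def route(path_info):
    """Return the handler name registered for path_info, or None."""
    return ROUTES.get(path_info)


def route_from_bytes(raw_path, encoding="utf-8"):
    """Decode a bytes path, as a bytes-based environ would require, then route."""
    return route(raw_path.decode(encoding))


if __name__ == "__main__":
    # Bytes as a server might hold PATH_INFO after URL unquoting.
    raw = "/articles/caf\u00e9".encode("utf-8")
    print(route(raw))             # bytes never compare equal to the str keys
    print(route_from_bytes(raw))  # matches once decoded
```
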
> But if we have an encoding, I guess UTF8 is okay so long as it uses
> PEP 383: http://www.python.org/dev/peps/pep-0383/ -- for the most part
> PEP 383, and putting the encoding that was used into the environment,
> makes transcoding doable. PEP 383 doesn't allow for transcoding
> unless you keep track of the encoding used, so we have to store that
> in the environment.

Again, what practical use cases are there where transcoding would be
necessary, especially if it were a requirement that the WSGI
adapter/server at the lowest level, where it makes sense for that server
infrastructure (i.e., it can support something other than UTF-8),
provide an option to supply WSGI environ values, all or selected,
interpreted as a different encoding?

If the option is at the WSGI adapter/server level and managed at the
point of original translation from bytes, then a WSGI application or
middleware doesn't need to worry about it. As such, noting in the
environment what encoding was used serves no purpose other than being
informational.

Marking what encoding was used also would not necessarily be
straightforward if the WSGI adapter/server provided a way of overriding
the encoding used for specific values, because a single encoding
indicator would not suffice.

To allow experimentation with the encoding of values, the current
mod_wsgi code allows overriding of values on a global or individual
basis. This is done via an Apache directive, but as I had to pass this
information from the main Apache worker process to the mod_wsgi daemon
process, I did it in such a way that it is also visible to the
application for information purposes at this point. I was using the
following convention.

  # Override encoding for everything to UTF-8.
  mod_wsgi.variable_encoding: UTF-8

  # Override encoding and pass raw bytes for everything.
  mod_wsgi.variable_encoding: -

  # Override encoding of specific value to UTF-8.
  mod_wsgi.variable_encoding.SCRIPT_NAME: UTF-8

  # Override encoding and pass raw bytes for specific value.
  mod_wsgi.variable_encoding.SCRIPT_NAME: -

If the default encoding is used for everything, then no value is passed
at all.

In respect of passing bytes for values, we get back to the argument from
past discussions as to what should be passed as bytes. Do you only do
SCRIPT_NAME, PATH_INFO and QUERY_STRING? What about server specific
variables such as REQUEST_URI? What about headers such as Referer? What
about custom user values set using something like the SetEnv directive
in Apache?

This is where it started to turn into a can of worms last time. You
either treat everything as UTF-8 to be consistent, or use bytes for
everything, in which case a great deal more work is put onto WSGI
applications even for potentially simple stuff, effectively forcing the
use of high level request wrappers like WebOb or the request object in
Werkzeug.

In summary, what are the practical use cases that would make passing
bytes over UTF-8 or even latin-1 worthwhile?

If passing bytes, what values should be passed as bytes and what left
alone?

What practical use cases are there that would necessitate transcoding?

Some actual practical examples would very much help in this discussion,
as we tend to keep talking about theoretical possibilities rather than
actual practice.

Graham
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com