Re: [Web-SIG] WSGI 2: Decoding the Request-URI
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
> However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading / character! The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI here, but about all request headers.)

Decoding everything using ISO-8859-1 has the nice property of keeping the information intact. It would be a good heuristic if everything, with a few exceptions, really was encoded using ISO-8859-1: just transcode the few problematic cases at the application level and everybody is happy. A string decoded from ISO-8859-1 is like a bytes object with a string 'interface' on top of it.

But it sweeps the encoding problem under the carpet. The problem with Python 2 was that str and unicode were almost the same, so much the same that it was possible to mix them without too many problems:

    >>> 'foo' == u'foo'
    True

Python 3 made bytes and string 'incompatible' to force programmers to handle the encoding problem as soon as possible:

    >>> b'foo' == 'foo'
    False

If we pass `str` objects to the application, the application author could believe that the encoding problem has been handled. But in most cases it hasn't been handled at all. The application author still has to transcode all the incorrectly decoded strings.

We are back to Python 2's bad old days, where we can't be sure that what we got is properly encoded. Was that string decoded using latin-1? Maybe a middleware transcoded it to UTF-8 before the application was called. Maybe the application itself transcoded it at some point, but then we need to keep track of what was transcoded. Maybe the application should transcode everything when it is called.

Also, EVERY application author will have to read the PEP, especially the paragraph saying: "Everything we give you is a string, but you still have to deal with the encoding mess." Otherwise he will have weird problems like when he was using Python 2, because the interface is not clear.

Strings are supposed to be text and only text. Decoding everything as ISO-8859-1 means strings are not text anymore, they are 'encoded data' [1]. Bytes are supposed to be 'encoded data' and binary blobs. If we give applications bytes, the author knows right away that he should decode them. No need to read the PEP.

`bytes` can do everything `str` can do, with the notable exception of 'format':

    >>> import re
    >>> b'foo bar'.title()
    b'Foo Bar'
    >>> b'/foo/bar/fran\xc3ois'.split(b'/')
    [b'', b'foo', b'bar', b'fran\xc3ois']
    >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
    (b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the encoding problem is the realm of the application, not the realm of the gateway. Let the application handle the encoding problem and don't give it a half-baked solution.

Using bytes also has its own set of problems. The standard library doesn't support bytes very well. For example, urllib.parse.unquote() doesn't work with bytes, and the rest of urllib.parse has issues too.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

--
Henry Prêcheur
___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
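The "transcode at the application level" step discussed in this message can be sketched as follows. This is only an illustration, assuming the gateway decoded the raw bytes as ISO-8859-1 and the application believes the value is really UTF-8; the helper name `transcode` is hypothetical and is not part of any WSGI proposal.

```python
def transcode(value, target="utf-8", source="iso-8859-1"):
    """Undo a lossless latin-1 decode and decode the same bytes again.

    Hypothetical helper: `value` is assumed to be a str that the gateway
    produced with bytes.decode("iso-8859-1").
    """
    raw = value.encode(source)   # recovers the exact original bytes
    return raw.decode(target)    # may raise UnicodeDecodeError if the guess is wrong


raw_path = "/fran\u00e7ais".encode("utf-8")     # bytes as sent by a UTF-8 client
as_latin1 = raw_path.decode("iso-8859-1")       # what the application would receive

# The str looks like text but is not the intended text...
assert as_latin1 != "/fran\u00e7ais"
# ...although no information was lost, so the application can still fix it.
assert as_latin1.encode("iso-8859-1") == raw_path
assert transcode(as_latin1) == "/fran\u00e7ais"
```

The round trip works because ISO-8859-1 maps every byte value to exactly one code point; the cost is that nothing in the `str` itself tells the application whether the transcoding step has already been done.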
Re: [Web-SIG] WSGI 2: Decoding the Request-URI
I wrote:
> Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's.
> ...
> As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application.
> ...
> In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding. In other words, the URI spec seems to imply that the two URI's /a%c3%bf and /a%ff may be equivalent, if the former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard.
> ...
> For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.

Robert Brewer
fuman...@aminus.org
Re: [Web-SIG] WSGI 2: Decoding the Request-URI
At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:
> Did I say 3 reasons? I meant 4: Accept-Charset.

Chief amongst the reasons... amongst our reasonry... Right, we'll come in again. ;-)
[Web-SIG] WSGI 2: Decoding the Request-URI
I wrote:
> PATH_INFO and QUERY_STRING are ... decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed.

and Ian replied:
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what encoding a page might be in. At least that's what I got when testing Firefox. It might not be valid UTF-8 if it was manually constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the encoding of the document [1] or Windows-1252 [2] for the querystring. But the vast majority of HTTP user agents are not browsers [3]. Even if that were not so, we should not define WSGI to only interoperate with the most current browsers.

and Graham added:
> Thinking about it for a while, I get the feel that having a fallback to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That URLs wouldn't consistently use the same encoding all the time just seems wrong. I would see it as returning a bad request status. If an application coder knows they are actually going to be dealing with latin-1, as that is how the application is written, then they should be specifying it should be latin-1 always instead of utf-8. Thus, the WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's.

> For simple WSGI adapters which only service one WSGI application, then it would apply to whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI comparison. Comparison is at the heart of handler dispatch, static resource identification, and proper HTTP cache operation. It is for these reasons that RFC 3986 has an extensive section on the matter [5], including a ladder of approaches:

* Simple String Comparison
* Case Normalization (e.g. /a%3D == /a%3d)
* Percent-Encoding Normalization (e.g. /a%62c == /abc)
* Path Segment Normalization (e.g. /abc/../def == /def)
* Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/)
* Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing showed it to be)

I think it would be beneficial to those who develop WSGI application interfaces to be able to assume that at least case-, percent-, path-, and scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by all WSGI 2 origin servers.

All of those except for the first one can be accomplished without decoding the target URI. But that first section specifically states:

    In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding.

In other words, the URI spec seems to imply that the two URI's /a%c3%bf and /a%ff may be equivalent, if the former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard.

> For something like Apache where could map to multiple WSGI applications, then it may want to provide means of overriding encoding for specific subsets of URLs, ie., using Location directive for example.

For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding.

Yet we still haven't answered the question of how to handle unforeseen encodings. You're right that, if the server-side stack as a whole cannot map a particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading / character! The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least
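The /a%c3%bf versus /a%ff comparison in the message above can be checked with a short sketch. It assumes the charset of each URI is already known, which is exactly what the thread says a server often cannot know; the helper name `decoded_path` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decoded_path(raw_path, charset):
    """Percent-decode a path to bytes, then decode those bytes to code points."""
    return unquote_to_bytes(raw_path).decode(charset)


utf8_form = decoded_path("/a%c3%bf", "utf-8")
latin1_form = decoded_path("/a%ff", "iso-8859-1")

# Compared as raw strings the two URIs differ...
assert "/a%c3%bf" != "/a%ff"
# ...but codepoint-by-codepoint, after conversion, they are the same path.
assert utf8_form == latin1_form == "/a\u00ff"
```

Swapping the charsets gives a different answer (`/a%ff` is not valid UTF-8 at all), which is why a comparison of this kind only makes sense once the encoding question has been settled.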
Re: [Web-SIG] WSGI 2
On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
> Correct -- you can write any set of % encodings, and I don't think it even has to be able to validly url-decode (e.g., /foo%zzz will work). It definitely doesn't have to be a valid encoding. However, if you actually include unicode characters, they will always be encoded as UTF-8 (as goes with the IRI standard). This is in a case like <a href="/some page">, the browser will request /some%20page, because it escapes unsafe characters. Similarly if you request <a href="/français"> it will encode that ç in UTF-8, then url-encode it, even if the page itself is ISO-8859-1. Well, at least on Firefox. I used this to test: http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

I have run some tests regarding the encoding issue. curl doesn't 'url-encode' its URLs:

    curl 'http://hostname/français'
                              ^ e7, a latin-1 character

The latin-1 character is sent to the server as is. Lighttpd accepts the URL and even returns a file if it exists. Of course, if I try with the same character in UTF-8 it doesn't work.

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is that libcurl is quite popular (it used to be the transport library of Webkit/GTK+, for example). It's hard to dismiss it as an utterly broken, obscure tool. Many 'simplistic' HTTP clients may have the same problem.

Now let's talk a little bit about cookies... Cookies can contain whatever 'binary junk' the server sends. RFC 2965 says (http://tools.ietf.org/html/rfc2965#page-5):

    The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding.

Also, cookies can contain 'comments', which contain UTF-8 strings (http://tools.ietf.org/html/rfc2965#page-6):

    Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It looks like it assumes cookies are encoded using latin-1, since latin-1 characters are displayed correctly in Firebug, but not UTF-8 ones.

Cheers,

--
Henry Prêcheur
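The browser behaviour Ian describes (encode the character, then url-encode the bytes) can be reproduced with the standard library. This is a sketch only: `request_path` is a hypothetical name, and the ISO-8859-1 call shows what a client would send if it used the page charset instead of UTF-8; it is not a claim about what any particular client does.

```python
from urllib.parse import quote


def request_path(path, charset="utf-8"):
    """Encode a path with `charset`, then percent-encode the unsafe bytes."""
    return quote(path, safe="/", encoding=charset)


print(request_path("/some page"))                            # /some%20page
print(request_path("/fran\u00e7ais"))                        # /fran%C3%A7ais
print(request_path("/fran\u00e7ais", charset="iso-8859-1"))  # /fran%E7ais
```

The curl case from the message is different again: the byte 0xE7 goes on the wire without any percent-encoding, so the server sees a raw non-ASCII byte in the Request-URI.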
Re: [Web-SIG] WSGI 2
2009/8/12 Henry Precheur he...@precheur.org:
> Using bytes for all `environ` values is easy to understand on the application side as long as you are aware of the encoding problem. The cost is inconvenience, but that's probably OK. It's also simpler to implement on the gateway/server side.

Use of bytes everywhere can be inconvenient on the gateway/server side, at least as far as the end result for the user goes. The specific problem is that the WSGI environment is used to hold information about the original request, as CGI variables, but can also hold user-specified custom variables. In the case of anything hosted via Apache, such as through mod_wsgi, mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such custom variables using the SetEnv directive. Thus one might say:

    SetEnv trac.env_path /usr/local/trac/site-1

If the rule is that everything in the WSGI environment coming from the WSGI adapter must be bytes, then you have a potential for mismatch in expectations of how values will be passed. That is, a value set using SetEnv would be bytes, but one set using a WSGI middleware wrapper for configuration is more likely going to be a string. It would seem overly onerous to expect WSGI middleware to use bytes for configuration variables as well, and so force all consumers to always convert to string using an appropriate encoding, when the required encoding is potentially unknown.

The underlying problem here is in part, albeit maybe from convention, that there is a single dictionary for both request information and user configuration. It isn't a simple matter of splitting them so that request information is always separate, though. For FASTCGI, SCGI and CGI you can't split them, as there is only one grouping in those cases.

This is why I specifically asked previously, and no one has answered: if bytes is to be used, which variables in the WSGI environment should be passed as bytes? If there is a specified list of variables which are known to always be bytes, it may be more manageable.

If someone is going to suggest that only CGI variables should be bytes, then what does that actually mean? Remember that for FASTCGI, SCGI and CGI there isn't really a distinction, and so the boundary as to what is a CGI variable is fuzzy, although you could reverse the transformation and get back bytes if you know which variables to do it for.

One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and QUERY_STRING, and maybe that will suffice. It may not though, because what about headers such as HTTP_REFERER? Also, what about additional SSL_? variables that an SSL module for the web server may add?

Graham

> By choosing bytes, WSGI passes the encoding problem to the application, which is good. Let the application deal with that. It's more likely to know what it needs, and what problems it can ignore. I think that 99% of the time, applications will just decode bytes to string using UTF-8, ignoring invalid values.
>
> However, it's likely that we'll see middlewares converting ALL environment values to UTF-8, because it's more convenient than using bytes. And some middlewares might depend on `environ` values being strings instead of bytes, because it's convenient too. This issue was already raised by Graham, and I think it's important to make it clear.
>
> I believe that 'server/CGI' values in the environment shouldn't be modified. Of course it should still be possible to add new values. This way the stack will always remain in a 'sane' state. For example, if a middleware wants to convert environ values to UTF-8, it shouldn't do this:
>
>     for key, value in environ.items():
>         environ[key] = str(value)
>
> But something like this, assuming there are only bytes in `environ`:
>
>     environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
>                                       for key, value in environ.items())
>
> I'm in favor of using bytes everywhere. But it's important to document why bytes are used and how to use them. I'm not sure this should be included in a PEP, maybe a WSGI best practices?
>
> Cheers,
>
> --
> Henry Prêcheur
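Henry's "add, don't overwrite" suggestion can be written as a small middleware. The sketch below assumes the bytes-only environ under discussion (not the WSGI 1.0 behaviour); the key `unicode.environ` comes from his example, while the function name and the `'replace'` error handler are assumptions made for illustration.

```python
def decoded_environ_middleware(app, charset="utf-8", key="unicode.environ"):
    """Wrap `app` so that a decoded copy of the environ is added under `key`.

    The original entries are left untouched, so components further down
    the stack still see exactly what the server provided.
    """
    def wrapper(environ, start_response):
        # The dict is built completely before it is stored, so the
        # environ is never modified while it is being iterated over.
        environ[key] = {
            name: value.decode(charset, "replace")
            for name, value in environ.items()
            if isinstance(value, bytes)   # skip non-bytes entries, e.g. wsgi.* objects
        }
        return app(environ, start_response)
    return wrapper
```

The `isinstance` check also sidesteps Graham's objection in part: string values set by other middleware are simply left out of the decoded copy instead of causing an error.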
Re: [Web-SIG] WSGI 2
Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept strings for output, status line and headers. Such strings must be converted to bytes output using 'latin-1'. If string cannot be converted then is treated as an error.

Yes.

> 3. When running under Python 3, servers MUST provide wsgi.input as a binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for wsgi.errors. In converting this to a byte stream for writing to a file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must be ascii-decodable. So are SERVER_PROTOCOL and our custom ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. Our origin server always sets SCRIPT_NAME to '', but if we populated it, we would make it decoded by the same charset.

All request headers are decoded via ISO-8859-1, which can't fail. Applications are expected to transcode these values if they believe them to be in another encoding.

> This is where I am going to diverge from what has been discussed before. The reason I am going to pass as UTF-8 and not latin-1 is that it looks like Apache effectively only supports use of UTF-8. Since this means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and even CGI likely cannot handle anything besides UTF-8 then I really can't see the point of trying to cater for a theoretical possibility that some HTTP client could use something besides UTF-8. In other words, the predominant case will be UTF-8, so let us target that.

That is predominant for the Request-URI, and we are defaulting to utf-8 for that as I mentioned above. I believe I demonstrated in http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8 cannot be the predominant encoding for request headers, which are instead mostly ASCII with a few ISO-8859-1's, which is why we are defaulting to ISO-8859-1.

> So, rather than burden every WSGI application with the need to convert from latin-1 back to bytes and then to UTF-8, let the server deal with it, with server using sensible default, and where server infrastructure can handle a different encoding, then it can provide option to use that encoding and WSGI application doesn't need to change.

If there are indeed more headers which are ISO-8859-1, then that same argument cuts both ways. I have no problem doing the same thing here as we do for PATH_INFO: a configurable charset, or better yet a list of charsets to try in order, with a sensible default, even UTF-8 would be fine. Regardless of the default, if it is configurable, then the successful encoding should be put in a canonical environ entry so apps can transcode it if the server got it wrong.

Re: bytes. We really do not want the server to set any of the above environ entries (except REQUEST_URI) to bytes. I'm surprised those of you who have substantial numbers of WSGI middleware aren't fighting this; it would mean decoding the same environ entries every time you switched middleware providers. Some of you said as much at PyCon: http://mail.python.org/pipermail/web-sig/2009-March/003701.html

Robert Brewer
fuman...@aminus.org
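The "list of charsets to try in order" idea from this message, together with recording the winner in environ['REQUEST_URI_ENCODING'], might look like the sketch below. It is not CherryPy's code; the function names and the default charset tuple are assumptions, and only the environ key is taken from the message.

```python
def decode_with_fallback(raw, charsets=("utf-8", "iso-8859-1")):
    """Return (text, charset) for the first charset that decodes `raw`."""
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the configured charsets could decode the value")


def set_path_info(environ, raw_path, charsets=("utf-8", "iso-8859-1")):
    """Store the decoded path and the charset that succeeded (hypothetical helper)."""
    text, used = decode_with_fallback(raw_path, charsets)
    environ["PATH_INFO"] = text
    environ["REQUEST_URI_ENCODING"] = used   # lets apps transcode if the guess was wrong
    return environ
```

With ISO-8859-1 last in the list the loop always terminates successfully, since every byte sequence is valid ISO-8859-1; the `ValueError` can only occur when the configured list leaves it out.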
Re: [Web-SIG] WSGI 2
On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer fuman...@aminus.org wrote:
>> 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there is no requirement to deal with RFC 2047.
>
> We're passing unicode for almost everything.
>
> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must be ascii-decodable. So are SERVER_PROTOCOL and our custom ACTUAL_SERVER_PROTOCOL entries.
>
> The original bytes of the Request-URI are stored in REQUEST_URI. However, PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. Our origin server always sets SCRIPT_NAME to '', but if we populated it, we would make it decoded by the same charset.

My understanding is that PATH_INFO *should* be UTF-8 regardless of what encoding a page might be in. At least that's what I got when testing Firefox. It might not be valid UTF-8 if it was manually constructed, but then there's little reason to think it is valid anything; only the bytes or REQUEST_URI are likely to be an accurate representation.

(Frankly I wish PATH_INFO was not url-decoded, which would remove this issue entirely -- REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't know of reasonable cases where this wouldn't be true.)

I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be used to kind of reconstruct the original request path (the surrogateescape or whatever it is called would serve the same purpose, but is only available in Python 3).

--
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
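Ian's remark that surrogateescape "would serve the same purpose" as an ISO-8859-1 fallback can be illustrated with the Python 3 codecs. The helper names below are hypothetical; the point is only that both routes let the original bytes be reconstructed.

```python
def decode_preserving(raw, charset="utf-8"):
    """Decode `raw`, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode(charset, "surrogateescape")


def original_bytes(text, charset="utf-8"):
    """Invert decode_preserving() and get the request bytes back."""
    return text.encode(charset, "surrogateescape")


raw = b"/fran\xe7ais"                      # latin-1 bytes, not valid UTF-8

escaped = decode_preserving(raw)
assert escaped == "/fran\udce7ais"         # byte 0xE7 became U+DCE7
assert original_bytes(escaped) == raw

# The ISO-8859-1 fallback is reversible in the same way.
latin1 = raw.decode("iso-8859-1")
assert latin1.encode("iso-8859-1") == raw
```

One practical difference: the surrogateescape result cannot be encoded with the default strict handler, so a mis-decoded value fails loudly when it is written out, whereas the ISO-8859-1 result looks like ordinary text.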
Re: [Web-SIG] WSGI 2
2009/8/12 Ian Bicking i...@colorstudy.com:
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer fuman...@aminus.org wrote:
>>> 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must be ascii-decodable. So are SERVER_PROTOCOL and our custom ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However, PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. Our origin server always sets SCRIPT_NAME to '', but if we populated it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what encoding a page might be in. At least that's what I got when testing Firefox. It might not be valid UTF-8 if it was manually constructed, but then there's little reason to think it is valid anything; only the bytes or REQUEST_URI are likely to be an accurate representation.

As I understood it, PJE was suggesting that wasn't the case. For example, what about the case where a URL appears as the target of a form POST and the encoding of that form page wasn't UTF-8? What is the browser going to send in that case? Or is this the sort of case you have tested and qualify as saying if manually constructed anything could happen?

> (Frankly I wish PATH_INFO was not url-decoded, which would remove this issue entirely -- REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't know of reasonable cases where this wouldn't be true.)
>
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be used to kind of reconstruct the original request path (the surrogateescape or whatever it is called would serve the same purpose, but is only available in Python 3).

Graham
Re: [Web-SIG] WSGI 2
Ian, know you have seen this before, but didn't realise you hadn't cc'd the list. I have added a new response to part 4 of what you originally sent that wasn't in first reply that went direct to you. 2009/8/4 Ian Bicking i...@colorstudy.com: On Mon, Aug 3, 2009 at 7:38 PM, Graham Dumpletongraham.dumple...@gmail.com wrote: So, for WSGI 1.0 style of interface and Python 3.0, the following is what I was going to implement. 1. When running under Python 3, applications SHOULD produce bytes output, status line and headers. Sure. This is effectively what we had before. The only difference is that clarify that the 'status line' values should also be bytes. This wasn't noted before. I had already updated the proposed WSGI 1.0 amendments page to mention this. 2. When running under Python 3, servers and gateways MUST accept strings for output, status line and headers. Such strings must be converted to bytes output using 'latin-1'. If string cannot be converted then is treated as an error. This is again what we had before except that mention 'status line' value. Sure. ASCII for the status would be acceptable, as I believe that is an HTTP constraint. 3. When running under Python 3, servers MUST provide wsgi.input as a binary (byte) input stream. No change here. Yep. 4. When running under Python 3, servers MUST provide a text stream for wsgi.errors. In converting this to a byte stream for writing to a file, the default encoding would be applied. No real change here except to clarify that default encoding would apply. Use of default encoding though could be problematic if combining different WSGI components. This is because each WSGI component may have been developed on system with different default encoding and so one may expect to log characters that can't be written on a different setup. Not sure how you could solve that except to say people have default encoding be UTF-8 for portability. Sure. 
We might specify that the server should never give an encoding error; it should use 'replace' or something to make sure it won't fail. Maybe it should be specified what should happen when bytes are received. I generally believe that error handling code should try to be as robust as possible, so it shouldn't fail regardless of what it is given. Not that it matters, but looks like that for Apache/mod_wsgi wsgi.errors should be an instance of io.TextIOWrapper wrapping internal mod_wsgi specific buffer object providing interface compatible with io.BufferedIOBase. If someone uses write() on wrapper with bytes it will fail: TypeError: write() argument 1 must be str, not bytes If someone use print() to output data, then bytes would be converted okay. That is: print(b'1234', file=environ['wsgi.errors']) yields: b'1234'. If 'replace' is used for errors, you do end up with data loss. Use of 'xmlcharrefreplace' at least preserves values as numbers, although for Apache at least, if use 'ascii' encoding, you get a bit of a mess as the backslashes get escaped again. \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10 instead of original: \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10 That is because Apache logging functions escape anything which isn't printable ASCII and in turn escapes backslash denoting escaped character. If use encoding of utf-8 instead, then byte values get passed and Apache logging functions then just escape the non printable bytes instead so all up looks nicer. 
\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90 So for Apache/mod_wsgi at least, best thing to do seems to use 'replace' and 'utf-8' due to way that Apache error logging functions work. I guess the point from this is that possibly should specify that wsgi.errors should be an instance of io.TextIOWrapper. A specific implementation should not use 'strict', but use 'replace' or 'backslashreplace' as makes sense, dependent on what encoding it needs to use and how any underlying logging system it overlays works. The intent overall being to preserve as much of raw information as possible. 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there
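The wsgi.errors suggestion in the message above (an io.TextIOWrapper that never raises on unencodable characters) can be sketched with the standard library. This is not mod_wsgi's implementation: the `io.BytesIO` stands in for the server's internal log buffer, and the function name and defaults are assumptions.

```python
import io


def make_error_stream(buffer, encoding="utf-8", errors="backslashreplace"):
    """Wrap a binary log buffer in a text stream that cannot fail on encoding."""
    return io.TextIOWrapper(
        buffer,
        encoding=encoding,
        errors=errors,         # anything but 'strict', as suggested in the thread
        newline="\n",          # no platform-specific newline translation
        line_buffering=True,   # flush each completed line to the buffer
    )


sink = io.BytesIO()            # stand-in for the server's real log sink
stream = make_error_stream(sink, encoding="ascii")
stream.write("caf\u00e9 \u0e40\n")
stream.flush()

# Non-ASCII characters survive as escapes instead of raising or being lost.
assert sink.getvalue() == b"caf\\xe9 \\u0e40\n"
```

As the message notes, passing bytes to `write()` on such a wrapper raises `TypeError`, while `print()` accepts them because it converts its arguments with `str()` first.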
Re: [Web-SIG] WSGI 2
Graham Dumpleton wrote: Ian, know you have seen this before, but didn't realise you hadn't cc'd the list. I have added a new response to part 4 of what you originally sent that wasn't in first reply that went direct to you. 2009/8/4 Ian Bicking i...@colorstudy.com: On Mon, Aug 3, 2009 at 7:38 PM, Graham Dumpletongraham.dumple...@gmail.com wrote: So, for WSGI 1.0 style of interface and Python 3.0, the following is what I was going to implement. 1. When running under Python 3, applications SHOULD produce bytes output, status line and headers. Sure. This is effectively what we had before. The only difference is that clarify that the 'status line' values should also be bytes. This wasn't noted before. I had already updated the proposed WSGI 1.0 amendments page to mention this. 2. When running under Python 3, servers and gateways MUST accept strings for output, status line and headers. Such strings must be converted to bytes output using 'latin-1'. If string cannot be converted then is treated as an error. This is again what we had before except that mention 'status line' value. Sure. ASCII for the status would be acceptable, as I believe that is an HTTP constraint. 3. When running under Python 3, servers MUST provide wsgi.input as a binary (byte) input stream. No change here. Yep. 4. When running under Python 3, servers MUST provide a text stream for wsgi.errors. In converting this to a byte stream for writing to a file, the default encoding would be applied. No real change here except to clarify that default encoding would apply. Use of default encoding though could be problematic if combining different WSGI components. This is because each WSGI component may have been developed on system with different default encoding and so one may expect to log characters that can't be written on a different setup. Not sure how you could solve that except to say people have default encoding be UTF-8 for portability. Sure. 
We might specify that the server should never give an encoding error; it should use 'replace' or something to make sure it won't fail. Maybe it should be specified what should happen when bytes are received. I generally believe that error handling code should try to be as robust as possible, so it shouldn't fail regardless of what it is given. Not that it matters, but looks like that for Apache/mod_wsgi wsgi.errors should be an instance of io.TextIOWrapper wrapping internal mod_wsgi specific buffer object providing interface compatible with io.BufferedIOBase. If someone uses write() on wrapper with bytes it will fail: TypeError: write() argument 1 must be str, not bytes If someone uses print() to output data, then bytes would be converted okay. That is: print(b'1234', file=environ['wsgi.errors']) yields: b'1234'. If 'replace' is used for errors, you do end up with data loss. Use of 'backslashreplace' at least preserves values as numbers, although for Apache at least, if use 'ascii' encoding, you get a bit of a mess as the backslashes get escaped again. \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10 instead of original: \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10 That is because Apache logging functions escape anything which isn't printable ASCII and in turn escapes backslash denoting escaped character. If use encoding of utf-8 instead, then byte values get passed and Apache logging functions then just escape the non printable bytes instead so all up looks nicer. 
\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90 So for Apache/mod_wsgi at least, best thing to do seems to use 'replace' and 'utf-8' due to way that Apache error logging functions work. I guess the point from this is that possibly should specify that wsgi.errors should be an instance of io.TextIOWrapper. A specific implementation should not use 'strict', but use 'replace' or 'backslashreplace' as makes sense, dependent on what encoding it needs to use and how any underlying logging system it overlays works. The intent overall being to preserve as much of raw information as possible. 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value
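The error-handler trade-offs discussed for wsgi.errors can be tried out with a plain io.TextIOWrapper. What follows is only a minimal sketch: an in-memory io.BytesIO stands in for the server's log buffer (mod_wsgi's real buffer object is not shown in the thread), and the helper names encode_via_wrapper and rejects_bytes are hypothetical.

```python
import io


def encode_via_wrapper(text, encoding, errors):
    """Write text through a TextIOWrapper and return the bytes that reach the buffer."""
    raw = io.BytesIO()  # assumption: stand-in for the server's log buffer
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


def rejects_bytes():
    """Return True if write() on a text wrapper refuses a bytes argument."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
    try:
        stream.write(b"1234")
    except TypeError:
        return True
    return False


sample = "\u0e40\u0e2d"  # first two characters of the Thai text in the thread

lossy = encode_via_wrapper(sample, "ascii", "replace")
escaped = encode_via_wrapper(sample, "ascii", "backslashreplace")
numeric = encode_via_wrapper(sample, "ascii", "xmlcharrefreplace")
utf8 = encode_via_wrapper(sample, "utf-8", "strict")
```

With an 'ascii' target, 'replace' discards the characters, while 'backslashreplace' and 'xmlcharrefreplace' keep the code points in escaped form; with 'utf-8' nothing needs escaping at the Python level, so any escaping is left to the underlying logging system.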
Re: [Web-SIG] WSGI 2
At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote: 2009/8/4 P.J. Eby p...@telecommunity.com: I'm not clear on your logic here. If I request foo/bar/baz (where baz actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the script, then the (accented) baz is legitimate for pass-through to the application, no? Technically, but what I am pointing out is that Apache pretty well says that foo/bar needs to be UTF-8. Which doesn't change the fact that you haven't yet proposed what a WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and QUERY_STRING. Apache can and does pass through such bytes, so the spec needs to say what we do with them. If you are going to have different parts of the one URL needing a different encoding to be understood, personally I would say you asking for trouble. So, am saying that UTF-8 needs to really apply more for sake of sanity and portability. So what, precisely, are you proposing should happen when such bytes are present? So I guess the problem is more where URLs are already % encoded when coming back as href or form action because they may be in an encoding incompatible with UTF-8 if it were to be clicked on. Yep, that's the case with standard browsers and servers; less-standard situations such as spiders and scripts generating or following URLs are also relevant, as are deliberate hack attempts. So having the result of this behavior be undefined is a bad thing. The Apache server at least will decode those % escape sequence and I believe it is the result of that which is used in stuff like rewrite rule matches, not the raw URL. The only exception would be if rewrite rule explicit matched against REQUEST_URI variable which still contains % escape sequences. So if not in UTF-8, means effectively that you can't then match them with Apache rewrite rules then. That's got nothing to do with what you propose for WSGI to do with the rest of it, though. 
(However, your belief may be incorrect in any event, as this page: http://www.dracos.co.uk/code/apache-rewrite-problem/ claims that mod_rewrite can RewriteCond on THE_REQUEST in order to match still-encoded paths.)
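The difference between a still-encoded request URI and the bytes a server sees after %-decoding can be illustrated with the standard library. This is a sketch only: the path below is a made-up example, and is_valid_utf8 is a hypothetical helper, not something Apache or any WSGI server provides.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical request path; %E1 is "a with acute accent" in Latin-1.
raw_uri = "/foo/bar/b%E1z"

# Percent-decoding yields bytes, with no charset implied.
path_bytes = unquote_to_bytes(raw_uri)


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = is_valid_utf8(path_bytes)
latin1_text = path_bytes.decode("latin-1")  # Latin-1 decoding cannot fail
```

A rule written in UTF-8 would not match these decoded bytes, yet they would still reach the application, which is the undefined case being pressed on in this message.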
Re: [Web-SIG] WSGI 2
At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote: In summary, what are the practical uses cases that would make passing bytes over UTF-8 or even latin-1 worthwhile? My concern at this point is a nagging feeling that we are abandoning WSGI-HTTP equivalence for convenience in the face of changes in Python's defaults. Had Python 3 been the standard version in existence when WSGI 1 was created, I would've argued for making *everything* bytes, in order to: 1. Force all encodings to be explicit, and 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects) And this is why the original spec said that Unicode strings should be treated as bytes -- because byte strings were always the original target of the spec. Please remember that WSGI is not primarily intended to provide application developers with a convenient API; its first and most important job is to ship the data around without mangling it in the process. HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, it would be good to *also* support strings on the application side, especially for application migration. However, I see no reason to make *servers* provide decoded strings instead of bytes. So I would ask, what is the practical use case for having the server decode bytes into strings, instead of leaving them as bytes?
Re: [Web-SIG] WSGI 2
On Tue, Aug 4, 2009 at 11:05 AM, P.J. Ebyp...@telecommunity.com wrote: 1. Force all encodings to be explicit, and This can be handled without forcing application authors to work with bytestrings (or forcing them to remember to coerce to bytestrings before returning responses). 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects) TBH, WSGI doesn't expose enough of HTTP's functionality to convince me that this is a good argument. When I can use advanced HTTP features (chunked transfer and friends) from a WSGI app, maybe I'll feel differently. Please remember that WSGI is not primarily intended to provide application developers with a convenient API; its first and most important job is to ship the data around without mangling it in the process. Which it should try very hard to do without forcing *in*convenient APIs onto developers. So I would ask, what is the practical use case for having the server decode bytes into strings, instead of leaving them as bytes? Well, Django (for one example) already does some gymnastics to ensure that character encoding issues are kept at the request/response boundary, largely because it's an utter pain for an application developer to have an API dump a bunch of bytestrings in your lap and say here, *you* figure it out. I suspect we're going to keep on doing that, since it's a big win in terms of usability for application developers (who end up having to deal with only a drastically-reduced subset of character-encoding problems). -- Bureaucrat Conrad, you are technically correct -- the best kind of correct.
Re: [Web-SIG] WSGI 2
On Tue, Aug 4, 2009 at 12:05 PM, P.J. Ebyp...@telecommunity.com wrote: At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote: In summary, what are the practical uses cases that would make passing bytes over UTF-8 or even latin-1 worthwhile? My concern at this point is a nagging feeling that we are abandoning WSGI-HTTP equivalence for convenience in the face of changes in Python's defaults. Had Python 3 been the standard version in existence when WSGI 1 was created, I would've argued for making *everything* bytes, in order to: 1. Force all encodings to be explicit, and 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects) And this is why the original spec said that Unicode strings should be treated as bytes -- because byte strings were always the original target of the spec. Please remember that WSGI is not primarily intended to provide application developers with a convenient API; its first and most important job is to ship the data around without mangling it in the process. HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, it would be good to *also* support strings on the application side, especially for application migration. However, I see no reason to make *servers* provide decoded strings instead of bytes. +1 I haven't had enough time to follow this and earlier encoding discussions and so haven't commented up to now, but I've always been uncomfortable with WSGI using anything but bytes or assuming any encoding. I agree that application frameworks should deal with conversion between bytes and unicode. Jim -- Jim Fulton
Re: [Web-SIG] WSGI 2
On Aug 4, 2009, at 12:38 PM, James Bennett wrote: TBH, WSGI doesn't expose enough of HTTP's functionality to convince me that this is a good argument. When I can use advanced HTTP features (chunked transfer and friends) from a WSGI app, maybe I'll feel differently. But that works just fine today. Your WSGI app sends streaming data back using the iterator functionality, and the server automatically turns it into chunks if it's talking to an HTTP 1.1 client. What's the problem? James
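What "the server automatically turns it into chunks" might look like on the wire can be sketched as follows. This is an illustration of HTTP/1.1 chunked transfer coding applied to a WSGI body iterable, not code from any particular server; the function name chunked is hypothetical.

```python
def chunked(body_iter):
    """Frame each block from a WSGI body iterable as an HTTP/1.1 chunk."""
    for block in body_iter:
        if not block:
            # A zero-length chunk marks the end of the body, so skip empties.
            continue
        size_line = format(len(block), "x").encode("ascii")
        yield size_line + b"\r\n" + block + b"\r\n"
    # Terminating chunk, with no trailer headers.
    yield b"0\r\n\r\n"


# Example: what a server could write for a streaming application response.
wire = b"".join(chunked([b"hello", b"", b" world!"]))
```

The application only yields byte blocks; deciding whether to apply this framing is left to the server.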
Re: [Web-SIG] WSGI 2
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Jim Fulton wrote: On Tue, Aug 4, 2009 at 12:05 PM, P.J. Ebyp...@telecommunity.com wrote: At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote: In summary, what are the practical uses cases that would make passing bytes over UTF-8 or even latin-1 worthwhile? My concern at this point is a nagging feeling that we are abandoning WSGI-HTTP equivalence for convenience in the face of changes in Python's defaults. Had Python 3 been the standard version in existence when WSGI 1 was created, I would've argued for making *everything* bytes, in order to: 1. Force all encodings to be explicit, and 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects) And this is why the original spec said that Unicode strings should be treated as bytes -- because byte strings were always the original target of the spec. Please remember that WSGI is not primarily intended to provide application developers with a convenient API; its first and most important job is to ship the data around without mangling it in the process. HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, it would be good to *also* support strings on the application side, especially for application migration. However, I see no reason to make *servers* provide decoded strings instead of bytes. +1 I haven't had enough time to follow this and earlier encoding discussions and so haven't commented up to now, but I've always been uncomfortable with WSGI using anything but bytes or assuming any encoding. I agree that application frameworks should deal with conversion between bytes and unicode. +1 from me as well. The fact that Python3 now calls 'string' what used to be 'unicode' doesn't change the fact that transport-level operations have to be done in bytes. 
It should be the framework / application's job to handle conversion of byte inputs from the request into strings, and string response fields into bytes: ideally, the framework will do this in a way which keeps the application writer blissfully ignorant of the distinction. Note that I think Python3 gets the os.environ bit wrong for exactly the same reasons: I think anybody wanting to use the environment-as-provided-by-the-OS should deal in bytes (or whatever the OS provides), with a convenience wrapper for those who don't care about the difference. I lost that argument, but that doesn't mean I was wrong. :) Tres. - -- === Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software Excellence by Design http://palladion.com -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFKeHLg+gerLs4ltQ4RAiFjAJ9uZIkfxwh5w1aYiEdIpr+2yQ+iBwCeJiFM eUfWBoPwyzwHThkMwd24SZE= =lod9 -END PGP SIGNATURE-
Re: [Web-SIG] WSGI 2
On Tue, Aug 4, 2009 at 11:54 AM, James Y Knightf...@fuhm.net wrote: But that works just fine today. Your WSGI app sends streaming data back using the iterator functionality, and the server automatically turns it into chunks if it's talking to an HTTP 1.1 client. What's the problem? No, it doesn't work just fine today. Either the server has to assume that every response from that application should be chunked (which is wrong), or the application needs a way to tell the server to chunk. Turns out HTTP has a way to indicate that, but WSGI outright forbids its use. So instead you have to invent out-of-band mechanisms for the application to tell the server what to do, and in the process reinvent part of HTTP. -- Bureaucrat Conrad, you are technically correct -- the best kind of correct.
Re: [Web-SIG] WSGI 2
* Graham Dumpleton wrote: Now, the reason why Apache can't really handle anything besides UTF-8 relates to how filenames are encoded in the file system. Taking Windows first as it is the more obvious case. What Apache does there is take whatever path it has mapping to a script file, be it constructed partially from what is in Apache configuration and partially from what was supplied in URL from client, and converts it to UCS2 for passing to Windows file system routines. In converting to UCS2, Apache assumes that the path will be UTF-8. This means that the Apache configuration file has to be UTF-8 and that the URL as supplied by the client is UTF-8 as well after any URL character encoding is decoded. End result, can only handle UTF-8. This is the only platform where Apache does that, actually, because it doesn't work any other way on Windows (everything is passed to the system as UCS-2). So I wouldn't say that Apache requires UTF-8 everywhere. If I cared, I would even make it configurable on Windows, but I don't ;) [...] nd
Re: [Web-SIG] WSGI 2
* Jim Fulton wrote: On Tue, Aug 4, 2009 at 12:05 PM, P.J. Ebyp...@telecommunity.com wrote: HTTP moves bytes, therefore WSGI should move bytes. For practical reasons, it would be good to *also* support strings on the application side, especially for application migration. However, I see no reason to make *servers* provide decoded strings instead of bytes. +1 I haven't had enough time to follow this and earlier encoding discussions and so haven't commented up to now, but I've always been uncomfortable with WSGI using anything but bytes or assuming any encoding. I agree that application frameworks should deal with conversion between bytes and unicode. Another +1 from the peanut gallery. nd
Re: [Web-SIG] WSGI 2
James Bennett wrote: On Tue, Aug 4, 2009 at 11:54 AM, James Y Knightf...@fuhm.net wrote: But that works just fine today. Your WSGI app sends streaming data back using the iterator functionality, and the server automatically turns it into chunks if it's talking to an HTTP 1.1 client. What's the problem? No, it doesn't work just fine today. Either the server has to assume that every response from that application should be chunked (which is wrong), or the application needs a way to tell the server to chunk. Turns out HTTP has a way to indicate that, but WSGI outright forbids its use. So instead you have to invent out-of-band mechanisms for the application to tell the server what to do, and in the process reinvent part of HTTP.

It doesn't have to be out of band; CherryPy's wsgiserver will send a response chunked if the application provides no Content-Length response header.

    if status == 413:
        # Request Entity Too Large. Close conn to avoid garbage.
        self.close_connection = True
    elif "content-length" not in hkeys:
        # All 1xx (informational), 204 (no content),
        # and 304 (not modified) responses MUST NOT
        # include a message-body. So no point chunking.
        if status < 200 or status in (204, 205, 304):
            pass
        else:
            if self.response_protocol == 'HTTP/1.1':
                # Use the chunked transfer-coding
                self.chunked_write = True
                self.outheaders.append(("Transfer-Encoding", "chunked"))
            else:
                # Closing the conn is the only way to determine len.
                self.close_connection = True

Robert Brewer fuman...@aminus.org
Re: [Web-SIG] WSGI 2
On Aug 4, 2009, at 8:53 PM, Graham Dumpleton wrote: 2. How would use of bytes work for a CGI-WSGI bridge given that os.environ is not bytes? Where does one get what encoding was used for os.environ values so it can be converted back to bytes?

On Unix it's simple enough:

On py2.X on Unix: environ is bytes already.
On py3.0: you're screwed, because some env vars were discarded already.
On py3.1+: 'string'.encode(sys.getfilesystemencoding(), 'surrogateescape') should do it.

On Windows, I guess the OS environment is unicode, so I don't know precisely what to do to reversibly obtain the bytes sent from the end user's browser. It looks to me from the source code as if Apache will encode the bytes from the client (utf-8 or otherwise!) as the Unicode values 0x00 to 0xFF in the Windows environment, that is, as if decoding the client input in latin-1. But it does that for the following keys only:

HTTP_*
SERVER_*
REQUEST_*
QUERY_STRING
PATH_INFO
PATH_TRANSLATED

(from http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/arch/win32/mod_win32.c)

Other values are decoded from utf-8 (or, if passed through from an enclosing environment, passed through untouched -- via encoding into utf-8 for internal use and then decoding back from utf-8 to put back in the Windows environment.)

I'll note that while it's important to get this transformation correct for a CGI-WSGI bridge to work right in Windows, and thus is definitely a useful discussion to have here, it doesn't actually need to be part of the WSGI spec. James
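The py3.1+ recipe and the Windows/Apache behaviour described above can be sketched as two small helpers. Both function names are hypothetical, the sample value is made up, and the encoding is passed explicitly in the example so the result does not depend on the local filesystem encoding.

```python
import sys


def environ_str_to_bytes(value, encoding=None):
    """Recover the original bytes of an os.environ string on Unix (py3.1+)."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # 'surrogateescape' turns the lone surrogates produced on decoding
    # back into the undecodable bytes they came from.
    return value.encode(encoding, "surrogateescape")


def apache_win32_str_to_bytes(value):
    """Assumption from the message: Apache on Windows maps each client byte
    to U+0000..U+00FF for keys like PATH_INFO, i.e. a Latin-1 decode."""
    return value.encode("latin-1")


# A Latin-1 encoded path that is not valid UTF-8.
raw = b"/caf\xe9"
as_unix_str = raw.decode("utf-8", "surrogateescape")
unix_round_trip = environ_str_to_bytes(as_unix_str, "utf-8")

as_win_str = raw.decode("latin-1")
win_round_trip = apache_win32_str_to_bytes(as_win_str)
```

Either way the bridge ends up with the bytes the client sent, which is what a bytes-based WSGI environ would need.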
Re: [Web-SIG] WSGI 2
At 10:53 AM 8/5/2009 +1000, Graham Dumpleton wrote: Now, the main reason why I am throwing around alternate suggestions in the first place is that last time although people seem to be comfortable moving along with the idea of latin-1 everywhere, I knew of some who weren't happy with that, some not on the list, and who believed it should be bytes, but they weren't speaking up. I suspect that this was all a confusion to begin with; the primary function of Latin-1 in WSGI has been a way to represent bytes when all you have to represent them with is unicode strings. So, even when we've been talking Latin-1, what we really mean is bytes. ;-) In general, I think we want to require that servers must provide bytes, and accept both bytes and Latin-1 (maybe just ASCII?) strings. (I don't see a problem with environ keys being strings, though, since all the WSGI or CGI-defined keys are pure ASCII anyway. But I could just as easily go with bytes everywhere; I assume Py3 treats all-ascii byte strings and the equivalent unicode as being equal and hashing alike.)
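Two points in this message are easy to check interactively. The sketch below shows Latin-1 acting as a lossless carrier for arbitrary bytes, and tests the parenthetical assumption about environ keys; in Python 3 as released, bytes and str never compare equal, so the two key types are not interchangeable. The environ dict here is a made-up stand-in.

```python
# Latin-1 maps every byte value 0..255 to the code point with the same
# number, so any byte string survives a decode/encode round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# Pure-ASCII bytes and str are still distinct objects in Python 3.
keys_equal = (b"PATH_INFO" == "PATH_INFO")

# Hypothetical environ with str keys: a bytes key does not find the entry.
environ = {"PATH_INFO": "/"}
bytes_key_found = b"PATH_INFO" in environ
```

So a spec would have to pick one key type; str keys with bytes values is one combination consistent with what is suggested here.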
[Web-SIG] WSGI 2
So... what about WSGI 2? Let's not completely drop the ball on this. I *think* we were largely in agreement; debate got distracted by some async stuff, but I don't think we particularly have to deal with that for WSGI 2. I think we do more than enough if we figure out:

- WSGI in Python 3, i.e., with unicode
- some basic errata kind of stuff, like readline signature
- change the callable signature to remove start_response

Would this be a new PEP or a revision? I think it should be a new PEP, as WSGI 1 remains valid and the same as it always was, and PEP 333 describes that. Is there anyone willing to make the revisions? -- Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
Re: [Web-SIG] WSGI 2
At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote: Would this be a new PEP or a revision? I think it should be a new PEP, as WSGI 1 remains valid and the same as it always was, and PEP 333 describes that. +1 for a new PEP, since we'd be able to drop a lot of crufty examples and explanations about the cruddy bits. wsgiref should add 1-2 and 2-1 adapters. (Although technically, running a WSGI 1 application in a WSGI 2 server requires either threads or greenlets.) IMO, the main benefit of implementing WSGI 2 is to applications, not servers, with the possible exception of async servers (e.g. Twisted) that would prefer an iterator-only communications mode. Such servers could refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2-1 adapter. Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a standard 1-2 adapter to support WSGI 2.
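The easy direction, letting a WSGI 1 server host a start_response-free application, might look like the sketch below. It rests on a loud assumption: the thread only says WSGI 2 would remove start_response, so the (status, headers, body) return value used here is a guess at the calling convention, and wsgi2_as_wsgi1 and hello2 are hypothetical names.

```python
def wsgi2_as_wsgi1(app2):
    """Wrap a start_response-free app so a WSGI 1 server can call it.

    Assumption: app2(environ) returns a (status, headers, body) triple.
    """
    def app1(environ, start_response):
        status, headers, body = app2(environ)
        start_response(status, headers)
        return body
    return app1


def hello2(environ):
    """Example application in the assumed WSGI 2 style."""
    return "200 OK", [("Content-Type", "text/plain")], [b"hi"]


# Exercise the adapter with a stand-in for the server's start_response.
recorded = []


def fake_start_response(status, headers, exc_info=None):
    recorded.append((status, headers))


body = list(wsgi2_as_wsgi1(hello2)({}, fake_start_response))
```

The reverse direction is harder because a WSGI 1 application may call start_response or write() at any point while producing its body, which is why threads or greenlets come up above.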
Re: [Web-SIG] WSGI 2
2009/8/4 Mark Ramm mark.mchristen...@gmail.com: In summary, just seems more sane to have stuff in WSGI environment be dealt with as UTF-8. This sounds good to me. Rack, Jack, and even java servlets seem to make this assumption without significant trouble, and if nearly all existing web servers do it internally, that seems like an even better argument. What do they do for response side though? Do they have the bytes/string distinction that we are talking about, with bytes expected but strings accepted only if representable as latin-1? Graham
Re: [Web-SIG] WSGI 2
At 10:48 AM 8/4/2009 +1000, Graham Dumpleton wrote: 2009/8/4 P.J. Eby p...@telecommunity.com: At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote: Would this be a new PEP or a revision? I think it should be a new PEP, as WSGI 1 remains valid and the same as it always was, and PEP 333 describes that. +1 for a new PEP, since we'd be able to drop a lot of crufty examples and explanations about the cruddy bits. wsgiref should add 1-2 and 2-1 adapters. (Although technically, running a WSGI 1 application in a WSGI 2 server requires either threads or greenlets.) IMO, the main benefit of implementing WSGI 2 is to applications, not servers, with the possible exception of async servers (e.g. Twisted) that would prefer an iterator-only communications mode. Such servers could refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2-1 adapter. Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a standard 1-2 adapter to support WSGI 2. Personally I don't believe we should be trying to support async servers in the WSGI specification. I'm not suggesting adding anything for async servers; I'm just saying that they will likely prefer to use WSGI 2 and use a 2-1 adapter to do WSGI 1 support, whereas synchronous servers will likely prefer the reverse. The WSGI spec doesn't currently require streaming upload support, so if an async server wants to buffer the input (e.g. to a temp file) rather than trusting the application to handle reads, it's free to do so. (And that's independent of whether it's WSGI 1 or 2 being used.)
Re: [Web-SIG] WSGI 2
At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote: 1. When running under Python 3, applications SHOULD produce bytes output, status line and headers. This is effectively what we had before. The only difference is that clarify that the 'status line' values should also be bytes. This wasn't noted before. I had already updated the proposed WSGI 1.0 amendments page to mention this. +1 2. When running under Python 3, servers and gateways MUST accept strings for output, status line and headers. Such strings must be converted to bytes output using 'latin-1'. If string cannot be converted then is treated as an error. This is again what we had before except that mention 'status line' value. 3. When running under Python 3, servers MUST provide wsgi.input as a binary (byte) input stream. No change here. 4. When running under Python 3, servers MUST provide a text stream for wsgi.errors. In converting this to a byte stream for writing to a file, the default encoding would be applied. No real change here except to clarify that default encoding would apply. Use of default encoding though could be problematic if combining different WSGI components. This is because each WSGI component may have been developed on system with different default encoding and so one may expect to log characters that can't be written on a different setup. Not sure how you could solve that except to say people have default encoding be UTF-8 for portability. Also +1. 5. When running under Python 3, servers MUST provide CGI HTTP and server variables as strings. Where such values are sourced from a byte string, be that a Python byte string or C string, they should be converted as 'UTF-8'. If a specific web server infrastructure is able to support different encodings, then the WSGI adapter MAY provide a way for a user of the WSGI adapter to customise on a global basis, or on a per value basis what encoding is used, but this is entirely optional. Note that there is no requirement to deal with RFC 2047. 
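Point 2 of the quoted list (servers accept strings for status line and headers, convert with 'latin-1', and treat failure as an error) can be sketched in a few lines. The helper names to_wire and encode_response_head are hypothetical; this is not taken from mod_wsgi or any other server.

```python
def to_wire(value):
    """Pass bytes through; encode str as Latin-1, raising if it cannot be."""
    if isinstance(value, bytes):
        return value
    # UnicodeEncodeError for any character above U+00FF is the "error" case.
    return value.encode("latin-1")


def encode_response_head(status, headers):
    """Convert a status line and header list to bytes for transmission."""
    return to_wire(status), [(to_wire(name), to_wire(val)) for name, val in headers]


# str and bytes inputs are both accepted and end up as bytes.
head = encode_response_head("200 OK", [("X-Name", "caf\u00e9"), (b"X-Raw", b"ok")])

# A character outside Latin-1 is rejected rather than silently mangled.
try:
    to_wire("\u0e40")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Whether the status line should be restricted further to ASCII, as suggested elsewhere in the thread, would be a one-word change to the encoding name.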
This is where I am going to diverge from what has been discussed before. The reason I am going to pass as UTF-8 and not latin-1 is that it looks like Apache effectively only supports use of UTF-8. Since this means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and even CGI likely cannot handle anything besides UTF-8 then I really can't see the point of trying to cater for a theoretical possibility that some HTTP client could use something besides UTF-8. In other words, the predominant case will be UTF-8, so let us target that. So, rather than burden every WSGI application with the need to convert from latin-1 back to bytes and then to UTF-8, let the server deal with it, with server using sensible default, and where server infrastructure can handle a different encoding, then it can provide option to use that encoding and WSGI application doesn't need to change. Maybe I'm missing something here, but what if Apache receives something encoded in Latin-1? AFAIR, form POST encoding is determined by the encoding of the page containing the form; that's of course something that only happens in the input body, but what about URLs? Mainly I'm wondering, what should the server do in the event they receive a byte string which is not valid UTF-8? (Latin-1 doesn't have this problem, since there's no such thing as an invalid Latin-1 string, at least not at the encoding level.) Also shown though that SCRIPT_NAME part has to be UTF-8 and we would really be entering fantasy land if you were somehow going to cope with some different encoding for PATH_INFO and QUERY_STRING. Instead it is like the GPL, viral in nature. Use of UTF-8 in one particular area means you are effectively bound to use UTF-8 everywhere else. I'm not clear on your logic here. If I request foo/bar/baz (where baz actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the script, then the (accented) baz is legitimate for pass-through to the application, no? 
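The open question here, what a server should do with bytes that are not valid UTF-8, has at least one mechanical answer that has come up in the wider thread: try UTF-8 and fall back to Latin-1, reporting which charset was used. The sketch below shows that policy only as an illustration; it is not what the numbered proposal specifies, and decode_environ_value is a hypothetical name.

```python
def decode_environ_value(raw):
    """Decode request bytes as UTF-8 if possible, else as Latin-1.

    Returns (text, charset) so the caller can re-encode to the exact
    original bytes if it disagrees with the guess.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 accepts every byte sequence, so this branch cannot fail.
        return raw.decode("latin-1"), "latin-1"


# Valid UTF-8 is decoded as such.
utf8_result = decode_environ_value(b"/caf\xc3\xa9")

# A Latin-1 encoded path segment is invalid UTF-8 and takes the fallback.
latin1_result = decode_environ_value(b"/foo/bar/b\xe1z")

# In both cases the original bytes remain recoverable.
recovered = latin1_result[0].encode(latin1_result[1])
```

Without the charset being reported alongside the text, an application could not tell which branch was taken, which is the kind of underspecification being objected to.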
I just tried testing this with Firefox and Apache, and found that you can in fact pass such Latin-1 strings through to PATH_INFO, but at least in the case of Firefox, you have to %-escape them. However, they are seen by Python (via os.environ) as latin-1 encoded byte strings. Further example of why UTF-8 reaches into everything is mod_rewrite module for Apache. This allows you to do stuff related to SCRIPT_NAME, PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache configuration file has to be UTF-8. If URL isn't, then wouldn't be possible to perform matches against non latin-1 characters in a rewrite condition or rule. This is because your match string would be in different encoded form to that in URL and so wouldn't match. Note that this still doesn't have any impact on the bytes that actually reach the application, which can be non-UTF8. At minimum, the proposal is underspecified as to how to handle this case, which is as trivial to generate as