Re: [Web-SIG] WSGI 2: Decoding the Request-URI
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:

> However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading / character! The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI here, but about all request headers.)

Decoding everything as ISO-8859-1 has the nice property of keeping the information intact. It would be a good heuristic if everything, with a few exceptions, really was encoded using ISO-8859-1: just transcode the few problematic cases at the application level and everybody is happy. A string decoded from ISO-8859-1 is like a bytes object with a string 'interface' on top of it.

But it sweeps the encoding problem under the carpet. The problem with Python 2 was that str and unicode were almost the same, so much so that it was possible to mix them without much trouble:

    >>> 'foo' == u'foo'
    True

Python 3 made bytes and strings 'incompatible' to force programmers to handle the encoding problem as soon as possible:

    >>> b'foo' == 'foo'
    False

If the gateway passes `str` to the application, the application author could believe that the encoding problem has been handled. But in most cases it hasn't been handled at all. The application author still has to transcode every string that was decoded with the wrong charset.
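To illustrate the point about ISO-8859-1 keeping the information intact, here is a minimal sketch. It assumes a gateway that decodes header bytes as ISO-8859-1 and an application that knows the bytes were really UTF-8. The sample value and the `transcode` helper are hypothetical and not part of any WSGI draft.

```python
def transcode(value, charset="utf-8"):
    """Recover real text from a str that was decoded as ISO-8859-1.

    Hypothetical application-level helper, not part of any WSGI spec.
    """
    return value.encode("iso-8859-1").decode(charset)


# Assumed example: the client sent "/café" as UTF-8 bytes.
raw = "/café".encode("utf-8")

# ISO-8859-1 maps each byte to one code point, so decoding never fails...
from_gateway = raw.decode("iso-8859-1")

# ...and encoding it back returns exactly the original bytes.
round_trip = from_gateway.encode("iso-8859-1")

# The str from the gateway is not yet text; the application must transcode it.
text = transcode(from_gateway)
```

Both `from_gateway` and `text` are `str` objects, yet they compare unequal. Nothing in the type tells the application which of the two it is holding.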
We are back to Python 2's bad old days, where we can't be sure that what we got is properly encoded: Was that string decoded using latin-1? Maybe a middleware transcoded it to UTF-8 before the application was called. Maybe the application itself transcoded it at some point, but then we need to keep track of what was transcoded. Maybe the application should transcode everything when it is called.

Also, EVERY application author will have to read the PEP, especially the paragraph saying: "Everything we give you is a string, but you still have to deal with the encoding mess." Otherwise they will run into weird problems, just as they did with Python 2, because the interface is not clear.

Strings are supposed to be text and only text. Decoding everything as ISO-8859-1 means strings are not text anymore; they are 'encoded data' [1]. bytes are supposed to be 'encoded data' and binary blobs. If applications are given bytes, the author knows right away that they must be decoded. No need to read the PEP.

`bytes` can do everything `str` can do, with the notable exception of 'format':

    >>> import re
    >>> b'foo bar'.title()
    b'Foo Bar'
    >>> b'/foo/bar/fran\xc3ois'.split(b'/')
    [b'', b'foo', b'bar', b'fran\xc3ois']
    >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
    (b'foo', b'1234')

I understand that `bytes` is an unfamiliar beast. But I believe the encoding problem is the realm of the application, not the realm of the gateway. Let the application handle the encoding problem and don't give it a half-baked solution.

Using bytes also has its own set of problems. The standard library doesn't support bytes very well. For example, urllib.parse.unquote() doesn't work with bytes, and other parts of urllib.parse have issues too.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

--
Henry Prêcheur
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] WSGI 2: Decoding the Request-URI
I wrote:

> Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's.
> ...
> As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application.
> ...
> In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding. In other words, the URI spec seems to imply that the two URI's /a%c3%bf and /a%ff may be equivalent, if the former is u'/a\u00FF' encoded in UTF-8 and the latter is u'/a\u00FF' encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard.
> ...
> For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.

Robert Brewer
fuman...@aminus.org
Re: [Web-SIG] WSGI 2: Decoding the Request-URI
At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:

> Did I say 3 reasons? I meant 4: Accept-Charset.

Chief amongst the reasons... amongst our reasonry... Right, we'll come in again. ;-)
[Web-SIG] WSGI 2: Decoding the Request-URI
I wrote:

> PATH_INFO and QUERY_STRING are ... decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever is successful is stored at environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed.

and Ian replied:

> My understanding is that PATH_INFO *should* be UTF-8 regardless of what encoding a page might be in. At least that's what I got when testing Firefox. It might not be valid UTF-8 if it was manually constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the encoding of the document [1] or Windows-1252 [2] for the querystring. But the vast majority of HTTP user agents are not browsers [3]. Even if that were not so, we should not define WSGI to only interoperate with the most current browsers.

and Graham added:

> Thinking about it for a while, I get the feeling that having a fallback to latin-1 if a string cannot be decoded as UTF-8 is a bad idea. That URLs wouldn't consistently use the same encoding all the time just seems wrong. I would see it as returning a bad request status.
>
> If an application coder knows they are actually going to be dealing with latin-1, as that is how the application is written, then they should be specifying it should be latin-1 always instead of utf-8. Thus, the WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into URI's) and do transfer them in media types like HTML, which define how to encode a.href's and form.action's before %-encoding them [4]. But these are not the only vectors by which clients obtain or generate Request-URI's.

> For simple WSGI adapters which only service one WSGI application, it would apply to the whole URL namespace.

As someone (Alan Kennedy?)
noted at PyCon, static resources may depend upon a filename encoding defined by the OS which is different than that of the rest of the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI comparison. Comparison is at the heart of handler dispatch, static resource identification, and proper HTTP cache operation. It is for these reasons that RFC 3986 has an extensive section on the matter [5], including a ladder of approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing showed it to be)

I think it would be beneficial to those who develop WSGI application interfaces to be able to assume that at least case-, percent-, path-, and scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by all WSGI 2 origin servers. All of those except for the first one can be accomplished without decoding the target URI. But that first section specifically states:

    In practical terms, character-by-character comparisons should be done codepoint-by-codepoint after conversion to a common character encoding.

In other words, the URI spec seems to imply that the two URI's /a%c3%bf and /a%ff may be equivalent, if the former is u'/a\u00FF' encoded in UTF-8 and the latter is u'/a\u00FF' encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ values must be byte strings. IMO WSGI 2 should do better in this regard.

> For something like Apache, where it could map to multiple WSGI applications, it may want to provide a means of overriding the encoding for specific subsets of URLs, i.e., using the Location directive for example.
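As a rough illustration of the /a%c3%bf versus /a%ff comparison and of the proposed UTF-8-then-ISO-8859-1 fallback, here is a minimal sketch. The `decode_path` helper is hypothetical and only shows the idea; it is not how any particular WSGI server decodes the Request-URI. The returned encoding name stands in for what the proposal would store at environ['REQUEST_URI_ENCODING'].

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, charset="utf-8"):
    """Return (text, encoding_used) for a percent-encoded path.

    Hypothetical helper: tries `charset` first, then falls back to
    ISO-8859-1, which accepts any byte sequence.
    """
    data = unquote_to_bytes(raw_path)  # undo %-encoding, yielding raw bytes
    try:
        return data.decode(charset), charset
    except UnicodeDecodeError:
        return data.decode("iso-8859-1"), "iso-8859-1"


# The two URIs from the text: the same character in two encodings.
utf8_text, utf8_enc = decode_path("/a%c3%bf")
latin1_text, latin1_enc = decode_path("/a%ff")

# After conversion to a common representation (code points), they compare equal.
same_resource = utf8_text == latin1_text
```

One caveat of this sketch: a byte sequence that was meant as ISO-8859-1 but happens to be valid UTF-8 is decoded as UTF-8, so the fallback is a heuristic and the recorded encoding is only a best guess.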
For the three reasons above, I don't think we can assume that the application will always receive equivalent URI's encoded in a single, foreseen encoding. Yet we still haven't answered the question of how to handle unforeseen encodings.

You're right that, if the server-side stack as a whole cannot map a particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate an appropriate handler; sometimes just the leading / character! The remaining characters are often passed as function arguments to the handler, or stuck in some parameter list/dict. In many cases, the charset used to decode these values either: is unimportant; follows complex rules from one resource to another; or is merely reencoded, since the application really does care about bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI environ entry to declare the charset which was used to decode) can handle all of these cases. Server configuration options cannot, at least