On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:

> However, we quite often use only a portion of the URI when attempting
> to locate an appropriate handler; sometimes just the leading "/"
> character! The remaining characters are often passed as function
> arguments to the handler, or stuck in some parameter list/dict. In
> many cases, the charset used to decode these values either: is
> unimportant; follows complex rules from one resource to another; or is
> merely reencoded, since the application really does care about bytes
> and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
> environ entry to declare the charset which was used to decode) can
> handle all of these cases. Server configuration options cannot, at
> least not without their specification becoming unwieldy.
(Just to make things clear: I am not talking only about REQUEST_URI here, but about all request headers.)

Encoding everything using ISO-8859-1 has the nice property of keeping the information intact. It would be a good heuristic if everything, with a few exceptions, were decoded using ISO-8859-1: just transcode the few problematic cases at the application level and everybody is happy. A string decoded from ISO-8859-1 is like a bytes object with a str 'interface' on top of it. But it sweeps the encoding problem under the carpet.

The problem with Python 2 was that str and unicode were almost the same; so much the same that it was possible to mix them without too many problems:

>>> 'foo' == u'foo'
True

Python 3 made bytes and str 'incompatible' to force programmers to handle the encoding problem as soon as possible:

>>> b'foo' == 'foo'
False

By passing str() to the application, the application author could believe that the encoding problem has been handled. But in most cases it hasn't been handled at all: the application author still has to transcode all the strings that were decoded with the wrong charset. We are back to Python 2's bad old days, where we can't be sure that what we got is properly encoded. Was that string decoded using latin-1? Maybe a middleware transcoded it to UTF-8 before the application was called. Maybe the application itself transcoded it at some point, but then we need to keep track of what was transcoded. Maybe the application should transcode everything when it is called.

Also, EVERY application author will have to read the PEP, especially the paragraph saying:

> Everything we give you is a str, but you still have to deal
> with the encoding mess.

Otherwise he will run into weird problems just like he did with Python 2, because the interface is not clear. Strings are supposed to be text, and only text. Encoding everything to ISO-8859-1 means strings are not text anymore; they are 'encoded data' [1]. bytes are supposed to be 'encoded data' and binary blobs.
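The "transcode at the application level" step relies on ISO-8859-1's lossless round trip, which is worth making concrete. A minimal sketch (the UTF-8 charset and the example path are my own assumptions for illustration, not anything a gateway would know):

```python
# A gateway decodes raw header bytes as ISO-8859-1; since ISO-8859-1 maps
# every byte 0x00-0xFF to a code point, this never fails and loses nothing.
raw = b'/fran\xc3\xa7ois'            # UTF-8 bytes for '/françois'
wsgi_str = raw.decode('iso-8859-1')  # the 'bytes with a str interface' string

# Round-trip property: re-encoding recovers the original bytes exactly.
assert wsgi_str.encode('iso-8859-1') == raw

# Application-level transcode, assuming the app knows the real charset:
path = wsgi_str.encode('iso-8859-1').decode('utf-8')
print(path)  # /françois
```

The information survives, but only because the application remembers to undo the latin-1 decode; nothing in the str type itself records that this step is still pending.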
By giving applications bytes, the author knows right away that he should decode them; no need to read the PEP. bytes can do almost everything str can do, with the notable exception of format():

>>> b'foo bar'.title()
b'Foo Bar'
>>> b'/foo/bar/fran\xc3ois'.split(b'/')
[b'', b'foo', b'bar', b'fran\xc3ois']
>>> import re
>>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
(b'foo', b'1234')

I understand that bytes() is an unfamiliar beast. But I believe the encoding problem is the realm of the application, not the realm of the gateway. Let the application handle the encoding problem, and don't give it a half-baked solution.

Using bytes also has its own set of problems: the standard library doesn't support bytes very well. For example, urllib.parse.unquote() doesn't work with bytes, and other parts of urllib.parse have similar issues.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

-- 
Henry Prêcheur
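P.S. To show the urllib gap is an annoyance rather than a blocker: a bytes-in, bytes-out percent decoder is only a few lines. `unquote_bytes` below is my own throwaway helper, not anything from the stdlib, and the example path is made up:

```python
def unquote_bytes(data: bytes) -> bytes:
    """Percent-decode without ever guessing a charset: bytes in, bytes out."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i+1] == b'%' and len(data) >= i + 3:
            try:
                # int() accepts the two hex digits directly as bytes.
                out.append(int(data[i+1:i+3], 16))
                i += 3
                continue
            except ValueError:
                pass  # malformed escape: fall through, keep the literal '%'
        out += data[i:i+1]
        i += 1
    return bytes(out)

print(unquote_bytes(b'/bar/fran%C3%A7ois'))  # b'/bar/fran\xc3\xa7ois'
```

Because it never decodes to text, the application can defer the charset decision for as long as it likes.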
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com