Hi,

I spent the last few hours figuring out what decisions Python took in the standard library, to get a better understanding of unicode in Python 3 and how it affects web applications.
Let's sum up the current state of encodings in the web world:

RFC 2616 specifies the header encoding as latin1 (iso-8859-1). The majority of header values are ASCII only; the only exceptions, apart from custom headers and things like the server name, are the cookie headers. Cookie headers are problematic for other reasons as well, because some browsers (IE, for example) have different ideas about cookies than others. I've seen many people using utf-8 encoded cookie values, so it's pretty common to have headers with values outside the latin1 range. However, to remind everybody: latin1 can carry utf-8 encoded payloads without loss of information if you do the encode/decode dance, because latin1 maps every byte to exactly one code point.

For URIs/IRIs there is a bit of a problem as well. URLs are encodingless but limited to ASCII; values outside the ASCII range have to be %-encoded, but the charset is specified nowhere. Browsers changed the URL encoding behavior to utf-8 a few years ago (I think Mozilla changed it with Firefox 1.5 or Firefox 2). They still try latin1 as well if they are totally clueless and get a 404 or something; I'm not exactly sure how that is supposed to work. The new thing is IRIs. They can contain any non-ASCII character and are considered to be UTF-8; it is possible to quote utf-8 encoded code points with %-encoding. IRIs might also contain unicode characters in the hostname, which for URIs appear to be idna/punycode encoded. Eg:

    IRI: http://üser:pässw...@☃.net/påth
    URI: http://%C3%BCser:p%c3%a4ssw...@xn--n3h.net/p%C3%A5th

There are already Python implementations that convert between URIs and IRIs (for example in Werkzeug 0.6).

Form data: form data is encoded by all browsers in the charset of the page that contains the form. However, if the encoding declaration is missing from the HTTP header, the browser runs a character-set guessing algorithm. This algorithm is currently browser dependent but might be specified as part of HTML5; at least there is a section about it in the current draft.
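The encode/decode dance and the IRI-to-URI conversion above can be sketched with nothing but the standard library (a minimal illustration of the idea, not Werkzeug's actual implementation; the variable names are mine):

```python
from urllib.parse import quote

# The "dance": utf-8 bytes survive a latin1 decode/encode round trip,
# because latin1 maps each of the 256 byte values to exactly one
# code point.
wire_bytes = "påth".encode("utf-8")         # bytes as sent by a browser
header_value = wire_bytes.decode("latin1")  # what a latin1-minded layer sees
recovered = header_value.encode("latin1").decode("utf-8")
print(recovered)  # -> påth

# IRI to URI, sketched with stdlib pieces: punycode the hostname,
# %-encode the utf-8 bytes of the path.
host = "☃.net".encode("idna").decode("ascii")
path = quote("/påth")  # quote() works on utf-8 bytes here
print("http://" + host + path)  # -> http://xn--n3h.net/p%C3%A5th
```

Note that the information survives in both directions only as long as every layer agrees to treat the latin1 string as an opaque byte carrier.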
This is a lot of charsets. So for most applications the charsets look like this:

    page encoding:     utf-8
    headers:           invalid latin1 with utf-8 payload
    form submissions:  utf-8
    urls:              utf-8

This is also the only configuration that looks reasonable; all the others fall back to utf-8 on modern browsers every once in a while (for example, if an IRI is used in an HTML document on an external resource, the browser will try utf-8 for the URL, even if that URL is in fact latin1). For Python 3, the standard library took the safe path and chose utf-8 as the standard encoding for URLs. The biggest grief I have with this is that URLs have to be 'str' in Python 3 (remember, that's unicode). This works and is probably a step in a better direction, but I would welcome the addition of an IRI module and would advertise the use of IRIs internally. (For the 'bytes' problems see further below.)

Another situation where the standard library decided to go with unicode instead of bytes is the HTTP server and clients. There Python assumes latin1 for headers (which is correct on paper). Unfortunately that complicates things a lot. Graham is right in mentioning that operating on bytes in Python 3 is a lot harder than it was in Python 2. And I'm not even talking about the missing implicit conversion, but about missing functionality on the bytes object.
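A quick illustration of what that utf-8 default means in practice (assuming Python 3's urllib.parse; the latin1 URL is my own contrived example):

```python
from urllib.parse import unquote

# utf-8 %-escapes decode cleanly, because unquote() assumes utf-8:
good = unquote("/p%C3%A5th")
print(good)  # -> /påth

# ...but a latin1-quoted URL ('å' is %E5 in latin1) comes out
# garbled, since unquote() uses errors='replace' by default:
garbled = unquote("/p%E5th")
print(garbled)  # -> /p\ufffdth

# The information is still recoverable, but only if the application
# knows to ask for latin1 explicitly:
fixed = unquote("/p%E5th", encoding="latin1")
print(fixed)  # -> /påth
```

So a non-utf-8 application is not impossible, but it has to fight the defaults at every call site.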
Here are some common idioms found in low-level WSGI code that no longer work:

String formatting:

    >>> b"%d %s" % (200, "OK")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

Integer to ASCII:

    >>> bytes(8)
    b'\x00\x00\x00\x00\x00\x00\x00\x00'
    >>> bytes(str(8))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: string argument without an encoding
    >>> str(8).encode("ascii")
    b'8'

urllib.parse appears to be buggy with bytestrings:

    >>> parse.quote_plus('föö'.encode('utf-8'))
    'f%C3%B6%C3%B6'
    >>> parse.unquote_plus('f%C3%B6%C3%B6')
    'föö'
    >>> parse.unquote_plus(b'f%C3%B6%C3%B6')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python31\lib\urllib\parse.py", line 404, in unquote_plus
        string = string.replace('+', ' ')
    TypeError: expected an object with the buffer interface

I'm pretty sure the latter is a bug and I will file one; however, if there is broken behavior with bytestrings in Python 3.1, that's another thing we have to keep in mind. Form data handling in Python 3 based on cgi.FieldStorage also currently assumes unicode strings and, from what I've read so far, doesn't work in Python 3.1, but I have not confirmed that. In my opinion it was a mistake to force the unicode behavior on these parts of the standard library, but now it has happened, and it affects the WSGI specification as well. Based on what I've read in the code so far, I'm pretty sure we have to find some statistics about how many non-utf-8 applications still exist in the wild, and whether we have use cases where the raw bytes are necessary. Unfortunately the bytes approach no longer sounds that easy to implement, given that the standard library no longer supports bytes for many lower-level operations and that the bytes object does not provide any sort of string formatting. However, that does not make the unicode approach any less evil.
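For completeness, here is what the broken idioms above collapse into on Python 3.1 (nothing official, just the obvious workarounds): format as str first and encode at the end, and hand unquote_plus() a str instead of bytes.

```python
from urllib import parse

# String formatting: bytes have no % operator in Python 3.1, so
# format as str and encode with latin1, the nominal header charset.
status_line = ("%d %s" % (200, "OK")).encode("latin1")
print(status_line)  # -> b'200 OK'

# Integer to ASCII bytes, avoiding the bytes(8) zero-fill trap:
length = str(8).encode("ascii")
print(length)  # -> b'8'

# unquote_plus() chokes on bytes, so decode to str first; the
# %-escaped form is pure ASCII, so the decode is lossless.
decoded = parse.unquote_plus(b"f%C3%B6%C3%B6".decode("ascii"))
print(decoded)  # -> föö
```

None of this is hard, but it is exactly the kind of boilerplate every low-level WSGI implementation will have to carry around.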
Unless we find a way to properly support unicode without losing information, and in a way that keeps porting applications possible, I'm strongly against it.

Regards,
Armin

_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com