On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
> What is this horrible encoding "bytes-as-unicode"?

It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1
is the encoding specified by the HTTP RFC, as well as having the happy
property of preserving every input byte. PEP 3333 requires it.

> os.environ is supposed to be correctly decoded and contain valid
> unicode characters.

It is not possible to ‘correctly’ decode to unicode for os.environ
because that decoding happens long before the web application (the only
party that knows what encoding should be in use) gets a look in.

Maybe the web application is using UTF-8, maybe it's using cp1252, but
if we let the server/gateway decide and do that decoding before the
application can do anything about it, we will get the wrong encoding in
*many* cases and the result will be permanent, unrecoverable mangling of
non-ASCII characters in submitted headers.

> If WSGI uses another encoding than the locale encoding (which is a bad
> idea),

It's an absolutely necessary idea.

The locale encoding is nothing to do with the web application's
encoding. Windows applications need to be able to use UTF-8 (which is
never the ANSI code page), and web applications in general need to be
deployable to any server without having to worry about the server's
locale.

The locale-dependent status quo is that non-ASCII characters in URL
paths and other HTTP headers don't work for Python apps. The recoding
dances present in wsgiref's CGIHandler for 3.2 are distasteful but
completely necessary to normalise differences in encodings used by
various servers and platforms to generate their CGI environment.

> it should use os.environb and decodes keys and values using its
> own encoding.

Well yes, but:

(a) os.environb doesn't exist in Python 3.1 or earlier, making it
impossible to implement WSGI before 3.2;

(b) a byte environment on Windows would have to be encoded from the
Unicode environment, with a server-specific encoding, and then what
encoding are you going to choose for the variables that contain
non-HTTP-sourced native Unicode strings (such as, very commonly,
Windows pathnames)?

The bytes-or-bytes-in-Unicode argument is something that has been
bounced around Web-SIG for literally *years*; this is what we ended up
with. Although I personally like bytes, frankly, a re-run of this
argument *again* whilst WSGI remains in perpetual stalemate does not
appeal.

WSGI and wsgiref in Python 3.0-3.1 simply do not work. This has long
been an embarrassing situation for what is supposed to be a leading web
language. Let us not perpetuate this sorry story to 3.2 as well.

-- 
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com
skype:uknrbobince gtalk:chat?jid=bobi...@gmail.com
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
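
For anyone who wants to see the "bytes-as-unicode" round trip from this
message in concrete terms, here is a minimal sketch. The helper names
to_wsgi_str and recode are hypothetical (they are not part of wsgiref or
PEP 3333), and UTF-8 is only an assumed application encoding chosen for
the example.

```python
# Illustrative sketch only: helper names are hypothetical, not wsgiref API.

def to_wsgi_str(raw: bytes) -> str:
    """Decode raw header bytes as ISO-8859-1, as PEP 3333 describes."""
    return raw.decode("iso-8859-1")


def recode(value: str, app_encoding: str) -> str:
    """Recover the original bytes, then decode with the app's own encoding."""
    return value.encode("iso-8859-1").decode(app_encoding)


# Bytes as a UTF-8 client might send them for the path "/caf\u00e9".
raw_path = "/caf\u00e9".encode("utf-8")

# What the application would see in its environ: one code point per byte.
wsgi_path = to_wsgi_str(raw_path)

# The ISO-8859-1 step is lossless, so the original bytes are recoverable...
assert wsgi_path.encode("iso-8859-1") == raw_path

# ...and the application, which knows its own encoding, can decode properly.
assert recode(wsgi_path, "utf-8") == "/caf\u00e9"

# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
assert to_wsgi_str(all_bytes).encode("iso-8859-1") == all_bytes

# Contrast: a server that guesses wrong (here ASCII, with replacement)
# destroys the non-ASCII bytes before the application ever sees them.
lossy = raw_path.decode("ascii", errors="replace")
assert "\ufffd" in lossy
```

The last few lines show the failure mode the message warns about: once
the gateway has decoded with the wrong encoding, the original bytes are
gone and the application cannot undo the damage.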