On Thursday, 6 January 2011 at 23:50 +0000, And Clover wrote:
> On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
> > What is this horrible encoding "bytes-as-unicode"?
>
> It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1
> is the encoding specified by the HTTP RFC, as well as having the happy
> property of preserving every input byte. PEP 3333 requires it.

ISO-8859-1 for all fields: SERVER_NAME, PATH_INFO, the URL, form data, ...?

> > os.environ is supposed to be correctly decoded and contain valid
> > unicode characters.
>
> It is not possible to ‘correctly’ decode to unicode for os.environ
> because that decoding happens long before the web application (the
> only party that knows what encoding should be in use) gets a look in.

Agreed.

> Maybe the web application is using UTF-8, maybe it's using cp1252,
> but if we let the server/gateway decide and do that decoding (...)
> It's an absolutely necessary idea. The locale encoding is nothing
> to do with the web application's encoding. (...)

OK, so you must pass byte strings to the server/gateway. If you pass
unicode, how does the server/gateway know that it has to redecode a
value? Should it redecode all values? Anyway, it is stupid to use a
temporary, useless pseudo-encoding (bytes-in-unicode).

> The recoding dances present in wsgiref's CGIHandler for 3.2 are
> distasteful but completely necessary to normalise differences in
> encodings used by various servers and platforms to generate their CGI
> environment.

I don't understand why read_environ() gives unicode values: as you
explained, the server/gateway will have to encode the values again, and
then finally decode them from the correct encoding.

On POSIX, the current code looks like this:

 a) the OS passes a bytes environ to the program
 b) Python decodes environ from the locale encoding
 c) wsgi.read_environ() encodes environ to the locale encoding to get
    back the original bytes environ: this step can be skipped if
    os.environb is available
 d) wsgi.read_environ() decodes environ from ISO-8859-1
 e) the server/gateway encodes environ to ISO-8859-1
 f) the server/gateway decodes environ from the right encoding

Hey! Don't you think that there are useless encode/decode steps here?
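
The chain (a)-(f) above can be traced on a single value. This is only a
minimal sketch, not wsgiref's actual code: the sample path, the "ascii"
locale encoding and the UTF-8 application encoding are all assumptions
chosen for illustration, and the surrogateescape error handler is the one
Python 3 uses when it decodes os.environ on POSIX.

```python
# (a) Hypothetical raw value handed over by the OS: "/café" in UTF-8.
raw = b"/caf\xc3\xa9"

# Assumptions for this sketch only: a C/POSIX-style locale encoding and
# a web application that expects UTF-8.
locale_encoding = "ascii"
app_encoding = "utf-8"

# (b) Python decodes environ from the locale encoding.
step_b = raw.decode(locale_encoding, "surrogateescape")
# (c) read_environ() encodes back to recover the original bytes.
step_c = step_b.encode(locale_encoding, "surrogateescape")
# (d) read_environ() decodes from ISO-8859-1 (bytes-in-unicode).
step_d = step_c.decode("iso-8859-1")
# (e) the server/gateway encodes to ISO-8859-1 again.
step_e = step_d.encode("iso-8859-1")
# (f) the server/gateway decodes from the right encoding.
step_f = step_e.decode(app_encoding)

# Shorter path: decode the original bytes exactly once.
direct = raw.decode(app_encoding)

print(step_c == raw, step_e == raw, step_f == direct)
```

In this sketch, steps (c) and (e) both yield the same bytes as (a), so
the intermediate str values carry no extra information.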
Especially (d)-(e) is useless and introduces confusion: the environ
contains other keys that don't come from os.environ and are already
correctly decoded, so how does the server/gateway know that they are
already correctly decoded?

I propose simply (for Python 3.2):

 a) the OS passes a bytes environ to the program: wsgi.read_environ()
    uses it
 b) the server/gateway decodes environ from the right encoding

and...

> (a) os.environb doesn't exist in previous Python 3.1, making it
> impossible to implement WSGI before 3.2;

For Python 3.1, add a step between (a) and (b): encode environ to the
locale encoding (with surrogateescape) to get back the original bytes
environ.

> (b) a byte environment on Windows would have to be encoded
> from the Unicode environment, with a server-specific encoding,
> and then what encoding are you going to choose for the variables
> that contain non-HTTP-sourced native Unicode strings (such as,
> very commonly, Windows pathnames)?

The variables coming from the HTTP server should be encoded again to the
server-specific encoding. Other variables should be kept unchanged. The
server/gateway can simply test the type of the variable: if it's
unicode, there is nothing to do; if it's bytes, decode it from the
correct encoding.

> The bytes-or-bytes-in-Unicode argument is something that has been
> bounced around Web-SIG for literally *years*; (...) WSGI and wsgiref
> in Python 3.0-3.1 simply does not work.

I don't understand why you are attached to this horrible hack
(bytes-in-unicode). It introduces more work and more confusion than
using raw bytes unchanged. It doesn't work, and so something has to be
changed.

Victor
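
The type test described in this message (leave unicode values alone,
decode bytes values from the correct encoding) could look roughly like
the following. It is a sketch under stated assumptions: decode_environ
is a hypothetical helper, not part of wsgiref, and the sample keys,
values and the UTF-8 encoding are made up for illustration.

```python
def decode_environ(environ, encoding):
    """Return a copy of environ with bytes values decoded to str.

    Hypothetical helper: str values are assumed to be correctly decoded
    already and are passed through unchanged.
    """
    decoded = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            # Raw value from the HTTP server: decode it exactly once.
            decoded[key] = value.decode(encoding)
        else:
            # Native unicode value (for example a Windows pathname).
            decoded[key] = value
    return decoded


# Made-up mixed environ: one raw bytes value, one native str value.
environ = {
    "PATH_INFO": b"/caf\xc3\xa9",
    "SCRIPT_FILENAME": "C:\\www\\app.py",
}
result = decode_environ(environ, "utf-8")
print(result)
```

The type of each value then carries the "already decoded or not"
information that a bytes-in-unicode string cannot express.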