On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote: > The decoding doesn't change spontaneously. > You either get the correct one or you get an incorrect one. If it's > incorrect, you fix it, one time, via a WSGI component which you've > configured to determine the "correct" decoding. Then every other WSGI > component "below" that one can go back to trusting the decoding was > correct. In fact, if you do that transcoding right away, no other WSGI > components need to be rewritten to take advantage of unicode. You just > have to deploy a single transcoder, that's 6 lines of code max.
And you can do that with utf8+surrogateescape too. Except that you don't have to determine what encoding the gateway sent you, it's always utf8+surrogateescape. > With utf8+surrogateescape, you don't transcode once, you transcode in > every WSGI component in your stack that needs to "correct" the > decoding. You have to do it more than once because, each time you > encode/re-decode, you use the result and then throw it away. Any > subsequent WSGI components have to encode/re-decode--you cannot store > the redecoded URI in SCRIPT_NAME/PATH_INFO, because the > utf8+surrogateescape scheme says...well, it's always utf8-decoded. You don't get something REALLY important with surrogateescape: You can ALWAYS get the original bytes back. >>> b = b'fran\xe7cois' >>> s = b.decode('utf8', 'surrogateescape') >>> s 'fran\udce7cois' >>> s.encode('utf8', 'surrogateescape') b'fran\xe7cois' See? I got my latin-1 character '\xe7' back! Because '\udce7' is not a normal UTF-8 character, this character use some 'free space' in the unicode supplementary characters. The only thing you have to do is to pass 'surrogateescape' each time you call encode/decode. > In addition, *every* component that needs to compare URI's then has to > be configured with the same logic, however convoluted, to perform the > "correct" decoding again. It's not just routing middleware: caches > need to reliably compare decoded URI's; so do sessions; so does auth > (especially!); so do static files. And Heaven forfend you actually > decode differently in two different components! I don't understand why I would need to throw away the decoded string. This works perfectly well a far as I know: environ['PATH_INFO'] = environ['PATH_INFO'].\ encode('utf8', 'surrogateescape').\ decode(SITE_ENCODING) utf8+surrogateescape provides the same possibilities as wsgi.uri_encoding. You can transcode without losing information when you know what the correct encoding is. But utf8+surrogateescape is simpler because there's no need to pass around the name of the encoding in an additional variable. -- Henry PrĂȘcheur _______________________________________________ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com