Re: [Web-SIG] CGI WSGI and Unicode
Manlio Perillo wrote:

> In a CGI application, HTTP headers are Unicode strings, and are decoded using system default encoding.
> In a future WSGI application, HTTP headers are Unicode strings, and are decoded using latin-1 encoding.

Yes. As proposed, WSGI 1.1 would require CGI-to-WSGI handler to undo the decode stage caused by reading environ using the default encoding. At least this is now reliably possible thanks to surrogateescape.

PATH_INFO is the only really important HTTP-related environment variable for Unicode. Potentially SCRIPT_NAME could also be significant in relation to PATH_INFO. The HTTP headers don't massively matter because there are almost never any non-ASCII characters in them.

Previously the job of undoing an unwanted decode step was dumped on whatever read the PATH_INFO; usually a routing component, which would have to make guesses with typically poor results. The CGI adapter is in a much better place to do it, being closer to the server.

> The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode strings, not ready-to-use Unicode strings. It is up to the application to decide how they want to handle those bytes; maybe they want Latin-1 and can do nothing, maybe they want to recode to UTF-8, maybe something else completely. No solution satisfies every app so there is always going to have to be a recode step somewhere. An application that doesn't want to think about this will use a framework that does it for them.

> What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right' Unicode string for more browsers than any other potential encoding. In any case as previously discussed, non-ASCII cookies are already totally broken everywhere and hence used by no-one.
-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
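The "undo the decode" step described in the message above could be sketched roughly as follows. This is only an illustration, not code from any existing adapter: the helper names `environ_to_wsgi_str` and `fix_cgi_environ` are hypothetical, the choice of which keys to fix is an assumption, and it assumes a POSIX system where Python builds os.environ with the filesystem encoding and the 'surrogateescape' error handler.

```python
import sys


def environ_to_wsgi_str(value, fs_encoding=None):
    """Turn an os.environ string back into a 'bytes-as-latin-1' string.

    Hypothetical helper: assumes `value` was decoded with `fs_encoding`
    and the 'surrogateescape' error handler.
    """
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    # Undo the decode Python applied when it built os.environ ...
    raw = value.encode(fs_encoding, "surrogateescape")
    # ... then map every byte to the code point with the same number.
    return raw.decode("latin-1")


def fix_cgi_environ(environ, fs_encoding=None):
    """Return a copy of `environ` with the HTTP-related values re-decoded.

    Which keys to treat this way is an assumption made for this sketch.
    """
    fixed = dict(environ)
    for key, value in environ.items():
        if key in ("PATH_INFO", "SCRIPT_NAME", "QUERY_STRING") or key.startswith("HTTP_"):
            fixed[key] = environ_to_wsgi_str(value, fs_encoding)
    return fixed
```

A bridge would presumably call something like `fix_cgi_environ(os.environ)` once, before handing the dictionary to the WSGI application, so that the routing component no longer has to guess.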
Re: [Web-SIG] CGI WSGI and Unicode
Graham Dumpleton wrote:

Note: I'm sending the entire message to the mailing list.

> 2009/12/7 Manlio Perillo manlio_peri...@libero.it:
>> Hi.
>>
>> I'm playing with Python 3.x, current revision.
>> I have noted that the data in os.environ are now Unicode strings.
>>
>> In a CGI application, HTTP headers are Unicode strings, and are decoded using system default encoding.
>> In a future WSGI application, HTTP headers are Unicode strings, and are decoded using latin-1 encoding.
>>
>> In both cases, 'surrogateescape' is used.
>
> No, 'surrogateescape' is not necessary when using latin-1, or at least for variables which use latin-1.

The problem is that not all browsers use latin-1. HTTP Digest authentication is one example.

> Use of 'surrogateescape' is only relevant in the context of some web servers and only relevant for specific variables, some of which aren't even part of the set of variables which are required by WSGI. For example, in Apache/mod_wsgi, 'surrogateescape' is used on DOCUMENT_ROOT and SCRIPT_FILENAME.

What about HTTP_COOKIE?

>> [...]
>> Can this cause troubles and incompatibility problems?
>> I'm interested in special header handling, like cookies, that contain opaque data.
>
> The issues which a CGI/WSGI bridge has in Python 3.X have been discussed previously on the list.

It seems I missed it.

> It is acknowledged that there are problems to be solved there, at least to the extent that a CGI/WSGI bridge implementation has to correct the encoding, and also that that may only be solvable in Python 3.1 onwards due to not having access to what encoding was used for environment variables in Python 3.0.
>
> Not many people care about CGI these days and so no one has bothered to come up with a working CGI/WSGI bridge for Python 3.X.

CGI is very important; there are some kinds of web applications that have problems when executing in a long-running process. As an example, I prefer to run Trac and Mercurial instances as CGI.
> Graham

Regards
Manlio
Re: [Web-SIG] CGI WSGI and Unicode
2009/12/7 Manlio Perillo manlio_peri...@libero.it:

> Graham Dumpleton wrote:
>
> Note: I'm sending the entire message to the mailing list.
>
>> 2009/12/7 Manlio Perillo manlio_peri...@libero.it:
>>> Hi.
>>>
>>> I'm playing with Python 3.x, current revision.
>>> I have noted that the data in os.environ are now Unicode strings.
>>>
>>> In a CGI application, HTTP headers are Unicode strings, and are decoded using system default encoding.
>>> In a future WSGI application, HTTP headers are Unicode strings, and are decoded using latin-1 encoding.
>>>
>>> In both cases, 'surrogateescape' is used.
>>
>> No, 'surrogateescape' is not necessary when using latin-1, or at least for variables which use latin-1.
>
> The problem is that not all browsers use latin-1. HTTP Digest authentication is one example.

You seem to miss one important point. When converting bytes to unicode as latin-1, the surrogate escape mechanism never comes into play. This is because all byte values can be represented in latin-1, due to it being a single-byte encoding which preserves the original bytes intact.

>> Use of 'surrogateescape' is only relevant in the context of some web servers and only relevant for specific variables, some of which aren't even part of the set of variables which are required by WSGI. For example, in Apache/mod_wsgi, 'surrogateescape' is used on DOCUMENT_ROOT and SCRIPT_FILENAME.
>
> What about HTTP_COOKIE?

You trimmed part of my response which is very important. DOCUMENT_ROOT and SCRIPT_FILENAME must be dealt with per the filesystem encoding and not latin-1. If you don't, the result may not be compatible with input to the file system routines in Python 3.1, which sort of expect file system encoding plus surrogate escape.

As I say though, those variables aren't relevant to most WSGI hosting mechanisms, and even for those where the web server provides them, nearly all WSGI applications will not care about them.
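The point about latin-1 never needing surrogate escapes can be checked directly. The following is a small sketch, using UTF-8 as a stand-in for a filesystem encoding (an assumption made for the example; the real filesystem encoding depends on the platform and locale):

```python
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding can never fail and
# never has to produce a surrogate escape.
as_text = all_bytes.decode("latin-1")
latin1_round_trips = as_text.encode("latin-1") == all_bytes
latin1_has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in as_text)

# With an encoding such as UTF-8, an invalid byte only survives a
# round trip if 'surrogateescape' is used on both decode and encode.
path_bytes = b"/srv/caf\xe9"  # hypothetical SCRIPT_FILENAME bytes, not valid UTF-8
escaped = path_bytes.decode("utf-8", "surrogateescape")
utf8_round_trips = escaped.encode("utf-8", "surrogateescape") == path_bytes
```

Here `escaped` contains the lone surrogate U+DCE9 in place of the invalid byte, which is the form the Python 3.1 file system routines are described as expecting, whereas the latin-1 string contains no surrogates at all.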
In Apache/mod_wsgi I worry about DOCUMENT_ROOT and SCRIPT_FILENAME because Apache/mod_wsgi provides features which allow one to define Apache-style handlers based on file type, where the handler for the arbitrary file type is implemented as a WSGI application. In that case the file the URL mapped to, i.e. SCRIPT_FILENAME, is an arbitrary file and not a WSGI script file.

In the case of HTTP_COOKIE, as far as the WSGI adapter goes it just converts it to unicode as per latin-1. So, it is washing its hands of what to do with it, because it cannot know and only the WSGI application can. Because it is latin-1, no surrogate escape is involved. In the WSGI application, where it knows what encoding may be used, the WSGI application can convert back to bytes and to a different encoding, using surrogate escape if it wants to ensure there is no outright error if the bytes can't be represented in that alternate encoding.

>>> [...]
>>> Can this cause troubles and incompatibility problems?
>>> I'm interested in special header handling, like cookies, that contain opaque data.
>>
>> The issues which a CGI/WSGI bridge has in Python 3.X have been discussed previously on the list.
>
> It seems I missed it.
>
>> It is acknowledged that there are problems to be solved there, at least to the extent that a CGI/WSGI bridge implementation has to correct the encoding, and also that that may only be solvable in Python 3.1 onwards due to not having access to what encoding was used for environment variables in Python 3.0.
>>
>> Not many people care about CGI these days and so no one has bothered to come up with a working CGI/WSGI bridge for Python 3.X.
>
> CGI is very important; there are some kinds of web applications that have problems when executing in a long-running process. As an example, I prefer to run Trac and Mercurial instances as CGI.

Yes, I agree that there are some valid uses of a CGI/WSGI bridge, although those two aren't the ones I would have in mind.
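The application-side conversion described for HTTP_COOKIE earlier in this message might look something like the sketch below. The function name `recode` and the sample cookie values are made up for illustration; the only assumption about the adapter is that it decoded the header bytes as latin-1.

```python
def recode(value, encoding="utf-8", errors="surrogateescape"):
    """Reinterpret a latin-1-decoded WSGI string in another encoding.

    Hypothetical helper: `value` must contain only code points up to
    U+00FF, which holds for strings produced by a latin-1 decode.
    """
    # Recover the original bytes exactly, then decode them the way the
    # application believes the client encoded them.
    return value.encode("latin-1").decode(encoding, errors)


# Made-up environ values: the first holds the UTF-8 bytes for an
# accented character, the second a byte that is not valid UTF-8.
utf8_cookie = recode("name=caf\xc3\xa9")
odd_cookie = recode("name=caf\xe9")
```

With the default 'surrogateescape' handler the second call does not raise; passing `errors="strict"` instead would raise UnicodeDecodeError, which is the "outright error" the message refers to.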
For the record, CGI/WSGI adapters should also protect the original stdin/stdout so that a WSGI application doesn't cause problems by using 'print' or doing other odd stuff with input. I haven't seen a single CGI/WSGI adapter which does it in a way that I would say is correct, or at least robust against users doing stupid things, so encoding issues aren't the only thing where CGI/WSGI adapters need work.

Graham
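One possible way an adapter could protect the original streams is sketched below. This is not taken from any existing adapter and makes no claim to be the robust solution the message asks for; `protected_stdio` is a hypothetical name, and redirecting stray output to sys.stderr is an assumption about what a server's error log expects.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def protected_stdio(stdin=None, stdout=None):
    """Give the adapter the real CGI streams and hide them from the app."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # Prefer the underlying binary streams when the objects expose them.
    real_in = getattr(stdin, "buffer", stdin)
    real_out = getattr(stdout, "buffer", stdout)
    saved = sys.stdin, sys.stdout
    sys.stdin = io.StringIO()  # application code reading sys.stdin sees EOF
    sys.stdout = sys.stderr    # stray print() calls no longer reach the response
    try:
        yield real_in, real_out
    finally:
        sys.stdin, sys.stdout = saved
```

The adapter would run the WSGI application inside the `with` block, reading the request body only from the first yielded stream and writing the response only to the second.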
Re: [Web-SIG] CGI WSGI and Unicode
--- On Mon, 12/7/09, Graham Dumpleton graham.dumple...@gmail.com wrote:

> For the record, CGI/WSGI adapters should also protect the original stdin/stdout so WSGI application doesn't cause problems by using 'print' or do other odd stuff with input. I haven't seen a single CGI/WSGI adapter which does it in a way that I would say is correct, or at least robust against users doing stupid things...

There is no fool proof software: fools are too clever

Doctor, it hurts when I do this.
Don't do that.

Some words of wisdom from folklore... (or if anyone knows the correct attribution, please inform).

-- Aaron Watters
http://listtree.appspot.com
http://whiffdoc.appspot.com
===
an apple every 8 hours will keep 3 doctors away. -- kliban