Re: [Web-SIG] CGI WSGI and Unicode

2009-12-08 Thread And Clover

Manlio Perillo wrote:


In a CGI application, HTTP headers are Unicode strings, and are decoded
using system default encoding.



In a future WSGI application, HTTP headers are Unicode strings, and are
decoded using latin-1 encoding.


Yes. As proposed, WSGI 1.1 would require a CGI-to-WSGI handler to undo the 
decode stage caused by reading environ using the default encoding. At 
least this is now reliably possible thanks to surrogateescape.
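
A minimal sketch of what such a handler might do for one value, assuming the environment was decoded with the filesystem encoding plus surrogateescape (the helper name cgi_value_to_wsgi and the sample path are made up for illustration):

```python
import sys

def cgi_value_to_wsgi(value, encoding=None):
    """Undo the environ decode and re-expose the raw bytes as latin-1 text."""
    # Assumption: the value was decoded with this encoding and surrogateescape.
    encoding = encoding or sys.getfilesystemencoding()
    raw = value.encode(encoding, "surrogateescape")
    # latin-1 maps every byte to the code point with the same number.
    return raw.decode("latin-1")

# Simulate what os.environ would hold under a UTF-8 locale.
raw_path = b"/caf\xc3\xa9/\xff"
decoded = raw_path.decode("utf-8", "surrogateescape")
wsgi_path = cgi_value_to_wsgi(decoded, "utf-8")
```

The original bytes, including the invalid 0xFF, can then be recovered from wsgi_path with .encode("latin-1").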


PATH_INFO is the only really important HTTP-related environment variable 
for Unicode. Potentially SCRIPT_NAME could also be significant in 
relation to PATH_INFO. The HTTP headers don't massively matter because 
there are almost never any non-ASCII characters in them.


Previously the job of undoing an unwanted decode step was dumped on 
whatever read the PATH_INFO; usually a routing component, which would 
have to make guesses with typically poor results. The CGI adapter is in 
a much better place to do it, being closer to the server.


 The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode 
strings, not ready-to-use Unicode strings. It is up to the application 
to decide how they want to handle those bytes; maybe they want Latin-1 
and can do nothing, maybe they want to recode to UTF-8, maybe something 
else completely. No solution satisfies every app so there is always 
going to have to be a recode step somewhere.
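
For illustration only, the recode step in an application that wants UTF-8 could look roughly like this (recode is a hypothetical helper, not part of any WSGI proposal):

```python
def recode(value, target="utf-8", errors="replace"):
    """Turn a latin-1 'bytes in a str' value into real text in the target encoding."""
    return value.encode("latin-1").decode(target, errors)

# What the application would receive for the bytes b"/caf\xc3\xa9".
path_info = b"/caf\xc3\xa9".decode("latin-1")
path = recode(path_info)
```

An application that actually wants Latin-1 would skip the call entirely.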


An application that doesn't want to think about this will use a 
framework that does it for them.


 What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right' 
Unicode string for more browsers than any other potential encoding.


In any case as previously discussed, non-ASCII cookies are already 
totally broken everywhere and hence used by no-one.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/

___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] CGI WSGI and Unicode

2009-12-07 Thread Manlio Perillo
Graham Dumpleton wrote:

Note: I'm sending the entire message to the mailing list.

 2009/12/7 Manlio Perillo manlio_peri...@libero.it:
 Hi.

 I'm playing with Python 3.x, current revision.

 I have noticed that the data in os.environ are now Unicode strings.

 In a CGI application, HTTP headers are Unicode strings, and are decoded
 using system default encoding.
 In a future WSGI application, HTTP headers are Unicode strings, and are
 decoded using latin-1 encoding.

 In both cases, 'surrogateescape' is used.
 
 No, 'surrogateescape' is not necessary when using latin-1, or at least
 for variables which use latin-1.
 

The problem is that not all browsers use latin-1,
for example with HTTP Digest authentication.

 Use of 'surrogateescape' is only relevant in the context of some web
 servers and only relevant for specific variables, some of which aren't
 even part of set of variables which are required by WSGI.
 
 For example, in Apache/mod_wsgi, 'surrogateescape' is used on
 DOCUMENT_ROOT and SCRIPT_FILENAME. 

What about HTTP_COOKIE?

 [...] 
 Can this cause troubles and incompatibility problems?
 I'm interested in special header handling, like cookies, that contain
 opaque data.
 
 The issues with a CGI/WSGI bridge in Python 3.X have been discussed
 previously on the list. 

It seems I missed it.

 It is acknowledged that there are problems to
 be solved there, at least to the extent that a CGI/WSGI bridge
 implementation has to correct the encoding, and also that this may
 only be solvable in Python 3.1 onwards, because Python 3.0 gives no
 access to the encoding that was used for environment variables. Not
 many people care about CGI these days, so no one has bothered to
 come up with a working CGI/WSGI bridge for Python 3.X.
 

CGI is very important; there are some kinds of web applications that have
problems when executing in a long-running process.

As an example, I prefer to run Trac and Mercurial instances as CGI.

 Graham


Regards  Manlio


Re: [Web-SIG] CGI WSGI and Unicode

2009-12-07 Thread Graham Dumpleton
2009/12/7 Manlio Perillo manlio_peri...@libero.it:
 Graham Dumpleton wrote:

 Note: I'm sending the entire message to the mailing list.

 2009/12/7 Manlio Perillo manlio_peri...@libero.it:
 Hi.

 I'm playing with Python 3.x, current revision.

 I have noticed that the data in os.environ are now Unicode strings.

 In a CGI application, HTTP headers are Unicode strings, and are decoded
 using system default encoding.
 In a future WSGI application, HTTP headers are Unicode strings, and are
 decoded using latin-1 encoding.

 In both cases, 'surrogateescape' is used.

 No, 'surrogateescape' is not necessary when using latin-1, or at least
 for variables which use latin-1.


 The problem is that not all browsers use latin-1,
 for example with HTTP Digest authentication.

You seem to miss one important point. When converting bytes to unicode
as latin-1, the surrogate escape mechanism never comes into play. This
is because all byte values can be represented in latin-1, due to it being
a single-byte encoding which preserves the original bytes intact.
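
A short check of that point, as a sketch: every byte value decodes under latin-1 without the error handler ever being invoked, whereas UTF-8 has to escape an invalid byte.

```python
# Every possible byte value, decoded as latin-1.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1", "surrogateescape")

# surrogateescape maps undecodable bytes to U+DC80..U+DCFF; none appear here.
has_surrogates = any(0xDC80 <= ord(c) <= 0xDCFF for c in text)

# The round trip is lossless.
round_tripped = text.encode("latin-1")

# Contrast: 0xFF is not valid UTF-8, so the handler does kick in.
utf8_text = b"\xff".decode("utf-8", "surrogateescape")
```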

 Use of 'surrogateescape' is only relevant in the context of some web
 servers and only relevant for specific variables, some of which aren't
 even part of set of variables which are required by WSGI.

 For example, in Apache/mod_wsgi, 'surrogateescape' is used on
 DOCUMENT_ROOT and SCRIPT_FILENAME.

 What about HTTP_COOKIE?

You trimmed part of my response which is very important. For
DOCUMENT_ROOT and SCRIPT_FILENAME they must be dealt with per the
filesystem encoding and not latin-1. If you don't, the result may not
be compatible with input to file system routines in Python 3.1 which
sort of expect file system encoding plus surrogate escape.
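
As a rough sketch of that handling on a POSIX system (the helper names and the sample path are invented for illustration; this is not mod_wsgi's actual code):

```python
import sys

FS_ENCODING = sys.getfilesystemencoding()

def path_from_server(raw):
    """Decode a raw path the way file system routines expect it."""
    return raw.decode(FS_ENCODING, "surrogateescape")

def path_to_bytes(path):
    """Recover the exact bytes the server supplied."""
    return path.encode(FS_ENCODING, "surrogateescape")

# Hypothetical SCRIPT_FILENAME with bytes that may not be valid in FS_ENCODING.
raw_script_filename = b"/srv/www/r\xe9sum\xe9.txt"
script_filename = path_from_server(raw_script_filename)
round_tripped = path_to_bytes(script_filename)
```

Later Python releases (3.2 onwards) offer os.fsdecode() and os.fsencode() for the same purpose.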

As I say though, those variables aren't relevant to most WSGI hosting
mechanisms, and even where the web server provides them, nearly all
WSGI applications will not care about them. Apache/mod_wsgi has to
worry about them because it provides features which allow one to
define Apache-style handlers based on file type, where the handler
for the arbitrary file type is implemented as a WSGI application. In
that case the file the URL mapped to, i.e. SCRIPT_FILENAME, is an
arbitrary file and not a WSGI script file.

In the case of HTTP_COOKIE, the WSGI adapter just converts it to
unicode as latin-1. It is washing its hands of what to do with it,
because it cannot know and only the WSGI application can. Because the
decode is latin-1, no surrogate escape is involved. The WSGI
application, which knows what encoding may be in use, can then convert
back to bytes and decode with a different encoding, using surrogate
escape if it wants to ensure there is no outright error when the
bytes can't be represented in that alternate encoding.
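
On the application side this could be sketched as follows, assuming the application expects UTF-8 cookies (decode_header and the cookie value are illustrative only):

```python
def decode_header(value, encoding="utf-8"):
    """Re-decode a latin-1 header value; bad bytes become lone surrogates."""
    return value.encode("latin-1").decode(encoding, "surrogateescape")

# What the adapter hands over: raw bytes exposed as latin-1 text.
raw_cookie = b"name=caf\xc3\xa9; blob=\xff\xfe"
environ = {"HTTP_COOKIE": raw_cookie.decode("latin-1")}

cookie = decode_header(environ["HTTP_COOKIE"])
# The opaque bytes are still recoverable from the decoded string.
recovered = cookie.encode("utf-8", "surrogateescape")
```

The valid UTF-8 part decodes normally, while the two opaque bytes survive as escapes instead of raising UnicodeDecodeError.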

 [...]
 Can this cause troubles and incompatibility problems?
 I'm interested in special header handling, like cookies, that contain
 opaque data.

 The issues with a CGI/WSGI bridge in Python 3.X have been discussed
 previously on the list.

 It seems I missed it.

 It is acknowledged that there are problems to
 be solved there, at least to the extent that a CGI/WSGI bridge
 implementation has to correct the encoding, and also that this may
 only be solvable in Python 3.1 onwards, because Python 3.0 gives no
 access to the encoding that was used for environment variables. Not
 many people care about CGI these days, so no one has bothered to
 come up with a working CGI/WSGI bridge for Python 3.X.


 CGI is very important; there are some kinds of web applications that have
 problems when executing in a long-running process.

 As an example, I prefer to run Trac and Mercurial instances as CGI.

Yes, I agree that there are some valid uses of a CGI/WSGI bridge, although
those two aren't the ones I would have in mind.

For the record, CGI/WSGI adapters should also protect the original
stdin/stdout so that the WSGI application doesn't cause problems by using
'print' or doing other odd stuff with input. I haven't seen a single
CGI/WSGI adapter which does it in a way that I would say is correct,
or at least robust against users doing stupid things, so encoding
issues aren't the only thing where CGI/WSGI adapters need work.
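
One possible shape for that protection, as a sketch only (protect_stdio is a hypothetical helper, and it guards only Python-level access): the adapter keeps private duplicates of the real descriptors and points sys.stdin/sys.stdout somewhere harmless.

```python
import os
import sys

def protect_stdio(in_fd=0, out_fd=1):
    """Return private binary streams for the adapter and neutralise sys.stdin/stdout."""
    # Private duplicates that only the adapter uses for request and response.
    wsgi_input = os.fdopen(os.dup(in_fd), "rb")
    wsgi_output = os.fdopen(os.dup(out_fd), "wb")
    # Reads from sys.stdin now see EOF; stray print() output goes to stderr.
    sys.stdin = open(os.devnull, "r")
    sys.stdout = sys.stderr
    return wsgi_input, wsgi_output
```

Code that writes directly to file descriptor 1 would still reach the response; a more thorough adapter would presumably also redirect the descriptors themselves with os.dup2().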

Graham


Re: [Web-SIG] CGI WSGI and Unicode

2009-12-07 Thread Aaron Watters


--- On Mon, 12/7/09, Graham Dumpleton graham.dumple...@gmail.com wrote:

 For the record, CGI/WSGI adapters should also protect the
 original
 stdin/stdout so WSGI application doesn't cause problems by
 using
 'print' or do other odd stuff with input. I haven't seen a
 single
 CGI/WSGI adapter which does it in a way that I would say is
 correct,
 or at least robust against users doing stupid things...

There is no fool proof software: fools are too clever
Doctor, it hurts when I do this.  Don't do that.

Some words of wisdom from folklore... (or if anyone knows
the correct attribution, please inform).
   -- Aaron Watters
  http://listtree.appspot.com
  http://whiffdoc.appspot.com

===
an apple every 8 hours
will keep 3 doctors away.  -- kliban

