Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.


See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.
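
A rough illustration of why such a cycle is lossy. This is only a sketch: it assumes the server decodes with the Windows-1252 codepage and substitutes for undefined bytes, which is a guess rather than a reproduction of what IIS or the NT runtime really does. The helper name lossy_round_trip is made up for the example.

```python
def lossy_round_trip(raw, codepage='cp1252'):
    """Decode bytes with a legacy codepage, then encode them back.

    Assumption: undefined bytes are replaced rather than rejected.
    """
    text = raw.decode(codepage, errors='replace')
    return text.encode(codepage, errors='replace')


# UTF-8 bytes for '/' followed by HIRAGANA LETTER A (U+3042).
raw = '/\u3042'.encode('utf-8')
mangled = lossy_round_trip(raw)

# Byte 0x81 has no meaning in cp1252, so it cannot survive the cycle.
print(raw, mangled, raw == mangled)
```

Pure ASCII paths pass through unchanged, which is why the problem only shows up once non-ASCII URIs are tested.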


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode. In other words, the values will be 
unicode, but that will simply be a lie.


Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example 
wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
it were working. (It is currently broken in 3.0rc2; I put a hack in to 
get it running but I'm not really sure what the current status of 
simple_server in 3.0 is.)
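
For reference, the "latin-1 as bytes in disguise" idea can be sketched as below (Python 3). It assumes the server really did decode the raw bytes as ISO-8859-1, which is exactly what is not guaranteed today; recover_path is a hypothetical helper name, and UTF-8 is only a default guess at the real charset.

```python
def recover_path(value, charset='utf-8'):
    """Turn a latin-1 'smuggled' environ string back into real text.

    Assumes `value` was produced by decoding the raw bytes as
    ISO-8859-1, which maps bytes 0-255 onto code points 0-255.
    """
    raw = value.encode('latin-1')          # exact original bytes
    return raw.decode(charset, errors='replace')


wire_bytes = 'caf\u00e9'.encode('utf-8')   # bytes after %-decoding
smuggled = wire_bytes.decode('latin-1')    # what would sit in environ
print(recover_path(smuggled))
```

The round trip is lossless for every byte value, which is what makes ISO-8859-1 usable here where UTF-8 is not.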


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
spec itself.


Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.


I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.


I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
pre-URI-decoding, containing only %-sequences and not real high bytes, 
so it can be decoded to Unicode using any old charset without worry.


An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework 
and supporting PATH_INFO in a portable fashion already has warts thanks 
to IIS's bugs, so the situation is not much worse than it already is.
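
A minimal sketch of that sniffing, as a framework might do it. Assumptions: REQUEST_URI, when present, is the raw path plus optional query string as Apache provides it; the fallback treats SCRIPT_NAME/PATH_INFO as latin-1-decoded bytes, which will not hold on every server; request_path is a made-up name.

```python
from urllib.parse import unquote


def request_path(environ, charset='utf-8'):
    """Return the decoded request path, preferring REQUEST_URI."""
    uri = environ.get('REQUEST_URI')
    if uri is not None:
        # Only %-sequences here, so no charset guessing has happened yet.
        path = uri.split('?', 1)[0]
        return unquote(path, encoding=charset, errors='replace')
    # Fallback (assumption: values are bytes smuggled through latin-1).
    raw = environ.get('SCRIPT_NAME', '') + environ.get('PATH_INFO', '')
    return raw.encode('latin-1', errors='replace').decode(charset, errors='replace')


print(request_path({'REQUEST_URI': '/app/caf%C3%A9?x=1'}))
```

Note that unquoting the whole path turns %2F into a plain slash; an application that cares about encoded slashes would split on '/' before unquoting each segment.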


And of course we get support through mod_cgi and mod_wsgi automatically, 
so Graham doesn't have to do anything. :-)


Graham Dumpleton wrote:


I can't really remember what the outcome of the discussion was.


Not too much outcome really, unfortunately! You concluded:


there possibly still is an open question there on how
encoding of non ascii characters works in practice. We just need to
do some actual tests to see what happens and whether there is a problem. 


...to which the answer is — judging by the results posted — probably 
“yes”, I'm afraid!


--
And Clover
http://www.doxdesk.com/
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Ian Bicking

Andrew Clover wrote:

Ian Bicking wrote:

As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.


See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.


This is something messed up with CGI on NT, and whatever server you are 
using, and perhaps the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?) -- it's mostly 
irrelevant to WSGI itself.


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode. In other words, the values will be 
unicode, but that will simply be a lie.


Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example 
wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
it were working. (It is currently broken in 3.0rc2; I put a hack in to 
get it running but I'm not really sure what the current status of 
simple_server in 3.0 is.)


As far as I know, PJE just made the suggestion about Latin-1; I don't 
know if anything has actually been done in wsgiref or elsewhere to 
implement that.  Honestly I don't know if anyone is doing anything with 
WSGI and Python 3.


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the 
WSGI spec itself.


Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.


This will presumably require hacks that might be system-dependent. 
Probably the current CGI adapter will just have to be a bit more 
complicated.  Also, if Python is utf8-decoding the environment, we'll 
just have to shortcut that entirely, as you can't just undo utf8.  I 
assume there is some way to get at the bytes in the environment, if not 
then that is a Python 3 bug.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.


I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.


The encoding of the operating system (which presumably informs the 
encoding of os.environ) has nothing to do with the encoding of the web 
application.  For the CGI adapter we simply need to find a way to ignore 
the system encoding.


I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
pre-URI-decoding, containing only %-sequences and not real high bytes, 
so it can be decoded to Unicode using any old charset without worry.


Unfortunately REQUEST_URI doesn't map directly to SCRIPT_NAME/PATH_INFO. 
I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on), but 
moving from the two keys to a single REQUEST_URI is not feasible.


It's not that trivial to figure out where in REQUEST_URI the 
SCRIPT_NAME/PATH_INFO boundary really is, as there are many ways the 
unencoded values could be encoded.  I guess you'd probably count 
segments, try to catch %2f (where the segments won't match up), and then 
double check that the decoded REQUEST_URI matches SCRIPT_NAME+PATH_INFO.


An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework 
and supporting PATH_INFO in a portable fashion already has warts thanks 
to IIS's bugs, so the situation is not much worse than it already is.


I use the distinction between SCRIPT_NAME and PATH_INFO extensively. 
And frankly IIS is probably less relevant to most developers than CGI. 
Anyway, any of these bugs are things that need to be fixed in the WSGI 
adapter, we must not let them propagate into the specification or 
applications.  So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.


And of course we get support through mod_cgi and mod_wsgi automatically, 
so 

Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

This is something messed up with CGI on NT, and whatever server you are 
using, and perhaps the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?)


Python decodes the environ to its own copy (wrapped in os.environ) at 
interpreter startup time; there's no way to query the real ‘live’ 
environment that I know of. It'd require a C extension.


Honestly I don't know if anyone is doing anything with 
WSGI and Python 3.


I know Graham has done some work on mod_wsgi for 3.0, but no, I don't 
know anyone using it in anger.


Is it worth submitting patches to simple_server to make it run on 3.0? 
Is it too late to include at this stage anyway? Shipping 3.0 with a 
non-functional wsgiref is a bit embarrassing.


I assume there is some way to get at the bytes in the environment, if not 
then that is a Python 3 bug.


There is not, and this appears to be deliberate.


I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on),
moving from the two keys to a single REQUEST_URI is not feasible.


That's certainly a possibility, but I feel it's easier to hitch a ride 
on the existing header, which despite being non-standard is still quite 
widely used.



I guess you'd probably count segments, try to catch %2f (where the
segments won't match up), and then double check that the decoded
REQUEST_URI matches SCRIPT_NAME+PATH_INFO.


I'm currently testing with just the segment counting. It's only 
necessary that the segments from SCRIPT_NAME are matched and stripped, 
and those are extremely unlikely to contain ‘%2F’ because:


  - there aren't many filesystems that can accept ‘/’ as a filename
    character. RISC OS is the only one I can think of, and it by
    convention swaps ‘/’ and ‘.’ to compensate as it is, so even
    there you couldn't use ‘%2F’;
  - there aren't many webservers that can map a file or alias to a
    path containing ‘%2F’;
  - no-one wants to mount a webapp alias at such a weird name: it's
    only in the section corresponding to PATH_INFO that ‘%2F’ might
    ever be of use in practice.

In the worst case, many applications already know and can strip the URL 
at which they're mounted, but unless there's a legitimate ‘%2F’ in their 
SCRIPT_NAME it doesn't actually matter.
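
Roughly, the segment counting looks like the sketch below. It is not the code I'm testing with, just an illustration; it assumes SCRIPT_NAME has no trailing slash and contains no ‘%2F’, and split_request_uri is a hypothetical name.

```python
from urllib.parse import unquote


def split_request_uri(request_uri, script_name, charset='utf-8'):
    """Split REQUEST_URI into the mount point and decoded path segments."""
    path = request_uri.split('?', 1)[0]
    segments = path.split('/')           # '/app/x' -> ['', 'app', 'x']
    # '/app' -> ['', 'app'] -> 2 segments; '' -> [''] -> 1 segment.
    count = len(script_name.split('/'))
    script = '/'.join(segments[:count])  # still %-encoded, as received
    # Unquote each segment separately so an encoded slash stays inside it.
    path_info = [unquote(s, encoding=charset, errors='replace')
                 for s in segments[count:]]
    return script, path_info


print(split_request_uri('/app/a%2Fb/caf%C3%A9?q=1', '/app'))
```

Returning PATH_INFO as a list of segments rather than a joined string is what keeps ‘a%2Fb’ distinguishable from ‘a/b’.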


frankly IIS is probably less relevant to most developers than CGI. 


Er... really?

You and I may not favour it, but it's ≈35% of the world out there, not 
something we can afford to ignore IMO.


So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.


What I'm saying is that neither Apache's nor IIS's behaviour can be 
considered clearly correct or wrong at this point, and there is no way a 
WSGI adapter living underneath them *can* fix up the differences.


(There is a problem with PATH_INFO that a WSGI adapter *could* clear 
up, which is that IIS makes PATH_INFO the entire path including 
SCRIPT_NAME. I'm not sure whether it's worth fixing that up in the 
adapter layer though... it's possible some frameworks are already 
dealing with it, and might even be relying on it!)
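
If an adapter did want to fix that up, a sketch might look like this. Assumptions: the adapter detects IIS by a SERVER_SOFTWARE value starting with ‘Microsoft-IIS’, and PATH_INFO really does begin with SCRIPT_NAME there; fix_iis_path_info is a made-up name.

```python
def fix_iis_path_info(environ):
    """Return PATH_INFO with a leading SCRIPT_NAME removed, on IIS only."""
    path_info = environ.get('PATH_INFO', '')
    script_name = environ.get('SCRIPT_NAME', '')
    # Assumption: only IIS needs this; elsewhere a PATH_INFO that happens
    # to start with SCRIPT_NAME is legitimate and must be left alone.
    if not environ.get('SERVER_SOFTWARE', '').startswith('Microsoft-IIS'):
        return path_info
    if script_name and path_info.startswith(script_name):
        rest = path_info[len(script_name):]
        # Only strip on a segment boundary ('/app' must not eat '/apple').
        if rest == '' or rest.startswith('/'):
            return rest
    return path_info


print(fix_iis_path_info({
    'SERVER_SOFTWARE': 'Microsoft-IIS/6.0',
    'SCRIPT_NAME': '/app.py',
    'PATH_INFO': '/app.py/some/path',
}))
```

A framework that already compensates for the IIS behaviour would end up stripping twice, which is the reason for my hesitation about doing this in the adapter.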


--
And Clover
http://www.doxdesk.com/