Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-18 Thread Andrew Clover

ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)


Hmm... it turns out: no. IIS appears to be mangling characters that are 
not in mbcs even *before* it puts the decoded value into the envvars.


The same is true with isapi_wsgi, which is the only other WSGI adapter I 
know of for IIS. This gets the same mangled byte string from 
GetServerVariable as Python gets from the envvars, so it looks like this 
is a mistake IIS is making further up before it even hits the CGI 
handler. Maybe someone more familiar with ISAPI knows a better way to 
read PATH_INFO than GetServerVariable, but I can't see anything 
promising in MSDN.


So it would seem to be impossible at the moment to have Unicode paths 
work under IIS at all.


The ctypes approach could rescue bytes for the Apache/nt/Py2 combination 
(perhaps also from libc.getenv for Apache/posix/Py3), but then Apache 
already gives us REQUEST_URI which is a much easier workaround. There 
might be CGI servers for Windows where ctypes could serve some purpose, 
but I can't think of any currently in use other than the Big Two.


In summary, to get the original submitted byte strings for PATH_INFO:

Apache/nt/Py2
    process REQUEST_URI
Apache/posix/Py2
    use PATH_INFO directly
    (or process REQUEST_URI)
Apache/nt/Py3
    encode PATH_INFO to ISO-8859-1
    (or process REQUEST_URI)
Apache/posix/Py3
    process REQUEST_URI
IIS/nt/Py2
    decode PATH_INFO from mbcs, then encode to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
IIS/nt/Py3
    encode PATH_INFO to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
wsgiref.simple_server/Py2
    use PATH_INFO directly
wsgiref.simple_server/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input
cherrypy.wsgiserver/Py2
    use PATH_INFO directly
cherrypy.wsgiserver/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input

--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-17 Thread Andrew Clover

Mark Hammond wrote:


I don't think Python explicitly converts it - the CRT's ANSI version
of environ is used


Yes, it would be the CRT on Python 2.x. (Python 3.0 on non-NT does a 
conversion always using UTF-8, if I'm reading convertenviron right.)



so the resulting strings should be encoded using the 'mbcs' encoding.
What mangling do you see?


Correct, it's characters unencodable in mbcs that are lost*. mbcs is 
never equivalent to UTF-8 (which would allow us to recover characters on 
IIS) or ISO-8859-1 (which would allow us to recover characters on 
Apache-for-Windows) so there's always heavy lossage.


(* - replaced with ? or Windows's attempt to substitute something that 
looks vaguely like the original character.)
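
The lossage can be reproduced without a Windows machine. The following is a minimal sketch for a current Python 3, assuming cp1252 as a stand-in for the system codepage (the 'mbcs' codec only exists on Windows) and Python's 'replace' error handler as a stand-in for the "?" substitution. The helper name is hypothetical, and Windows's look-alike substitution is not modelled.

```python
# cp1252 stands in for the Windows system codepage; this is an assumption
# for illustration, since the 'mbcs' codec is only available on Windows.
CODEPAGE = "cp1252"


def through_ansi_environ(path_text):
    """Simulate a Unicode path being squeezed through an ANSI environment."""
    # Characters the codepage cannot represent become '?'.
    return path_text.encode(CODEPAGE, "replace")


survives = through_ansi_environ("/caf\u00e9")          # e-acute exists in cp1252
mangled = through_ansi_environ("/\u65e5\u672c\u8a9e")  # three CJK characters do not

print(survives)  # b'/caf\xe9'
print(mangled)   # b'/???'
```

Once the path has become b'/???' there is nothing left to decode, whichever charset the application tries afterwards.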



win32api and ctypes would both let you call the Windows API.


Ah! I had considered the win32 extensions but it's a bit of a 
dependency... I'd forgotten that we get ctypes for free in 2.5.


So we'd be looking at:

ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)

when CPython 2.5+/NT is detected, right? That increases the number of 
situations in which we can feasibly recover URIs that are valid UTF-8 
sequences (modulo the slash anyway). Doing the actual recovery still 
requires some server-sniffing though.
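
Spelled out in full, the call might look like the sketch below. It is written for a current Python 3 rather than 2.5, the function name read_wide_env is hypothetical, and it returns None on anything that is not Windows. It only bypasses the CRT's ANSI copy of the environment; it cannot undo any mangling the web server did before setting the variable.

```python
import ctypes
import sys


def read_wide_env(name):
    """Read an environment variable through the wide-character Win32 API.

    Returns None when not running on Windows or when the variable is unset.
    """
    if sys.platform != "win32":
        return None
    get_env = ctypes.windll.kernel32.GetEnvironmentVariableW
    get_env.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_env.restype = ctypes.c_uint32
    # With a zero-length buffer the call reports the size needed,
    # including the terminating NUL, or 0 if the variable does not exist.
    size = get_env(name, None, 0)
    if size == 0:
        return None
    buf = ctypes.create_unicode_buffer(size)
    get_env(name, buf, size)
    return buf.value


path_info = read_wide_env("PATH_INFO")
```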



What is IIS doing wrong here?


It's not wrong as such. There are three reasonable choices for decoding 
header values before putting them in a Unicode environment, and the CGI 
spec, as it knows nothing about Unicode environment variables, fails to 
specify which:


1. ISO-8859-1 (which ensures bytes can be recovered)
2. UTF-8 (since most URIs are effectively UTF-8 today)
3. Configured system codepage (mbcs)

Apache [with mod_cgi or mod_wsgi] decides on (1). IIS tries for (2), 
falling back to (3) on invalid sequences. The text concerning Python 3.0 
in the WSGI Amendments page could be read as blessing Apache's behaviour.


However wsgiref.simple_server currently also goes for (2), although that 
probably can't be considered canonical. I'd be interested to know what 
other WSGI servers do.
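
To make the three choices concrete, here is a small sketch in Python 3. The function names are hypothetical, cp1252 is assumed as the system codepage, and the fallback logic only mirrors the behaviour described above; it is not taken from any server's source.

```python
def decode_latin1(raw):
    """Choice 1: every byte maps to one code point, so nothing is lost."""
    return raw.decode("iso-8859-1")


def decode_utf8_with_fallback(raw, codepage="cp1252"):
    """Choice 2, then choice 3: try UTF-8, fall back to a codepage."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(codepage, "replace")


utf8_path = "/caf\u00e9".encode("utf-8")      # b'/caf\xc3\xa9'
legacy_path = "/caf\u00e9".encode("cp1252")   # b'/caf\xe9', not valid UTF-8

# Choice 1 looks wrong on screen but can always be reversed to the bytes.
assert decode_latin1(utf8_path).encode("iso-8859-1") == utf8_path

# Choices 2 and 3 give readable text, but the application can no longer
# tell which of the two inputs it was actually sent.
assert decode_utf8_with_fallback(utf8_path) == "/caf\u00e9"
assert decode_utf8_with_fallback(legacy_path) == "/caf\u00e9"
```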


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-16 Thread Mark Hammond
 Python decodes the environ to its own copy (wrapped in os.environ) at
 interpreter startup time;

I don't think Python explicitly converts it - the CRT's ANSI version of environ 
is used, so the resulting strings should be encoded using the 'mbcs' encoding.  
What mangling do you see?

 there's no way to query the real ‘live’
 environment that I know of. It'd require a C extension.

win32api and ctypes would both let you call the Windows API.

 What I'm saying is that neither Apache's nor IIS's behaviour can be
 considered clearly correct or wrong at this point, and there is no way
 a WSGI adapter living underneath them *can* fix up the differences.

What is IIS doing wrong here?  IIUC, ISAPI treats everything as bytes, so it is 
more likely to be the higher-level layers built on ISAPI (eg, ASP) which 
assume encodings.

Apologies if you have already answered any of these - I haven’t been following 
that closely...

Cheers,

Mark



Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.


See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode. In other words, the values will be 
unicode, but that will simply be a lie.


Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example 
wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
it were working. (It is currently broken in 3.0rc2; I put a hack in to 
get it running but I'm not really sure what the current status of 
simple_server in 3.0 is.)


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
spec itself.


Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.


I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.


I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
pre-URI-decoding, containing only %-sequences and not real high bytes, 
so it can be decoded to Unicode using any old charset without worry.
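
A quick sketch of why the charset stops mattering, in Python 3. The REQUEST_URI value is made up for illustration, and UTF-8 is assumed only at the final step, where the application chooses it.

```python
from urllib.parse import unquote_to_bytes

request_uri = b"/wiki/caf%C3%A9?action=view"  # hypothetical value

# The value is pure ASCII, so every ASCII-compatible charset agrees on it.
decoded = {codec: request_uri.decode(codec)
           for codec in ("ascii", "iso-8859-1", "utf-8", "cp1252")}
assert len(set(decoded.values())) == 1

# The application undoes the %-escapes itself and picks the charset it wants.
path = decoded["ascii"].split("?", 1)[0]
path_bytes = unquote_to_bytes(path)
print(path_bytes)                  # b'/wiki/caf\xc3\xa9'
print(path_bytes.decode("utf-8"))  # /wiki/café
```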


An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework 
and supporting PATH_INFO in a portable fashion already has warts thanks 
to IIS's bugs, so the situation is not much worse than it already is.


And of course we get support through mod_cgi and mod_wsgi automatically, 
so Graham doesn't have to do anything. :-)


Graham Dumpleton wrote:


I can't really remember what the outcome of the discussion was.


Not too much outcome really, unfortunately! You concluded:


there possibly still is an open question there on how
encoding of non ascii characters works in practice. We just need to
do some actual tests to see what happens and whether there is a problem. 


...to which the answer is — judging by the results posted — probably 
“yes”, I'm afraid!


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Ian Bicking

Andrew Clover wrote:

Ian Bicking wrote:

As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.


See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.


This is something messed up with CGI on NT, and whatever server you are 
using, and perhaps the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?) -- it's mostly 
irrelevant to WSGI itself.


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode. In other words, the values will be 
unicode, but that will simply be a lie.


Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example 
wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
it were working. (It is currently broken in 3.0rc2; I put a hack in to 
get it running but I'm not really sure what the current status of 
simple_server in 3.0 is.)


As far as I know, PJE just made the suggestion about Latin-1, I don't 
know if anything has actually been done in wsgiref or elsewhere to 
implement that.  Honestly I don't know if anyone is doing anything with 
WSGI and Python 3.


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the 
WSGI spec itself.


Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.


This will presumably require hacks that might be system-dependent. 
Probably the current CGI adapter will just have to be a bit more 
complicated.  Also, if Python is utf8-decoding the environment, we'll 
just have to shortcut that entirely, as you can't just undo utf8.  I 
assume there is some way to get at the bytes in the environment, if not 
then that is a Python 3 bug.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.


I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.


The encoding of the operating system (which presumably informs the 
encoding of os.environ) has nothing to do with the encoding of the web 
application.  For the CGI adapter we simply need to find a way to ignore 
the system encoding.


I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
pre-URI-decoding, containing only %-sequences and not real high bytes, 
so it can be decoded to Unicode using any old charset without worry.


Unfortunately REQUEST_URI doesn't map directly to SCRIPT_NAME/PATH_INFO. 
I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on), but 
moving from the two keys to a single REQUEST_URI is not feasible.


It's not that trivial to figure out where in REQUEST_URI the 
SCRIPT_NAME/PATH_INFO boundary really is, as there's many ways the 
unencoded values could be encoded.  I guess you'd probably count 
segments, try to catch %2f (where the segments won't match up), and then 
double check that the decoded REQUEST_URI matches SCRIPT_NAME+PATH_INFO.


An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework 
and supporting PATH_INFO in a portable fashion already has warts thanks 
to IIS's bugs, so the situation is not much worse than it already is.


I use the distinction between SCRIPT_NAME and PATH_INFO extensively. 
And frankly IIS is probably less relevant to most developers than CGI. 
Anyway, any of these bugs are things that need to be fixed in the WSGI 
adapter, we must not let them propagate into the specification or 
applications.  So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.


And of course we get support through mod_cgi and mod_wsgi automatically, 
so 

Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

This is something messed up with CGI on NT, and whatever server you are 
using, and perhaps the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?)


Python decodes the environ to its own copy (wrapped in os.environ) at 
interpreter startup time; there's no way to query the real ‘live’ 
environment that I know of. It'd require a C extension.


Honestly I don't know if anyone is doing anything with 
WSGI and Python 3.


I know Graham has done some work on mod_wsgi for 3.0, but no, I don't 
know anyone using it in anger.


Is it worth submitting patches to simple_server to make it run on 3.0? 
Is it too late to include at this stage anyway? Shipping 3.0 with a 
non-functional wsgiref is a bit embarrassing.


I assume there is some way to get at the bytes in the environment, if not 
then that is a Python 3 bug.


There is not, and this appears to be deliberate.

I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on),

moving from the two keys to a single REQUEST_URI is not feasible.


That's certainly a possibility, but I feel it's easier to hitch a ride 
on the existing header, which despite being non-standard is still quite 
widely used.



I guess you'd probably count segments, try to catch %2f (where the
segments won't match up), and then double check that the decoded
REQUEST_URI matches SCRIPT_NAME+PATH_INFO.


I'm currently testing with just the segment counting. It's only 
necessary that the segments from SCRIPT_NAME are matched and stripped, 
and those are extremely unlikely to contain ‘%2F’ because:


  - there aren't many filesystems that can accept ‘/’ as a filename
    character. RISC OS is the only one I can think of, and it by
    convention swaps ‘/’ and ‘.’ to compensate as it is, so even
    there you couldn't use ‘%2F’;
  - there aren't many webservers that can map a file or alias to a
    path containing ‘%2F’;
  - no-one wants to mount a webapp alias at such a weird name — it's
    only in the section corresponding to PATH_INFO that ‘%2F’ might
    ever be of use in practice.

In the worst case, many applications already know and can strip the URL 
at which they're mounted, but unless there's a legitimate ‘%2F’ in their 
SCRIPT_NAME it doesn't actually matter.
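
The segment-counting idea can be sketched as below, in Python 3. The function name and sample values are hypothetical. The sketch assumes REQUEST_URI is a path, optionally with a query string (not an absolute URI), that SCRIPT_NAME contains no "%2F", and that the application wants UTF-8. It does not perform the double check against SCRIPT_NAME+PATH_INFO suggested earlier.

```python
from urllib.parse import unquote_to_bytes


def path_segments(request_uri, script_name, encoding="utf-8"):
    """Return the decoded PATH_INFO segments recovered from REQUEST_URI."""
    raw_path = request_uri.split("?", 1)[0]   # drop the query string
    segments = raw_path.split("/")[1:]        # skip the piece before the leading '/'
    mounted = script_name.count("/")          # number of SCRIPT_NAME segments to strip
    # Decode each remaining segment on its own so an escaped slash
    # stays inside its segment instead of creating a new one.
    return [unquote_to_bytes(s).decode(encoding) for s in segments[mounted:]]


print(path_segments("/app/run.py/caf%C3%A9/a%2Fb?x=1", "/app/run.py"))
# ['café', 'a/b']
```

Returning a list rather than a joined string is deliberate: joining with "/" would throw away the very distinction that reading REQUEST_URI was meant to preserve.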


frankly IIS is probably less relevant to most developers than CGI. 


Er... really?

You and I may not favour it, but it's ≈35% of the world out there, not 
something we can afford to ignore IMO.


So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.


What I'm saying is that neither Apache's nor IIS's behaviour can be 
considered clearly correct or wrong at this point, and there is no way a 
WSGI adapter living underneath them *can* fix up the differences.


(There is a problem with PATH_INFO that a WSGI adapter *could* clear 
up, which is that IIS makes PATH_INFO the entire path including 
SCRIPT_NAME. I'm not sure whether it's worth fixing that up in the 
adapter layer though... it's possible some frameworks are already 
dealing with it, and might even be relying on it!)
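
If an adapter did choose to fix this up, it might look something like the following Python 3 sketch. The function name is hypothetical, and it returns a copy of the environ instead of altering it in place, given the risk just mentioned that some frameworks rely on the existing behaviour.

```python
def strip_script_name(environ):
    """Return a copy of environ with SCRIPT_NAME removed from PATH_INFO's front."""
    script_name = environ.get("SCRIPT_NAME", "")
    path_info = environ.get("PATH_INFO", "")
    fixed = dict(environ)
    if script_name and path_info.startswith(script_name):
        rest = path_info[len(script_name):]
        # Only strip on a segment boundary, so '/app' does not eat '/apple'.
        if rest == "" or rest.startswith("/"):
            fixed["PATH_INFO"] = rest
    return fixed


iis_style = {"SCRIPT_NAME": "/app.py", "PATH_INFO": "/app.py/foo"}
print(strip_script_name(iis_style)["PATH_INFO"])  # /foo
```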


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


[Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Andrew Clover
It would be lovely if we could allow WSGI applications to reliably 
accept Unicode paths.


That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's, 
without requiring URL-rewriting magic. (Which is so highly 
server-specific, potentially unavailable to non-admin webmasters, and 
makes WSGI app deployment more difficult than it already is.)



If we could reliably read the bytes the browser sends to us in the GET 
request that would be great, we could just decode those and be done with 
it. Unfortunately, that's not reliable, because:


1. thanks to an old wart in the CGI specification, %XX hex escapes are 
decoded before the character is put into the PATH_INFO environment variable;


2. the environment variables may be stored as Unicode.

(1) on its own gives us the problem of not being able to distinguish a 
path-separator slash from an encoded %2F; a long-known problem but not 
one that greatly affects most people.
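
A two-line illustration of that ambiguity in Python 3, using made-up paths:

```python
from urllib.parse import unquote

two_segments = "/files/a%2Fb"   # 'files', then a single segment 'a/b'
three_segments = "/files/a/b"   # 'files', 'a', 'b'

# After the CGI-style decoding both requests look identical.
assert unquote(two_segments) == unquote(three_segments) == "/files/a/b"
```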


But combined with (2) that means some other component must choose how to 
decode the bytes into Unicode characters. No standard currently 
specifies what encoding to use, it is not typically configurable, and 
it's certainly not within reach of the WSGI application. My assumption 
is that most applications will want to end up with UTF-8-encoded URLs; 
other choices are certainly possible but as we move towards IRI they 
become less likely.



This situation previously affected only Windows users, because NT 
environment variables are native Unicode. However, Python 3.0 specifies 
all environment variable access is through a Unicode wrapper, and gives 
no way to control how that automatic decoding is done, leaving everyone 
in the same boat.


WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ 
should be decoded from the headers using HTTP standard encodings (i.e. 
latin-1 + RFC 2047), but unfortunately this doesn't quite work:


1. for many existing environments the decoding-from-headers charset is 
out of reach of the WSGI server/layer and may well not be ISO-8859-1. 
Even wsgiref doesn't currently use 8859-1 (see below).


2. RFC2047 is not applicable to HTTP headers, which are not really 
822-family headers even though they look just like them. The sub-headers 
in eg. a multipart/form-data chunk *are* (probably) proper 822 headers 
so RFC2047 could apply, but those headers are already dealt with by the 
application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to 
RFC2047 as an encoding mechanism for TEXT and quoted-string, but this 
makes no sense as 2047 itself requires embedding in atom-based parsing 
sequences which those productions are not (quoted-strings are explicitly 
disallowed by 2047 itself). In any case no existing browser attempts to 
support RFC2047 encoding rules for any possible interpretation of what 
2616 might mean.



Something like Luís Bruno's ORIGINAL_PATH_INFO proposal 
(http://mail.python.org/pipermail/web-sig/2008-January/003124.html) 
would be worth looking at for this IMO. It may be of questionable 
usefulness if the only character affected is the slash, but it also 
happens to solve the Unicode problem. Obviously whatever it was called 
it would have to be an optional additional value in the WSGI environ, as 
pure CGI servers wouldn't be able to supply it. Conceivably it might 
also be possible to have a standardised mod_rewrite rule to make the 
variable also available to Apache CGI scripts, but still this is far 
from global availability.


In the meantime I've been looking at how various combinations of servers 
deal with this issue, and in what circumstances an application or 
middleware can safely recover all possible Unicode input. 'Apache' 
refers to the (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 
'IIS' refers to IIS with CGI.



*** Apache/Posix/Python2
OK.

No problem here, it's byte-based all the way through.


*** Apache/Posix/Python3:
Dependent on the default encoding.

Apache puts bytes into the envvars but Python takes them out as unicode. 
If the system default encoding happens to be the same as the encoding 
the WSGI application wanted we will be OK. Normally the app will want 
UTF-8; many Linux distributions do use UTF-8 as the default system 
encoding but there are plenty of distros (eg. Debian) and other Unixen 
that do not. In any case we are getting a nasty system dependency at 
deploy time that many webmasters will not be able to resolve.


It is sometimes possible to recover mangled characters despite the wrong 
decoding having been applied. For example if the system encoding was 
ISO-8859-1 or another encoding that maps every byte to a unique Unicode 
character, we can encode the Unicode string back to its original bytes, 
and thence apply the decoding we actually wanted! If, on the other hand, 
it's something like ISO-8859-4, where not all high bytes are mapped at 
all, we'll be losing random characters... not good.
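
The difference between a recoverable and an unrecoverable wrong decoding can be sketched in Python 3 as follows. The helper name is hypothetical, and cp1252 is used purely as a convenient example of a codec in Python that leaves some high bytes unmapped.

```python
def round_trip(raw, system_encoding):
    """Decode with the 'wrong' system encoding, then try to undo it."""
    text = raw.decode(system_encoding, "replace")
    return text.encode(system_encoding, "replace")


raw = "/\u00c1vila".encode("utf-8")   # b'/\xc3\x81vila'

# ISO-8859-1 maps every byte to a unique character: fully reversible.
assert round_trip(raw, "iso-8859-1") == raw
assert round_trip(raw, "iso-8859-1").decode("utf-8") == "/\u00c1vila"

# cp1252 has no character for byte 0x81, so that byte is gone for good.
assert round_trip(raw, "cp1252") != raw
```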



*** Apache/NT/Python2
Always unrecoverable data loss.


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Ian Bicking

Andrew Clover wrote:
If we could reliably read the bytes the browser sends to us in the GET 
request that would be great, we could just decode those and be done with 
it. Unfortunately, that's not reliable, because:


1. thanks to an old wart in the CGI specification, %XX hex escapes are 
decoded before the character is put into the PATH_INFO environment 
variable;


I don't see a problem with this?  At least not a problem with respect to 
encoding.  As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.  It doesn't seem 
like there's any distinction between %-encoded characters and plain 
characters in this situation.



2. the environment variables may be stored as Unicode.

(1) on its own gives us the problem of not being able to distinguish a 
path-separator slash from an encoded %2F; a long-known problem but not 
one that greatly affects most people.


But combined with (2) that means some other component must choose how to 
decode the bytes into Unicode characters. No standard currently 
specifies what encoding to use, it is not typically configurable, and 
it's certainly not within reach of the WSGI application. My assumption 
is that most applications will want to end up with UTF-8-encoded URLs; 
other choices are certainly possible but as we move towards IRI they 
become less likely.



This situation previously affected only Windows users, because NT 
environment variables are native Unicode. However, Python 3.0 specifies 
all environment variable access is through a Unicode wrapper, and gives 
no way to control how that automatic decoding is done, leaving everyone 
in the same boat.


WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ 
should be decoded from the headers using HTTP standard encodings (i.e. 
latin-1 + RFC 2047), but unfortunately this doesn't quite work:


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode.  In other words, the values will be 
unicode, but that will simply be a lie.  So if you know you have UTF8 
paths, you'd do:


path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')

As far as I can tell this is simply to avoid having bytes in the 
environment, even though bytes are an accurate representation and 
unicode is not.
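
Wrapped up as a helper, that might look like the sketch below (Python 3). The function name is hypothetical, and the sketch assumes the server really did decode PATH_INFO as latin-1, which the test results elsewhere in this thread suggest is not yet something an application can rely on.

```python
def path_info_text(environ, encoding="utf-8"):
    """Re-decode a latin-1 'bytes in disguise' PATH_INFO as real text."""
    raw = environ.get("PATH_INFO", "").encode("iso-8859-1")
    return raw.decode(encoding, "replace")


# What a latin-1-decoding server would hand over for the UTF-8 path /café:
environ = {"PATH_INFO": "/caf\u00e9".encode("utf-8").decode("iso-8859-1")}

print(repr(environ["PATH_INFO"]))  # '/cafÃ©'
print(path_info_text(environ))     # /café
```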


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
spec itself.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.  Unfortunately that's not as flexible as bytes, as it 
doesn't make it very easy to sniff out the encoding in 
application-specific ways, or support different encodings in different 
parts of the server (which would be useful if, for instance, you were to 
proxy applications with unknown encodings).  So... maybe that's not the 
most feasible option.  But if it's not, then I'd rather stick with bytes.



--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Graham Dumpleton
FWIW, there was a past discussion on these issues on mod_wsgi list. I
can't really remember what the outcome of the discussion was. The
discussion is at:

  http://groups.google.com/group/modwsgi/browse_frm/thread/2471a1a71620629f

Graham

2008/11/13 Andrew Clover [EMAIL PROTECTED]:
 It would be lovely if we could allow WSGI applications to reliably accept
 Unicode paths.

 That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's,
 without requiring URL-rewriting magic. (Which is so highly server-specific,
 potentially unavailable to non-admin webmasters, and makes WSGI app
 deployment more difficult than it already is.)


 If we could reliably read the bytes the browser sends to us in the GET
 request that would be great, we could just decode those and be done with it.
 Unfortunately, that's not reliable, because:

 1. thanks to an old wart in the CGI specification, %XX hex escapes are
 decoded before the character is put into the PATH_INFO environment variable;

 2. the environment variables may be stored as Unicode.

 (1) on its own gives us the problem of not being able to distinguish a
 path-separator slash from an encoded %2F; a long-known problem but not one
 that greatly affects most people.

 But combined with (2) that means some other component must choose how to
 decode the bytes into Unicode characters. No standard currently specifies
 what encoding to use, it is not typically configurable, and it's certainly
 not within reach of the WSGI application. My assumption is that most
 applications will want to end up with UTF-8-encoded URLs; other choices are
 certainly possible but as we move towards IRI they become less likely.


 This situation previously affected only Windows users, because NT
 environment variables are native Unicode. However, Python 3.0 specifies all
 environment variable access is through a Unicode wrapper, and gives no way
 to control how that automatic decoding is done, leaving everyone in the same
 boat.

 WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should
 be decoded from the headers using HTTP standard encodings (i.e. latin-1 +
 RFC 2047), but unfortunately this doesn't quite work:

 1. for many existing environments the decoding-from-headers charset is out
 of reach of the WSGI server/layer and may well not be ISO-8859-1. Even
 wsgiref doesn't currently use 8859-1 (see below).

 2. RFC2047 is not applicable to HTTP headers, which are not really
 822-family headers even though they look just like them. The sub-headers in
 eg. a multipart/form-data chunk *are* (probably) proper 822 headers so
 RFC2047 could apply, but those headers are already dealt with by the
 application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to RFC2047
 as an encoding mechanism for TEXT and quoted-string, but this makes no sense
 as 2047 itself requires embedding in atom-based parsing sequences which
 those productions are not (quoted-strings are explicitly disallowed by 2047
 itself). In any case no existing browser attempts to support RFC2047
 encoding rules for any possible interpretation of what 2616 might mean.


 Something like Luís Bruno's ORIGINAL_PATH_INFO proposal
 (http://mail.python.org/pipermail/web-sig/2008-January/003124.html) would be
 worth looking at for this IMO. It may be of questionable usefulness if the
 only character affected is the slash, but it also happens to solve the
 Unicode problem. Obviously whatever it was called it would have to be an
 optional additional value in the WSGI environ, as pure CGI servers wouldn't
 be able to supply it. Conceivably it might also be possible to have a
 standardised mod_rewrite rule to make the variable also available to Apache
 CGI scripts, but still this is far from global availability.

 In the meantime I've been looking at how various combinations of servers
 deal with this issue, and in what circumstances an application or middleware
 can safely recover all possible Unicode input. 'Apache' refers to the
 (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 'IIS' refers to
 IIS with CGI.


 *** Apache/Posix/Python2
 OK.

 No problem here, it's byte-based all the way through.


 *** Apache/Posix/Python3:
 Dependent on the default encoding.

 Apache puts bytes into the envvars but Python takes them out as unicode. If
 the system default encoding happens to be the same as the encoding the WSGI
 application wanted we will be OK. Normally the app will want UTF-8; many
 Linux distributions do use UTF-8 as the default system encoding but there
 are plenty of distros (eg. Debian) and other Unixen that do not. In any case
 we are getting a nasty system dependency at deploy time that many webmasters
 will not be able to resolve.

 It is sometimes possible to recover mangled characters despite the wrong
 decoding having been applied. For example if the system encoding was
 ISO-8859-1 or another encoding that maps every byte to a unique Unicode
 character, we can encode the Unicode