Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Graham Dumpleton
FWIW, there was a past discussion on these issues on mod_wsgi list. I
can't really remember what the outcome of the discussion was. The
discussion is at:

  http://groups.google.com/group/modwsgi/browse_frm/thread/2471a1a71620629f

Graham

2008/11/13 Andrew Clover <[EMAIL PROTECTED]>:
> It would be lovely if we could allow WSGI applications to reliably accept
> Unicode paths.
>
> That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's,
> without requiring URL-rewriting magic. (Which is so highly server-specific,
> potentially unavailable to non-admin webmasters, and makes WSGI app
> deployment more difficult than it already is.)
>
>
> If we could reliably read the bytes the browser sends to us in the GET
> request that would be great, we could just decode those and be done with it.
> Unfortunately, that's not reliable, because:
>
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are
> decoded before the character is put into the PATH_INFO environment variable;
>
> 2. the environment variables may be stored as Unicode.
>
> (1) on its own gives us the problem of not being able to distinguish a
> path-separator slash from an encoded %2F; a long-known problem but not one
> that greatly affects most people.
>
> But combined with (2) that means some other component must choose how to
> decode the bytes into Unicode characters. No standard currently specifies
> what encoding to use, it is not typically configurable, and it's certainly
> not within reach of the WSGI application. My assumption is that most
> applications will want to end up with UTF-8-encoded URLs; other choices are
> certainly possible but as we move towards IRI they become less likely.
>
>
> This situation previously affected only Windows users, because NT
> environment variables are native Unicode. However, Python 3.0 specifies all
> environment variable access is through a Unicode wrapper, and gives no way
> to control how that automatic decoding is done, leaving everyone in the same
> boat.
>
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should
> be "decoded from the headers using HTTP standard encodings (i.e. latin-1 +
> RFC 2047)", but unfortunately this doesn't quite work:
>
> 1. for many existing environments the decoding-from-headers charset is out
> of reach of the WSGI server/layer and may well not be ISO-8859-1. Even
> wsgiref doesn't currently use 8859-1 (see below).
>
> 2. RFC2047 is not applicable to HTTP headers, which are not really
> 822-family headers even though they look just like them. The sub-headers in
> eg. a multipart/form-data chunk *are* (probably) proper 822 headers so
> RFC2047 could apply, but those headers are already dealt with by the
> application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to RFC2047
> as an encoding mechanism for TEXT and quoted-string, but this makes no sense
> as 2047 itself requires embedding in atom-based parsing sequences which
> those productions are not (quoted-strings are explicitly disallowed by 2047
> itself). In any case no existing browser attempts to support RFC2047
> encoding rules for any possible interpretation of what 2616 might mean.
>
>
> Something like Luís Bruno's ORIGINAL_PATH_INFO proposal
> (http://mail.python.org/pipermail/web-sig/2008-January/003124.html) would be
> worth looking at for this IMO. It may be of questionable usefulness if the
> only character affected is the slash, but it also happens to solve the
> Unicode problem. Obviously whatever it was called it would have to be an
> optional additional value in the WSGI environ, as pure CGI servers wouldn't
> be able to supply it. Conceivably it might also be possible to have a
> standardised mod_rewrite rule to make the variable also available to Apache
> CGI scripts, but still this is far from global availability.
>
> In the meantime I've been looking at how various combinations of servers
> deal with this issue, and in what circumstances an application or middleware
> can safely recover all possible Unicode input. 'Apache' refers to the
> (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 'IIS' refers to
> IIS with CGI.
>
>
> *** Apache/Posix/Python2
> OK.
>
> No problem here, it's byte-based all the way through.
>
>
> *** Apache/Posix/Python3:
> Dependent on the default encoding.
>
> Apache puts bytes into the envvars but Python takes them out as unicode. If
> the system default encoding happens to be the same as the encoding the WSGI
> application wanted we will be OK. Normally the app will want UTF-8; many
> Linux distributions do use UTF-8 as the default system encoding but there
> are plenty of distros (eg. Debian) and other Unixen that do not. In any case
> we are getting a nasty system dependency at deploy time that many webmasters
> will not be able to resolve.
>
> It is sometimes possible to recover mangled characters despite the wrong
> decoding having been applied. For example if the system encoding was
> ISO-8859-1 or anot

Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Ian Bicking

Andrew Clover wrote:
> If we could reliably read the bytes the browser sends to us in the GET
> request that would be great, we could just decode those and be done with
> it. Unfortunately, that's not reliable, because:
>
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are
> decoded before the character is put into the PATH_INFO environment
> variable;


I don't see a problem with this?  At least not a problem with respect to 
encoding.  As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.  It doesn't seem 
like there's any distinction between %-encoded characters and plain 
characters in this situation.



> 2. the environment variables may be stored as Unicode.
>
> (1) on its own gives us the problem of not being able to distinguish a
> path-separator slash from an encoded %2F; a long-known problem but not
> one that greatly affects most people.
>
> But combined with (2) that means some other component must choose how to
> decode the bytes into Unicode characters. No standard currently
> specifies what encoding to use, it is not typically configurable, and
> it's certainly not within reach of the WSGI application. My assumption
> is that most applications will want to end up with UTF-8-encoded URLs;
> other choices are certainly possible but as we move towards IRI they
> become less likely.
>
> This situation previously affected only Windows users, because NT
> environment variables are native Unicode. However, Python 3.0 specifies
> all environment variable access is through a Unicode wrapper, and gives
> no way to control how that automatic decoding is done, leaving everyone
> in the same boat.
>
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ
> should be "decoded from the headers using HTTP standard encodings (i.e.
> latin-1 + RFC 2047)", but unfortunately this doesn't quite work:


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode.  In other words, the values will be 
unicode, but that will simply be a lie.  So if you know you have UTF8 
paths, you'd do:


path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')

As far as I can tell this is simply to avoid having bytes in the 
environment, even though bytes are an accurate representation and 
unicode is not.
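
To make the round trip concrete, here is a minimal Python 3 sketch. It assumes a server that follows the latin-1 suggestion; the sample path and the hand-built environ dict are illustrative only, not taken from any real server.

```python
# Bytes as they might arrive after %XX unescaping (sample path is an assumption).
raw_bytes = "/wiki/Café".encode("utf-8")

# The server exposes them as a str by decoding with latin-1: one character per
# byte, which is the "lie" described above.
environ = {"PATH_INFO": raw_bytes.decode("latin-1")}

# latin-1 maps every byte 0x00-0xFF to the code point of the same value, so
# encoding back is lossless and yields the original bytes.
recovered = environ["PATH_INFO"].encode("latin-1")

# Only now apply the encoding the application actually wants.
path_info = recovered.decode("utf-8")
```

The intermediate str is mojibake and should not be shown to users; it is only a carrier for the bytes.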


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
spec itself.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.  Unfortunately that's not as flexible as bytes, as it 
doesn't make it very easy to sniff out the encoding in 
application-specific ways, or support different encodings in different 
parts of the server (which would be useful if, for instance, you were to 
proxy applications with unknown encodings).  So... maybe that's not the 
most feasible option.  But if it's not, then I'd rather stick with bytes.



--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org


[Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Andrew Clover
It would be lovely if we could allow WSGI applications to reliably 
accept Unicode paths.


That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's, 
without requiring URL-rewriting magic. (Which is so highly 
server-specific, potentially unavailable to non-admin webmasters, and 
makes WSGI app deployment more difficult than it already is.)



If we could reliably read the bytes the browser sends to us in the GET 
request that would be great, we could just decode those and be done with 
it. Unfortunately, that's not reliable, because:


1. thanks to an old wart in the CGI specification, %XX hex escapes are 
decoded before the character is put into the PATH_INFO environment variable;


2. the environment variables may be stored as Unicode.

(1) on its own gives us the problem of not being able to distinguish a 
path-separator slash from an encoded %2F; a long-known problem but not 
one that greatly affects most people.
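
A small Python 3 sketch of the slash ambiguity in (1). The two sample paths and the helper name split_raw are made up for illustration; urllib.parse.unquote stands in for the unescaping the CGI layer performs before PATH_INFO is set.

```python
from urllib.parse import unquote

# Two different request paths as they might appear on the wire.
two_segments = "/docs/a/b"     # three segments: "docs", "a", "b"
one_segment = "/docs/a%2Fb"    # two segments: "docs", "a/b"

# After CGI-style unescaping the two are indistinguishable.
unescaped_a = unquote(two_segments)
unescaped_b = unquote(one_segment)

def split_raw(path):
    """Split on literal slashes first, then unescape each segment."""
    return [unquote(seg) for seg in path.lstrip("/").split("/")]
```

Splitting before unescaping keeps the distinction, but that is only possible when the application can see the path as it was sent.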


But combined with (2) that means some other component must choose how to 
decode the bytes into Unicode characters. No standard currently 
specifies what encoding to use, it is not typically configurable, and 
it's certainly not within reach of the WSGI application. My assumption 
is that most applications will want to end up with UTF-8-encoded URLs; 
other choices are certainly possible but as we move towards IRI they 
become less likely.



This situation previously affected only Windows users, because NT 
environment variables are native Unicode. However, Python 3.0 specifies 
all environment variable access is through a Unicode wrapper, and gives 
no way to control how that automatic decoding is done, leaving everyone 
in the same boat.


WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ 
should be "decoded from the headers using HTTP standard encodings (i.e. 
latin-1 + RFC 2047)", but unfortunately this doesn't quite work:


1. for many existing environments the decoding-from-headers charset is 
out of reach of the WSGI server/layer and may well not be ISO-8859-1. 
Even wsgiref doesn't currently use 8859-1 (see below).


2. RFC2047 is not applicable to HTTP headers, which are not really 
822-family headers even though they look just like them. The sub-headers 
in eg. a multipart/form-data chunk *are* (probably) proper 822 headers 
so RFC2047 could apply, but those headers are already dealt with by the 
application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to 
RFC2047 as an encoding mechanism for TEXT and quoted-string, but this 
makes no sense as 2047 itself requires embedding in atom-based parsing 
sequences which those productions are not (quoted-strings are explicitly 
disallowed by 2047 itself). In any case no existing browser attempts to 
support RFC2047 encoding rules for any possible interpretation of what 
2616 might mean.
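
For readers who have not met RFC 2047, this is roughly what an encoded-word looks like and how the Python 3 standard library decodes one in a mail context. The sample value is invented; the sketch only illustrates the syntax and does not imply that any HTTP client sends it.

```python
from email.header import decode_header

# An RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
encoded_word = "=?utf-8?q?caf=C3=A9?="

# decode_header returns a list of (value, charset) pairs; for an encoded-word
# the value is bytes and the charset names how to decode them.
(raw, charset), = decode_header(encoded_word)
text = raw.decode(charset)
```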



Something like Luís Bruno's ORIGINAL_PATH_INFO proposal 
(http://mail.python.org/pipermail/web-sig/2008-January/003124.html) 
would be worth looking at for this IMO. It may be of questionable 
usefulness if the only character affected is the slash, but it also 
happens to solve the Unicode problem. Obviously whatever it was called 
it would have to be an optional additional value in the WSGI environ, as 
pure CGI servers wouldn't be able to supply it. Conceivably it might 
also be possible to have a standardised mod_rewrite rule to make the 
variable also available to Apache CGI scripts, but still this is far 
from global availability.
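
A rough Python 3 sketch of how an application might consume such a value. Everything here is an assumption for illustration: that the key is named ORIGINAL_PATH_INFO, that it carries the still-percent-escaped path as a str, and that the helper is called path_segments. None of this is specified anywhere yet.

```python
from urllib.parse import unquote_to_bytes

def path_segments(environ, encoding="utf-8"):
    """Return path segments, preferring an undecoded path when one is supplied."""
    raw = environ.get("ORIGINAL_PATH_INFO")  # hypothetical optional key
    if raw is not None:
        # Split on literal slashes first, then unescape each segment to bytes
        # and decode with the encoding the application chose.
        return [unquote_to_bytes(seg).decode(encoding)
                for seg in raw.lstrip("/").split("/")]
    # Fallback for servers that cannot supply the key: PATH_INFO is already
    # unescaped, so an encoded %2F can no longer be told apart from a slash.
    return environ.get("PATH_INFO", "").lstrip("/").split("/")
```

With the raw path available, both the slash problem and the choice of encoding stay in the application's hands.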


In the meantime I've been looking at how various combinations of servers 
deal with this issue, and in what circumstances an application or 
middleware can safely recover all possible Unicode input. 'Apache' 
refers to the (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 
'IIS' refers to IIS with CGI.



*** Apache/Posix/Python2
OK.

No problem here, it's byte-based all the way through.


*** Apache/Posix/Python3:
Dependent on the default encoding.

Apache puts bytes into the envvars but Python takes them out as unicode. 
If the system default encoding happens to be the same as the encoding 
the WSGI application wanted we will be OK. Normally the app will want 
UTF-8; many Linux distributions do use UTF-8 as the default system 
encoding but there are plenty of distros (eg. Debian) and other Unixen 
that do not. In any case we are getting a nasty system dependency at 
deploy time that many webmasters will not be able to resolve.


It is sometimes possible to recover mangled characters despite the wrong 
decoding having been applied. For example if the system encoding was 
ISO-8859-1 or another encoding that maps every byte to a unique Unicode 
character, we can encode the Unicode string back to its original bytes, 
and thence apply the decoding we actually wanted! If, on the other hand, 
it's something like ISO-8859-4, where not all high bytes are mapped at 
all, we'll be losing random characters... not good.
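
The two cases can be sketched in Python 3 as follows. The sample path is invented, and ASCII with errors="replace" is used purely as a stand-in for "a codec that does not map every byte"; it is not a claim about how any particular server or codec behaves.

```python
original = "/wiki/Café".encode("utf-8")

# Case 1: the wrong decoding was ISO-8859-1. Every byte maps to a unique
# character, so the mistake can be undone and the right decoding applied.
mangled = original.decode("iso-8859-1")
recovered = mangled.encode("iso-8859-1").decode("utf-8")

# Case 2: the wrong decoding has unmapped bytes (ASCII as a stand-in). Each
# unmapped byte collapses into U+FFFD, so the original bytes are gone.
lossy = original.decode("ascii", errors="replace")
```

In the second case nothing an application or middleware does afterwards can tell which bytes were originally sent.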



*** Apache/NT/Python2
Always unrecoverable data loss.