On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <graham.dumple...@gmail.com> wrote:
> 2009/8/12 Ian Bicking <i...@colorstudy.com>:
> > On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fuman...@aminus.org> wrote:
> >>
> >> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> >> > server variables as strings. Where such values are sourced from a byte
> >> > string, be that a Python byte string or C string, they should be
> >> > converted as 'UTF-8'. If a specific web server infrastructure is able
> >> > to support different encodings, then the WSGI adapter MAY provide a
> >> > way for a user of the WSGI adapter to customise on a global basis, or
> >> > on a per value basis what encoding is used, but this is entirely
> >> > optional. Note that there is no requirement to deal with RFC 2047.
> >>
> >> We're passing unicode for almost everything.
> >>
> >> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> >> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> >> ACTUAL_SERVER_PROTOCOL entries.
> >>
> >> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> >> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> >> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> >> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> >> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> >> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> >> it, we would make it decoded by the same charset.
> >
> > My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> > encoding a page might be in. At least that's what I got when testing
> > Firefox. It might not be valid UTF-8 if it was manually constructed, but
> > then there's little reason to think it is valid anything; only the bytes or
> > REQUEST_URI are likely to be an accurate representation.
>
> As I understood it, PJE was suggesting that wasn't the case.
>
> For example, what about case where URL appears for target of form POST
> and the encoding of that form page wasn't UTF-8. What is the browser
> going to send in that case.
>
> Or is this the sort of case you have tested and qualify as saying if
> manually constructed anything could happen?
>

Correct -- you can write any set of % encodings, and I don't think it even
has to be able to validly url-decode (e.g., /foo%zzz will work). It
definitely doesn't have to be a valid encoding.

However, if you actually include unicode characters, they will always be
encoded as UTF-8 (in line with the IRI standard). For example, given
<a href="/some page">, the browser will request /some%20page, because it
escapes unsafe characters. Similarly, for <a href="/français"> it will
encode the ç as UTF-8 and then url-encode it, even if the page itself is
ISO-8859-1. Well, at least on Firefox.

I used this to test:
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

-- 
Ian Bicking | http://blog.ianbicking.org | http://topplabs.org/civichacker
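
For anyone who wants to poke at this without a browser, here is a rough
Python 3 sketch of the two behaviours discussed above: the browser-side
escaping (encode as UTF-8, then percent-encode) and a server-side decode
that tries a configured charset and falls back to ISO-8859-1, recording
which one worked. The function names browser_style_escape and decode_path
are made up for illustration; this is not Firefox's or any server's actual
code, only an approximation built on the standard urllib.parse module.

```python
from urllib.parse import quote, unquote_to_bytes


def browser_style_escape(path):
    """Hypothetical helper: mimic how a browser escapes a link target.

    Non-ASCII characters are encoded as UTF-8 and then percent-encoded,
    regardless of the encoding of the page that contained the link.
    """
    return quote(path, safe="/", encoding="utf-8")


def decode_path(raw_path, charset="utf-8"):
    """Hypothetical helper: decode a percent-encoded path on the server.

    Returns (decoded_text, encoding_used). The configured charset is tried
    first; ISO-8859-1 is the fallback because it can decode any byte
    sequence. The second element is the kind of value a server could
    expose to apps as environ['REQUEST_URI_ENCODING'].
    """
    # Invalid escapes such as %zz are left untouched rather than rejected.
    raw = unquote_to_bytes(raw_path)
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    for target in ("/some page", "/français"):
        print(target, "->", browser_style_escape(target))
    for requested in ("/fran%C3%A7ais", "/fran%E7ais", "/foo%zzz"):
        print(requested, "->", decode_path(requested))
```

A hand-written /fran%E7ais is not valid UTF-8, so it only decodes through
the ISO-8859-1 fallback, while /foo%zzz passes through unchanged because
the malformed escape is never interpreted.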
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com