Graham Dumpleton wrote:
> 2009/4/2 Graham Dumpleton <graham.dumple...@gmail.com>:
>> Is there going to be any simple answer to all of this? :-(
>
> I am slowly working through what I think I at least need to do for
> Apache/mod_wsgi. I'll give a summary of what I have worked out so far
> based on the discussions and my own research.
>
> Just so I have a list of things to check off, I include an example
> WSGI environment from a request and make comments about each category
> of things from it.
>
> First off is CGI HTTP variables.
>
> HTTP_ACCEPT:
> 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'
> HTTP_ACCEPT_ENCODING: 'gzip, deflate'
> HTTP_ACCEPT_LANGUAGE: 'en-us'
> HTTP_CONNECTION: 'keep-alive'
> HTTP_HOST: 'home.dscpl.com.au'
> HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
> en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1
> Safari/525.27.1'
>
> The rule here from WSGI 1.0 amendments page in relation to Python 3.0 is:
>
> """When running under Python 3, servers MUST provide CGI HTTP
> variables as strings, decoded from the headers using HTTP standard
> encodings (i.e. latin-1 + RFC 2047)"""
>
> Which is fair enough and basically what the RFCs say. At the moment I
> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just
> need to do that.
>
> An interesting one here to note is HTTP_HOST. The issue with this one
> is what would happen for a unicode host name. For Apache an IDNA
> (RFC3490) encoded host name has to be used to identify a site with
> unicode host name. That is, one uses the IDNA name for ServerName or
> ServerAlias directives.
>
> When one gets a request one would actually see the IDNA name for
> HTTP_HOST and that only uses latin-1 characters.
> For example:
>
> HTTP_HOST: 'xn--wgbe9chb01aytce.com'
>
> These resolve in DNS okay:
>
> $ nslookup xn--wgbe9chb01aytce.com
> Server: 192.168.1.254
> Address: 192.168.1.254#53
>
> Non-authoritative answer:
> Name: xn--wgbe9chb01aytce.com
> Address: 208.78.242.184
>
> Using HTTP live headers on Firefox can also confirm that that is what
> would be sent:
>
> Host: xn--wgbe9chb01aytce.com
>
> My understanding is that if an actual unicode string is given to a
> browser, that it should translate it to the IDNA name before use.
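
To make the IDNA point above concrete, here is a minimal sketch using
Python 3's built-in "idna" codec (which follows RFC 3490). The helper
names and the "bücher.example" host are made up for illustration; they
are not part of mod_wsgi or of anything proposed in this thread.

```python
def host_to_wire(hostname: str) -> bytes:
    """Encode a (possibly unicode) host name to its ASCII IDNA form."""
    # Hypothetical helper: the built-in "idna" codec does the work.
    return hostname.encode("idna")


def host_from_environ(raw: bytes) -> str:
    """Decode the Host header bytes the way a server might for HTTP_HOST."""
    # IDNA names are pure ASCII, so latin-1 decoding cannot mangle them.
    return raw.decode("latin-1")


wire = host_to_wire("bücher.example")
print(wire)                     # b'xn--bcher-kva.example'
print(host_from_environ(wire))  # xn--bcher-kva.example
print(wire.decode("idna"))      # bücher.example
```

The only form that reaches the server is the ASCII one, which matches the
quoted expectation that a client translates a unicode name to its IDNA
name before use.
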
That is what the RFCs require, and in any case un-encoded unicode can't
be written onto a socket.

> Next HTTP header to worry about is HTTP_REFERER.
>
> There would be two parts to this, there would be the host name
> component and then the path component.
>
> We already know from above that for unicode host name it should be the
> IDNA name.
>
> For the path component, if the client follows the rules properly, then
> if the path uses a non latin-1 encoding, then it should be using RFC
> 2047 to indicate this so shouldn't have to do anything different and
> use same rule as other HTTP headers. For this header we are actually
> in a better situation than for URL in actual HTTP request line which
> isn't so specific about encodings.
>
> GATEWAY_INTERFACE: 'CGI/1.1'
> SERVER_PROTOCOL: 'HTTP/1.1'
>
> Standard stuff which is always going to be latin-1, so encode as that.

I think you mean 'decode' here?  Unicode strings are encoded to get
bytes; bytes are decoded to get unicode strings. Also, I don't know of
any reason why those values would be anything but ASCII.

> REMOTE_ADDR: '192.168.1.5'
> REMOTE_PORT: '51378'
> SERVER_PORT: '80'
> SERVER_ADDR: '192.168.1.5'
>
> Again, latin-1 is okay.

Likewise, these can't be anything but ASCII.

> SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l
> DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1'
>
> Again, latin-1 is okay as Apache modules internally can only supply
> normal C strings to add stuff to this.
>
> SERVER_NAME: 'home.dscpl.com.au'
>
> Same as HTTP_HOST and if a unicode host name would be IDNA encoded, so
> can use latin-1 okay.
>
> SERVER_ADMIN: 'y...@example.com'
>
> This is set by ServerAdmin directive. Because in Apache configuration
> is effectively latin-1, probably can't even define a non latin-1 email
> address. For host part, probably IDNA encoded anyway, so restriction
> on latin-1 only perhaps pertinent to user part of email address. So,
> latin-1 should be okay.
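
For the values discussed so far, the "decode, not encode" direction and
the "latin-1 + RFC 2047" rule can be sketched roughly as below. This is
only an illustration under my own assumptions: decode_header_value is a
hypothetical name, and email.header.decode_header from the standard
library stands in for whatever RFC 2047 handling a server would really
use.

```python
from email.header import decode_header


def decode_header_value(raw: bytes) -> str:
    """Turn raw header bytes into a str: latin-1 first, then RFC 2047."""
    # latin-1 maps every byte to a code point, so this step cannot fail.
    text = raw.decode("latin-1")
    pieces = []
    for value, charset in decode_header(text):
        # decode_header yields bytes for encoded words and str otherwise.
        if isinstance(value, bytes):
            value = value.decode(charset or "latin-1", errors="replace")
        pieces.append(value)
    return "".join(pieces)


print(decode_header_value(b"gzip, deflate"))          # gzip, deflate
print(decode_header_value(b"=?utf-8?q?caf=C3=A9?="))  # café
```

Pure-ASCII values such as SERVER_PROTOCOL or REMOTE_ADDR pass through
unchanged, which is why the choice between ASCII and latin-1 makes no
practical difference for them.
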
>
> SERVER_SIGNATURE: ''
>
> Depending on Apache configuration can be server name and version
> information or server admin email address. All latin-1.
>
> DOCUMENT_ROOT: '/Library/WebServer/Documents'
> SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi'
>
> These are file system paths, and since the Apache Runtime Library used
> for Apache 2.X has a define for whether file system supports unicode,
> can say:
>
> #if APR_HAS_UNICODE_FS
>     charset = "UTF-8";
> #else
>     charset = "ISO-8859-1";
> #endif

I'm not sure that works for arbitrary filesystem configurations: some
parts of the tree may be mounted from locations with different
encodings. See David Wheeler's analysis for more:

  http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

> For Apache 1.3, which doesn't have that define AFAIK, might just have
> to assume latin-1, but possibly another way of doing it, or Apache 1.3
> might have its own define for it.
>
> PATH: '/usr/bin:/bin:/usr/sbin:/sbin'
>
> Presume I can use APR_HAS_UNICODE_FS check again even though it is a
> combination of paths.
>
> REQUEST_METHOD: 'GET'
>
> Presume they will always use latin-1 for these.

RFC 2616, section 5.1.1 defines only ASCII methods; extension methods
are 'tokens', which must also be printable ASCII w/o separators
(section 2.2).

> All that is now left is the following, which we have already been discussing.
>
> REQUEST_URI: '/~grahamd/echo.wsgi'
> SCRIPT_NAME: '/~grahamd/echo.wsgi'
> PATH_INFO: ''
> QUERY_STRING: ''
>
> At least I am happy that except for these four, that there shouldn't
> be any issues.
>
> I'll keep watching what others come up with in respect of these and
> see what consensus develops. :-)


Tres.
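
On the filesystem path concern above: one way to avoid betting on a
single charset is to decode path bytes so that undecodable bytes survive
a round trip. The sketch below assumes Python 3.1 or later and its
"surrogateescape" error handler (PEP 383); nothing in this thread
proposes it, and the helper names are hypothetical.

```python
def decode_path(raw: bytes, charset: str = "utf-8") -> str:
    """Decode path bytes, keeping bytes that are invalid in `charset`."""
    # Invalid bytes become lone surrogates instead of raising an error.
    return raw.decode(charset, errors="surrogateescape")


def encode_path(text: str, charset: str = "utf-8") -> bytes:
    """Reverse decode_path, restoring the original bytes exactly."""
    return text.encode(charset, errors="surrogateescape")


# A latin-1 encoded name sitting under an otherwise UTF-8 tree.
raw = b"/srv/caf\xe9/index.html"
assert encode_path(decode_path(raw)) == raw
```

The decoded string is not printable as-is when it contains escaped
bytes, but it can be handed back to the operating system without loss.
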
-- 
===================================================================
Tres Seaver          +1 540-429-0999          tsea...@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig