Re: [Web-SIG] resources for porting wsgi apps from python 2 to 3
On 01/10/12 18:07, chris.d...@gmail.com wrote:

> * Use bytes or str for environ keys?
> * Use bytes or str for environ values?

str, decoded from the request bytes using ISO-8859-1.

> * Are all environ values created equal or would, for example, QUERY_STRING's value (prior to any parameter decoding) be handled differently from HTTP_COOKIE

All environ values are created equal (other than the CGI-mandated odd decoding behaviour of SCRIPT_NAME and PATH_INFO).

> * If str, I see that ISO-8859-1 is the assumed encoding. How much hurt occurs in the world if I just assume utf-8 when decoding to str[4]?

Immediately, all non-ASCII characters in the path would be interpreted incorrectly.

The more general hurt to the world would be that we would continue the sad pre-PEP situation where every web server handles non-ASCII characters differently, and so no WSGI application can reliably use Unicode in path segments.

There is little impact to any header other than the path, because non-ASCII characters almost never appear in them. The query string remains %-encoded so any non-ASCII characters are safe. The other places users can put non-ASCII characters are in cookies and HTTP Basic Authorisation headers, but browser support here is so variable/broken that Python's handling would be the least of your worries.

> [4] Which is what it should have been all along?

Not necessarily. Even if you decide that all web apps must use UTF-8 for text encoding, it's valid to have URL-encoded, non-text binary data in a path segment. This would be unrecoverable using straight UTF-8.

(They would be recoverable if surrogateescape were used, but the PEP has to encompass language versions that don't have surrogateescape, and also it's questionable whether it should be possible to smuggle non-UTF-8 data into strings that applications assume are safe.)

Plus header values are less likely to be UTF-8, and HTTP specifies that they're ISO-8859-1 (even if that is not well-observed by browsers).

Ideally, the interfaces should all be bytes, because HTTP is defined in terms of bytes. But that plays poorly with Python 3's default Unicode strs (for environ et al). So ISO-8859-1 was chosen as a str interface for which the original bytes can at least be recovered.

> * Should start_response only accept bytes (and error if not), or should it also accept str and encode appropriately?

status and response_headers are, like the request headers, native str (to be ISO-8859-1 encoded). It's only the HTTP entity body that is always bytestring.

> * Should the returned iterable be rejected or encoded if not bytes?

I don't think it's specified by the PEP, but wsgiref looks like it'll chuck TypeError when it tries to write str to the buffer/socket.

cheers,

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ gtalk:chat?jid=bobi...@gmail.com
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
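The point that ISO-8859-1 is a str interface from which the original bytes can be recovered can be illustrated with a short sketch. The helper name recode_path is made up for this example, and the choice of UTF-8 as the application's real encoding is an assumption; the simulated environ stands in for what a PEP 3333 server would pass.

```python
def recode_path(value, encoding="utf-8"):
    """Recover the request bytes from a WSGI native string, then re-decode.

    recode_path is a hypothetical helper, not part of any WSGI library.
    """
    # ISO-8859-1 maps code points 0-255 one-to-one onto bytes, so this
    # step is lossless for any string a compliant server hands over.
    raw = value.encode("iso-8859-1")
    # The application decides which encoding it believes the client used.
    return raw.decode(encoding, "replace")


# Simulate a server receiving the UTF-8 bytes for "/café" and exposing
# them as an ISO-8859-1 decoded str.
wire_bytes = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": wire_bytes.decode("iso-8859-1")}

path = recode_path(environ["PATH_INFO"])
```

Here environ["PATH_INFO"] holds the mojibake string "/cafÃ©", while path is the intended "/café". Binary path segments that are not valid UTF-8 are still available through the first step alone.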
Re: [Web-SIG] wsgiref 0.2 dev in svn w/PEP 3333 support
On 10/06/2010 07:21 PM, P.J. Eby wrote:

> How would these relate to the Python 3.2 release? Can you make 3.x and 2.x versions?

Yes, I have separate fixup code paths for 2.x and 3.x.

3.x faces the reverse situation to that previously described, in that os.environ is accurate on Windows but needs reverse-decoding on POSIX. Currently I use utf-8 and surrogateescape, but for Python 3.2 presumably os.environb will be the safer bet.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
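As a rough sketch of the POSIX reverse-decoding mentioned above: if the interpreter decoded an environment variable with UTF-8 and surrogateescape, the original bytes can be recovered by encoding the same way, and then exposed as an ISO-8859-1 string. The function name and the assumption that the filesystem encoding is UTF-8 are illustrative only; on Python 3.2+ on POSIX, reading os.environb directly avoids the round trip.

```python
def environ_str_to_wsgi(value, fs_encoding="utf-8"):
    """Undo a surrogateescape decode and re-expose the bytes as ISO-8859-1.

    Hypothetical helper; fs_encoding is assumed to match the encoding the
    interpreter used when it built os.environ.
    """
    raw = value.encode(fs_encoding, "surrogateescape")
    return raw.decode("iso-8859-1")


# Simulate what os.environ would contain for a path whose bytes are not
# valid UTF-8 (a lone 0xE9, i.e. Latin-1 "é").
as_seen_in_os_environ = b"/caf\xe9".decode("utf-8", "surrogateescape")
wsgi_value = environ_str_to_wsgi(as_seen_in_os_environ)
```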
Re: [Web-SIG] Most WSGI servers close connections too early.
On 09/22/2010 02:46 PM, Marcel Hellkamp wrote:

> An application should read all available data from `environ['wsgi.input']` on POST or PUT requests, even if it does not process that data. Otherwise, the client might fail to complete the request and not display the response.

Oh, it's worse than that. In practice the application needs to read all available data from the request body before producing output.

If you send too much response without reading the whole request body in some environments, you can deadlock. The web server is buffering the input stream for the request body and also the output stream from the app. This needs to be done[1] to avoid sending an HTTP response before the request is complete.

If those are limited-size buffers[2] and you fill the output buffer with response without clearing enough of the input buffer that the browser can finish sending the request, you'll be blocking indefinitely on write.

[1] possibly unless HTTP pipelining is in effect? not sure, haven't tested.

[2] and certainly in IIS they are. The output buffer is 8K IIRC. It's easy to overflow that and get a mysterious non-responsive script because an error happens and spits out a debugging page before the form-reading library has had a chance to consume the input.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
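One way to guarantee the request body is consumed before any output is produced is a small piece of middleware that reads it up front. The following is a minimal sketch only: buffer_request_body is a made-up name, it trusts CONTENT_LENGTH, does not handle chunked request bodies, and holds the whole body in memory, which a real deployment would want to cap.

```python
import io


def buffer_request_body(app):
    """Read the whole request body before the wrapped app can respond."""
    def wrapper(environ, start_response):
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        # Drain the server's input stream first, so the client can finish
        # sending even if the app never looks at the body.
        body = environ["wsgi.input"].read(length) if length > 0 else b""
        # Hand the app a replayable in-memory copy.
        environ["wsgi.input"] = io.BytesIO(body)
        return app(environ, start_response)
    return wrapper


def failing_app(environ, start_response):
    # Stands in for an app that emits a large error page without ever
    # touching wsgi.input.
    start_response("500 Internal Server Error", [("Content-Type", "text/plain")])
    return [b"debugging page " * 1000]
```

Wrapping an application as buffer_request_body(failing_app) means the input has already been drained by the time the large response starts, which sidesteps the write-blocking scenario described above.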
Re: [Web-SIG] PEP 444 (aka Web3)
On 09/17/2010 04:21 AM, Ian Bicking wrote:

> Yes, if we get rid of SCRIPT_NAME/PATH_INFO then the problem goes away. For servers without access to the unencoded value, reencoding those values doesn't actually lose any information over what we have now, and avoids any encoding issues.

It doesn't lose any information, but it also makes script_name/path_info inherently unreliable.

My fear is that if gateways are allowed to create a reconstructed script_name/path_info without clearly signalling they have done so, those values will continue to be unreliable at all times and server authors won't feel the need to get it right since it's broken everywhere anyway: the unhappy status quo.

This is why I am continuing to plead for a 'script_name/path_info are authoritative' flag in environ that applications can use to detect situations where it is safe to go ahead and rely on them. I want to say "Unicode paths are supported if your server/gateway does", not "Unicode paths might sometimes work, depending on how you configure your server and application".

It is not just CGI that is affected here! IIS does not provide the original undecoded path at all, even through ISAPI.

At the moment I am using a 'fixPathInfo' method in my form-reading layer to try to compensate as much as possible for the problems of CGI:

- on Python 2 on Windows, re-read the environment variables using ctypes if available, to avoid the mangling caused by reading os.environ using mbcs. (This didn't use to work, as old versions of IIS deliberately mbcs-filtered values before putting them in the environment, but it does now.)

- on Python 3 on POSIX, re-read the environment variables using environb if available. Otherwise try to reverse the faulty decoding of environ using surrogateescapes, where available.

- on Windows, encode the Unicode environment to bytes using ISO-8859-1 if the server is Apache, or UTF-8 if the server is IIS. (IIS tries to decode path bytes using UTF-8, falling back to mbcs where the input is not valid UTF-8. Unfortunately there is no way to tell this has happened.)

- when server is Microsoft-IIS, remove the erroneously repeated SCRIPT_NAME components from the front of PATH_INFO. (This is a long-standing bug that can be configured away using the allowPathInfo/AllowPathInfoForScriptMappings configs, but no-one does as it breaks ASP.)

However, the form layer is not really the right place to be doing these hacks. It would be better done in the stdlib CGI handler.

> Servers with REQUEST_URI can at least attempt to reconstruct the encoded values.

This is slightly unsafe. It's something an application might want to do (or at least provide as an option), but a gateway probably couldn't get away with it for the general case because REQUEST_URI doesn't reflect the redirections done by a RewriteRule or an ErrorDocument.

> Cookie is also the one header that can't be safely folded.

There are others, eg. Authorization. Anyway: folding doesn't happen in the HTTP world. It can be forgotten about.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
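The IIS fixup in the last bullet (removing the repeated SCRIPT_NAME from the front of PATH_INFO) might look roughly like this. This is not the 'fixPathInfo' code referred to in the message, which is not shown; strip_script_name_prefix is a hypothetical reconstruction, and it cannot distinguish the bug from a genuine path that happens to begin with the script name.

```python
def strip_script_name_prefix(environ):
    """Return PATH_INFO with a duplicated SCRIPT_NAME prefix removed on IIS.

    Hypothetical sketch of the workaround described above.
    """
    server = environ.get("SERVER_SOFTWARE", "")
    script = environ.get("SCRIPT_NAME", "")
    path = environ.get("PATH_INFO", "")
    if server.startswith("Microsoft-IIS") and script and path.startswith(script):
        rest = path[len(script):]
        # Only strip when the prefix ends on a path-segment boundary.
        if rest == "" or rest.startswith("/"):
            return rest
    return path


iis_environ = {
    "SERVER_SOFTWARE": "Microsoft-IIS/7.5",
    "SCRIPT_NAME": "/app.py",
    "PATH_INFO": "/app.py/items/1",
}
apache_environ = dict(iis_environ, SERVER_SOFTWARE="Apache/2.2", PATH_INFO="/items/1")
```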
Re: [Web-SIG] PEP 444 (aka Web3)
On 09/17/2010 02:03 PM, Armin Ronacher wrote:

> In case we change the spec as Ian mentioned above, I am all for a wsgi.guessed_encoding = True flag or something like that.

Yes, I'd like to see that. I believe going with *only* a raw-or-reconstructed path_info, rather than having both path_info and PATH_INFO, is probably best, for the middleware-duplication reasons PJE mentioned.

A more in-depth possibility might be:

wsgi.path_accuracy =

0: script_name/path_info have been crudely reconstructed from SCRIPT_NAME/PATH_INFO from an unknown source. Beware! If there is to be backwards compatibility with WSGI1, this would be seen as the 'default value' given a missing path_accuracy.

1: script_name/path_info have been reconstructed, but it is known that path_info is accurate, other than %2F and non-ASCII issues. That is, it's known that the path doesn't come from IIS's broken PATH_INFO, or the IIS error has been detected and compensated for.

2: script_name/path_info have been reconstructed using known-good encodings for the env. The only way in which they may differ from the original request path is that a slash might originally have been a %2F. (This is good enough for the vast majority of applications.)

3: script_name/path_info come directly from the request path without any intervening mangling.

> Unless I am mistaken, the same is true for CGI scripts running on Apache2 on Windows.

Yes, it's true of *all* CGI scripts, but also for non-CGI scripts on IIS.

> I did some tests a while ago and was pretty sure that Apache2 on Windows did the same.

Apache-on-Windows puts the bytes of the decoded path into the environment variables as one code unit per byte: that is, as if encoded by ISO-8859-1. You still have to read the environ using ctypes because mbcs is never ISO-8859-1, but at least the original bytes are recoverable, which isn't the case with IIS.

> The correct place for these hacks would be the appropriate WSGI/Web3 handler of the webserver.

The IIS PATH_INFO-prefix hack would be appropriate to put in an IIS-specific handler; indeed, I believe isapi_wsgi does just that.

But the other hacks are specific to CGI. For CGI, there is no 'handler of the webserver', there is only the standard CGI-to-WSGI adapter, so this is the only component it is reasonable to burden with the hacks.

Frameworks and libraries further up the stack cannot reliably do the fixups, because they don't know whether the WSGI environ they have been given comes from os.environ or somewhere else, or whether middleware has played with it.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
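To show how an application might consume such a flag, here is a sketch built on the wsgi.path_accuracy levels proposed above. The key was only a proposal in this thread and is not part of any published WSGI specification; link_for and the URL layout are invented for illustration.

```python
from urllib.parse import quote


def link_for(environ, title, ident):
    """Pick a URL style based on the proposed (hypothetical) accuracy flag."""
    # A missing key is treated as level 0, matching the suggested default.
    accuracy = environ.get("wsgi.path_accuracy", 0)
    if accuracy >= 2:
        # Level 2 or 3: non-ASCII path segments will round-trip reliably,
        # so a readable, percent-encoded UTF-8 title is safe to emit.
        return "/page/" + quote(title, safe="")
    # Otherwise fall back to an ASCII-only URL the app can always parse.
    return "/page/%d" % ident
```

An application could use the same check at startup to decide whether to warn the deployer, rather than shipping a page of server-specific notes.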
Re: [Web-SIG] PEP 444 (aka Web3)
On 09/16/2010 02:05 AM, P.J. Eby wrote: note that the spec's sample CGI implementation does not itself provide the new variables It can't: This is the original URL-encoded value derived from the request URI. If the server cannot provide this value, it must omit it from the environ. A CGI gateway doesn't have access to the original URL-encoded value. middleware must be explicitly written to handle the case where there is duplication. The alternative to duplication would be to allow a gateway to try to 'reconstruct' `path_info` from CGI `PATH_INFO`. If this is done there really needs to be a flag somewhere to say that it has been done, ie. that `/` and non-ASCII characters in the path are unreliable. Otherwise we're just going to end up in the same sorry situation we have today where all sorts of different encodings and corruptions lurk inside PATH_INFO and apps simply cannot rely on it. chr...@plope.com wrote: The most sensible thing to me would be to put it in PATH_INFO. Please don't have a field with encoded semantics that re-uses the name of a field that has always had decoded semantics. -- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] WSGI for Python 3
On 07/14/2010 06:43 AM, Ian Bicking wrote: There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO, and HTTP_COOKIE. (And of those, PATH_INFO is the only one that really matters, in that no-one really uses non-ASCII script filenames, and non-ASCII characters in Cookie/Set-Cookie are still handled so differently/brokenly across browsers that you can't rely on them at all.) * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them exclusively with encoded versions For compatibility with existing apps, how about keeping the existing SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying that the new 'raw' versions (whatever they are called) are added only if they really are raw, not reconstructed. Then existing scripts that don't care about non-ASCII and slashes can carry on as before, and for apps that do care about them, they'll be able to be *sure* the input is correct. Or they can fall back to PATH_INFO when not present, and avoid producing these kind of URLs in response. (Or an app might have enough special knowledge to try other fallback mechanisms when the raw versions are unavailable, such as REQUEST_URI or Windows ctypes envvar hacking. But if the server/gateway has good raw paths it shouldn't bother use these.) -- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
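The "use the raw version if present, else fall back to PATH_INFO" approach could be sketched as follows. The environ key x-wsgiorg.raw_path_info is purely hypothetical (the thread never settled on a name), and decoding segments as UTF-8 is an application-level assumption.

```python
from urllib.parse import unquote_to_bytes


def path_segments(environ):
    """Split the request path into decoded segments, preferring a raw path."""
    raw = environ.get("x-wsgiorg.raw_path_info")  # hypothetical key
    if raw is not None:
        # Split on literal slashes first, then unquote each segment, so an
        # encoded %2F stays inside its segment.
        return [unquote_to_bytes(seg).decode("utf-8", "replace")
                for seg in raw.split("/")[1:]]
    # Fallback: PATH_INFO is already unquoted, so %2F and "/" can no
    # longer be told apart.
    decoded = environ.get("PATH_INFO", "").encode("iso-8859-1").decode("utf-8", "replace")
    return decoded.split("/")[1:]


with_raw = path_segments({"x-wsgiorg.raw_path_info": "/a%2Fb/caf%C3%A9"})
without_raw = path_segments({"PATH_INFO": b"/a/b/caf\xc3\xa9".decode("iso-8859-1")})
```

With the raw value the segments come out as ['a/b', 'café']; with only PATH_INFO the same request yields ['a', 'b', 'café'], which is why an app in that situation would want to avoid generating such URLs.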
Re: [Web-SIG] WSGI for Python 3
On 07/16/2010 12:07 PM, Graham Dumpleton wrote:

> If you do that, one has to ask the question, given it is more convention than anything, why it isn't just a x-wsgiorg extension specification

Yes, fine by me either way. I just want to be able to say "this application can use Unicode paths when run on a server/gateway that supports standardised feature X", rather than the current mess of "you can have Unicode paths if you use one of the dozen different server-and-platform combinations we've specifically coded workarounds for".

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Draft PEP: WSGI 1.1
Dirkjan Ochtman wrote:

> 1. The application is passed an instance of a Python dictionary containing what is referred to as the WSGI environment. All keys in this dictionary are native strings. For CGI variables, all names are going to be ISO-8859-1 and so where native strings are unicode strings, that encoding is used for the names of CGI variables.

Perhaps explain where those ISO-8859-1 bytes might come from:

...are native strings. Where native strings are Unicode, any keys derived from byte-oriented sources (such as custom headers in the HTTP request reflected in the CGI environment variables) should be decoded using the ISO-8859-1 encoding.

> 3. For the CGI variables contained in the WSGI environment, the values of the variables are native strings. Where native strings are unicode strings, ISO-8859-1 encoding would be used such that the original character data is preserved and as necessary the unicode string can be converted back to bytes and thence decoded to unicode again using a different encoding.

Good. The only problem that remains with this is that in certain environments (notably: all IIS use, not just CGI) a WSGI gateway cannot fully comply with this requirement.

a. disallow environments that cannot be sure they are preserving the original byte data from declaring that they support wsgi.version 1.1?

b. add an extra wsgi.something flag for a WSGI server to add, to specify that it is sure that the original bytes have been preserved? (ie. so wsgiref's CGI handler would have to declare it wasn't sure when running under Windows.)

c. just let WSGI gateways silently ignore the ISO-8859-1 requirement if they can't honour it and let the application spend its time trying to unravel the mess (status quo).

(Can wsgiref be fixed to use ISO-8859-1 in time for Python 3.2?)

> 7. The iterable returned by the application and from which response content is derived, should yield byte strings. Where native strings are unicode strings, the native string type can also be returned in which case it would be encoded as ISO-8859-1.
>
> 8. The value passed to the 'write()' callback returned by 'start_response()' should be a byte string. Where native strings are unicode strings, a native string type can also be supplied, in which case it would be encoded as ISO-8859-1.

Weren't we going to only allow US-ASCII for the output? (These threads are always so far apart I can never remember what conclusion we reached... if any.)

Whilst ISO-8859-1 is in the HTTP standard for headers, and required to preserve bytes in input, it's much, much less likely that the response body is going to be ISO-8859-1. It could maybe be cp1252, but more likely the author wanted UTF-8.

If we must support Unicode strings for response body output at all, I'd prefer to be conservative here and spit a UnicodeEncodeError straight away, rather than quietly mangle characters U+0080 to U+00FF.

Manlio Perillo wrote:

> The run_with_cgi sample function should be changed, since it probably does not work correctly, on Python 3.x.

Yes, the 'URL Reconstruction' fragment will be wrong too, since it uses urllib.quote() to encode the path part. quote() defaults to UTF-8 rather than the ISO-8859-1 WSGI 1.1 requires.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
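The quote() problem can be seen directly in Python 3, where the function lives at urllib.parse.quote and accepts an encoding argument. The sketch below assumes a client that sent "/café" as UTF-8 bytes and a server that exposes them as an ISO-8859-1 str, as the draft requires.

```python
from urllib.parse import quote

# Request bytes for "/café" in UTF-8, exposed as an ISO-8859-1 native string.
path_info = b"/caf\xc3\xa9".decode("iso-8859-1")

# Default behaviour: quote() encodes the str as UTF-8 before escaping,
# so each original byte >= 0x80 is escaped as two bytes (double encoding).
wrong = quote(path_info)

# Telling quote() to use ISO-8859-1 maps each character back to the
# single original byte, reproducing the URL the client actually sent.
right = quote(path_info, encoding="iso-8859-1")
```

Here wrong is "/caf%C3%83%C2%A9" and right is "/caf%C3%A9".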
Re: [Web-SIG] CGI WSGI and Unicode
Manlio Perillo wrote: In a CGI application, HTTP headers are Unicode strings, and are decoded using system default encoding. In a future WSGI application, HTTP headers are Unicode strings, and are decoded using latin-1 encoding. Yes. As proposed, WSGI 1.1 would require CGI-to-WSGI handler to undo the decode stage caused by reading environ using the default encoding. At least this is now reliably possible thanks to surrogateescape. PATH_INFO is the only really important HTTP-related environment variable for Unicode. Potentially SCRIPT_NAME could also be significant in relation to PATH_INFO. The HTTP headers don't massively matter because there are almost never any non-ASCII characters in them. Previously the job of undoing an unwanted decode step was dumped on whatever read the PATH_INFO; usually a routing component, which would have to make guesses with typically poor results. The CGI adapter is in a much better place to do it, being closer to the server. The problem is that not all browsers use latin-1. Not WSGI's problem. WSGI will deliver bytes encoded into Unicode strings, not ready-to-use Unicode strings. It is up to the application to decide how they want to handle those bytes; maybe they want Latin-1 and can do nothing, maybe they want to recode to UTF-8, maybe something else completely. No solution satisfies every app so there is always going to have to be a recode step somewhere. An application that doesn't want to think about this will use a framework that does it for them. What about HTTP_COOKIE? For what it's worth, the choice of Latin-1 here results in the 'right' Unicode string for more browsers than any other potential encoding. In any case as previously discussed, non-ASCII cookies are already totally broken everywhere and hence used by no-one. 
-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
Manlio Perillo wrote:

> However what about URI (that is, for PATH_INFO and the like)? For URI (if I remember correctly) the suggested encoding is UTF-8, so URLS should be decoded using url.decode('utf-8', 'surrogateescape') Is this correct?

The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be trivially extracted. This is consistent with the other headers and would be my preferred approach.

Python 3.1's wsgiref.simple_server, on the other hand, blindly uses urllib.unquote, which defaults to UTF-8 without surrogateescape, mangling any non-UTF-8 input.

I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding is blessed. But *something* needs to be blessed. An encoding, an alternative undecoded path_info, both, something else... just *something*.

> Let's consider the `wsgiref.util.application_uri` function There is a potential problem, here, with the quote function.

Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 3.0, but still broken. Until we can come to a Pronouncement on what WSGI *is* in Python 3, it is meaningless anyway.

> Cookie data SHOULD be transparent to the server/gateway; however WSGI is going to assume that data is encoded in latin-1.

Yeah. This is no big deal because non-ASCII characters in cookies are already broken everywhere(*). Given this and other limitations on what characters can go in cookies, they are habitually encoded using ad-hoc mechanisms handled by the application (typically a round of URL-encoding).

*: in particular:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8), mangling any characters that don't fit in the codepage through the traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1 gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.

> I don't know what the HTTP/Cookie spec says about this.

The traditional interpretation of RFC2616 is that headers are ISO-8859-1. You will notice that no browser correctly follows this. ...sigh.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] HTTP headers encoding
Manlio Perillo wrote:

> I have written a simple WSGI application that asks authentication credentials

Ho ho! This is another area that is Completely Broken Everywhere. It's actually a similar situation to the cookies:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8), mangling any characters that don't fit in the codepage through the traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1 gets through but everything else is mangled)
- Safari uses ISO-8859-1, and refuses to send any cookie containing characters outside the 8859-1 repertoire.
- Konqueror uses ISO-8859-1, and replaces any non-8859-1 character with a question mark.

The HTTP standard has nothing to say about the encoding in use *inside* the base64-encoded Authorization byte-string token. It's anyone's guess, and every browser has guessed differently. (Safari here is at least slightly better than its behaviour with the cookies.)

> (and I suspect that [IE] always use this encoding, instead of iso-8859-1).

It will certainly never send ISO-8859-1, but what it does send is locale dependent. Type an e-acute in your username on a Western machine and it'll send one byte sequence; type the same thing on an Eastern European Windows install and you'll get something quite different.

> Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac' I don't know where \xac come from

It's the low byte of UCS-2 codepoint U+20AC (EURO SIGN). Firefox simply discards the top 8 bits of each codepoint.

> Unfortunately I can not test with IE 7 and 8.

The behaviour has not changed.

> This is really a mess.

Isn't it.

> How is authorization username handled in common WSGI frameworks?

No-one supports non-ASCII characters in Authentication. Most web authors simply move to cookies instead.
-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
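The differing byte sequences described for a euro sign can be reproduced with a few lines of Python. This only models the behaviours as reported in the message above; the function name is invented, cp1252 is assumed as the example "Western" Windows codepage, and nothing here was checked against real browsers.

```python
def low_byte_of_each_code_unit(text):
    """Model the reported Firefox behaviour: keep the low 8 bits per UTF-16 unit."""
    # In little-endian UTF-16 every code unit is (low byte, high byte),
    # so taking every second byte keeps only the low bytes.
    return text.encode("utf-16-le")[0::2]


euro = "\u20ac"

as_utf8 = euro.encode("utf-8")                          # Opera/Chrome, as reported
as_cp1252 = euro.encode("cp1252")                       # IE on a Western codepage (assumed)
as_low_byte = low_byte_of_each_code_unit(euro)          # Firefox, as reported
as_latin1_or_qmark = euro.encode("iso-8859-1", "replace")  # Konqueror, as reported
```

A server receiving the Authorization token has no way to tell which of these four interpretations produced the bytes, which is the mess the thread is lamenting.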
Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
Manlio Perillo wrote: Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself specifically denies that an encoded-word can go in a quoted-string. RFC2047 encoded-words are not on-topic in an HTTP header(*); this has been confirmed by newer development work on HTTPbis by Reschke et al. (http://tools.ietf.org/wg/httpbis/). The correct way of escaping header parameters in an RFC*822-family protocol would be RFC2231's complex encoding scheme, but HTTP is explicitly not an 822-family protocol despite sharing many of the same constructs. See http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a strategy for how 2231 should interact with HTTP, but note that for now RFC2231-in-HTTP simply does not exist in any deployed tools. So for now there is basically nothing useful WSGI can do other than provide direct, byte-oriented (even if wrapped in 8859-1 unicode strings) access to headers. -- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
Graham Dumpleton wrote:

> Answering my own question, it is actually obvious that it has to be called (1, 0). This is because wsgiref in Python 3.X already calls it (1, 0) and don't have much choice to be in agreement with that.

wsgiref.simple_server in Python 3 to date is not something that anyone should worry about being compatible with. It is a 2to3 hack that cannot meaningfully claim to represent wsgi version anything.

Careless use of urllib.parse.unquote causes 3.0's simple_server not to work at all, and 3.1's to mangle the path by treating it as UTF-8 instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even mod_cgi via wsgiref.CGIHandler) delivered.

Yes, I'm always going on about Unicode paths. I'm fed up of shipping apps with a page-long deployment note about fixing them. It pains me that in so many years both this and "What do we do about Python 3?" still haven't been addressed. mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would prefer not to see more farcical reverse-progress at this point.

For what it's worth my responses on the issues of this thread. But at this point I really just want a BDFL to just come and do it, whatever it is. A new WSGI, whatever the version number, is massively overdue.

> 1. The 'readline()' function of 'wsgi.input' may optionally take a size hint.

Yes. Obviously. Bad practice but unavoidable now. Should have been a 1.0 amendment a long time ago.

> 2. The 'wsgi.input' must provide an empty string as end of input stream marker.
>
> 3. The size argument to 'read()' function of 'wsgi.input' would be optional and if not supplied the function would return all available request content.
>
> 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour the Content-Length response header and must only return from the file that amount of content.

+0. Seems reasonable but don't massively care. Presumably an application must refuse to run on 1.0 if it requires these behaviours?

> 5. Any WSGI application or middleware should not return more data than specified by the Content-Length response header if defined.
>
> 6. The WSGI adapter must not pass on to the server any data above what the Content-Length response header defines if supplied.

Yes.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
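Point 6 (the adapter must not pass on more data than Content-Length declares) amounts to truncating the application's iterable. A minimal sketch follows; limit_to_content_length is a made-up name and not how any particular server implements it.

```python
def limit_to_content_length(chunks, content_length):
    """Yield bytes from chunks, stopping after content_length bytes in total."""
    remaining = content_length
    for chunk in chunks:
        if remaining <= 0:
            break
        if len(chunk) > remaining:
            # Cut the final chunk so the total matches the header exactly.
            chunk = chunk[:remaining]
        remaining -= len(chunk)
        yield chunk


body = list(limit_to_content_length([b"hello", b"world", b"!"], 7))
```

With a declared length of 7, body ends up as [b"hello", b"wo"] and the trailing data is never sent.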
Re: [Web-SIG] Future of WSGI
Ian Bicking wrote: The proposal that seemed to work best was to keep the environ as str (i.e., unicode in Python 3), and eliminate the problematic SCRIPT_NAME and PATH_INFO, replacing them with url-encoded values. Ah, OK, if that's where we got to I'm happy with that - as long as the application/framework can tell the difference between (a) old-school WSGI 1.0 decoded PATH_INFO, (b) new verbatim PATH_INFO, and (c) a new verbatim PATH_INFO that has been created from an old PATH_INFO by a WSGI handler unfortunate enough to be running under CGI or IIS, potentially including mangled characters. I would prefer to avoid the latter completely. This could be achieved by giving the new variables a different name and only including them if they're safe (leaving the application to fall back to the old variables where unavailable), or by having a flag to specify they're verbatim and leaving it unset when unmangled verbatim is unavailable. Also I think everyone is okay with removing start_response. +0.5: very much happy to see it gone, but if it causes any more delay to a WSGI update I'm also not unhappy if it stays. My primary concern is that a Python-3-compatible WSGI is available as soon as possible; every long argument in here seems to lead to no resolution. I want to release Python 3 web code, and cannot whilst WSGI remains in flux. Whilst in principle I kind of agree with Malthe that keeping the CGI-derived environ separate from items like wsgi.input would be appropriate, in practice I don't give a stuff about it: the merged dictionary causes no practical problems, and changing it would be an enormous upheaval for all WSGI users. WSGI doesn't need to be pretty, it needs to be widely-compatible. Authors who want pretty can use frameworks, which will be happy to deliver elegant Request and Response objects. 
-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO
Ian wrote:

> some environments provide only the unquoted path. I think it's not terribly horrible if they fake it by re-quoting the path.

If they are faking it, there should IMO be a way to flag that it's faked. Then an application that uses IRIs may choose to

a. generate an error, or
b. carry on, don't care about %2F and just hope the encodings match, or
c. fall back to outputting only ASCII URLs.

> I also believe you can safely reconstruct the real SCRIPT_NAME/PATH_INFO from REQUEST_URI, which is usually available

I wouldn't say 'usually', REQUEST_URI is Apache-specific. I haven't checked other servers to see if they copy it, but IIS certainly doesn't.

SCRIPT_NAME/PATH_INFO can differ completely from REQUEST_URI when Apache has done some internal redirection, for example via mod_rewrite or ErrorDocument. It's certainly useful as a fixup possibility (several of my apps optionally use it), but not something that can really be relied upon.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO
Ian Bicking wrote:
> I propose we switch primarily to native strings: str on both Python 2 and 3.

Fine.

> wsgi.input remains byte-oriented, as does the response app_iter.

Good.

> These both form the original path. It is not URL decoded, so it should be ASCII.

Great! BUT. Undecoded script_name/path_info *cannot* be provided by some gateways: primarily, but not only, CGI. Such a gateway can reconstruct what it thinks the undecoded versions should have been, but this is not reliably accurate.

I would like a way (eg. a flag) for the gateway or server to specify to the application that script_name/path_info are potentially inaccurate. Then the application can react by avoiding IRI (and %2F, though arguably that should be avoided anyway).

> This sends different text, but is highly preferable.

Yes. All schemes to send non-ASCII in cookies require sending different text; URL-encoding is a common choice of ad-hoc wrapping. I don't think WSGI has to worry too much about explaining this, it's a known hazard of the web in general. It doesn't work in any other environment, so nobody should be expecting it to work in WSGI.

> What happens if you give unicode text in the response headers that cannot be encoded as Latin1?

UnicodeEncodeError.

> Should some things be unicode on Python 2?

Don't think so.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Getting back to WSGI grass roots.
Graham wrote:
> So, rather than throw away completely the idea of bytes everywhere, and rewrite the WSGI specification, we could instead say that the existing conceptual idea of WSGI 1.0 is still valid, and just build on top of it a translation interface to present that as unicode.

I don't think we really need to. Almost nothing in WSGI itself actually touches Unicode.

HTTP headers may in theory be ISO-8859-1 (and certainly should be handled as such), but in the real world they are exclusively ASCII (anything else breaks browsers). SCRIPT_NAME/PATH_INFO is the only part of the spec that potentially needs more than ASCII, and even then the majority of apps don't put any Unicode characters in those (especially SCRIPT_NAME). I don't think it's worth adding the complication of a two-layer interface just for this one case.

If we can hack around SCRIPT_NAME/PATH_INFO separately as per the other thread there's no longer any need for anything but ASCII, so WSGI's strings can be bytes or unicode depending on your taste/Python-version, without it hurting anyone.

The important job of mapping

* query parameters,
* POSTed request bodies, and
* response bodies

between bytes and unicode remains firmly in the application/framework's area of concern and needs no assistance from WSGI.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
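To illustrate the point that query parameters can be decoded entirely on the application side, here is a small sketch using the standard library's urllib.parse.parse_qs. The helper name query_params and its default encoding are assumptions of this example, not part of WSGI.

```python
from urllib.parse import parse_qs


def query_params(environ, encoding='utf-8'):
    """Decode QUERY_STRING using the encoding the application expects."""
    # QUERY_STRING is still %-encoded ASCII, so the choice of encoding is
    # made here, by the application, not by the server.
    raw = environ.get('QUERY_STRING', '')
    return parse_qs(raw, keep_blank_values=True,
                    encoding=encoding, errors='replace')
```

An application that expects a legacy charset would pass, say, encoding='iso-8859-1' for the relevant URLs.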
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Alan Kennedy wrote: Why can't we just do the same as the java servlet spec? Because Servlet is a walking, stinking demonstration of how *not* to handle encodings. Every servlet container has its own different method of selecting input character sets, and the default encoding is almost never right. Most deployed JSP applications out there are using the wrong charset and do the wrong thing with any non-ASCII character. This is not something to aim for. Pushing the choice of encodings out to a 'deployment issue' where the application doesn't get to decide is a Wrong Thing. I hate dealing with this nonsense in Java and I do not want the same approach to become common in Python. I see this as being the same as Graham's suggested approach of a per-server configurable charset This is absolutely the opposite of what I want as an application author. I want to hand out my WSGI application that uses UTF-8 and know that wherever it is deployed the non-ASCII characters will go through without getting mangled. The application (perhaps via a framework it is using) is the party that is in the best place to know what character encoding it wants to deal with. Give the application a consistent way to handle that encoding itself, because the poor sod deploying it isn't going to know any better. Those frameworks obviously have solved all of the problems of decoding incoming request components from miscellaneous unknown character sets into unicode, with out any mistakes Er, no. That's the point. It cannot currently be done in all deployment environments. When they're not running via their own built-in servers, the frameworks have to do the same as the rest of us: guess. That guess may not be as troublesome as it is in Java (mainly because for us it doesn't affect QUERY_STRING parameters), but it's still not reliable, which is why today you can't have a WSGI application with pretty non-ASCII URLs that will deploy consistently. I want this fixed. 
-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Graham wrote:
> Armin is fast asleep now, so my shift.

Heh. It's a multiple-man job keeping up with this monster thread!

> The URLs don't break.

Not in themselves. Just the language of the PEP implies that to fix them up would contravene the spec:

    The application MUST use [the encoding guess for PATH_INFO] to decode the ``'QUERY_STRING'`` as well.

This isn't appropriate even as a SHOULD: the guessed encoding for PATH_INFO is very likely to be wrong, in particular for cases where the path was purely ASCII. The application (or a library/framework acting on its behalf) should be allowed to decode QUERY_STRING using whatever encoding it is expecting. Disallowing using anything other than utf-8 (and iso-8859-1 in a very unreliable way) makes it impossible to have queries in any other encoding at all and still comply with the spec, which is undesirable.

If this sentence is removed, and `wsgi.uri_encoding` is guaranteed to be one of:

a. definitive and reliable, or
b. missing/None

I'm pretty much happy. What I don't want is that half the future-WSGI servers/gateways decide they have to provide *some* value for `wsgi.uri_encoding` even if they're not quite sure if it's the right one. Then we're back to square one.

> if it is known that an application or some subset of URLs will always be receiving a request as non UTF-8, then it should employ code in those cases to always transcode it to the required encoding.

Yep, agreed. I think the PEP should clarify that; at the moment it is saying that a transcode is something you should only do for the iso-8859-1 case, but if you actually followed that advice you'd get highly inconsistent results. Perhaps we're at cross-purposes as to what exactly constitutes 'middleware'...

> The other fallback is that a specific WSGI server could elect to provide an option to not use 'UTF-8' as the first choice for decoding

I really, *really* hope this does not happen. That just brings us more deployment heartaches.

> Whether surrogateescape gives a better solution I have no idea at this point

Yeah... I'm highly suspicious of surrogateescape in a web context and personally my code will be deliberately filtering all such characters out. I can see it being a possible way to smuggle unwanted sequences (such as overlongs) through filters, potentially causing endless security problems. But we'll see...

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
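A rough sketch of the kind of filtering mentioned above, assuming Python 3's surrogateescape error handler, which maps each undecodable byte 0x80-0xFF to a lone surrogate in U+DC80-U+DCFF. The function names are made up for this example.

```python
def has_escaped_bytes(text):
    """True if text carries lone surrogates of the kind surrogateescape emits."""
    return any('\udc80' <= ch <= '\udcff' for ch in text)


def drop_escaped_bytes(text):
    """Remove every surrogateescape'd byte, keeping only real characters."""
    return ''.join(ch for ch in text if not ('\udc80' <= ch <= '\udcff'))
```

An overlong sequence such as b'\xc0\xaf' is invalid UTF-8, so under surrogateescape both bytes arrive as lone surrogates and are caught by this check.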
Re: [Web-SIG] Request for Comments on upcoming WSGI Changes
Armin Ronacher wrote:
> The middleware can never know.

It's much more likely to know than the server though!

> WSGI will demand UTF-8 URLs and only provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs break as soon as they coincidentally happen to be UTF-8 byte sequences. I'm as much an advocate of 'UTF-8 for everything everywhere!' as anyone else, but unfortunately today there are still dark places where you need non-UTF-8 URLs.

Incidentally, if wsgi.uri_encoding is going to be the way to signal that the server has decoded bytes to characters using a known encoding, it should be stressed that this should only be set when that encoding is certain. That is, wsgi.uri_encoding should be omitted (or None?) in cases where another party has already decoded (and maybe mangled) the bytes using an unknown encoding. In particular, CGI.

(In the case of Windows CGI the server will have decoded URI bytes into Unicode characters, using a charset which it is impossible to find out. In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF-8 sequence, otherwise it's the system codepage. This problem affects the non-CGI implementation isapi_wsgi, too. Then the variables are read as environment variables, which for Python 2 means another encode/decode step on Windows using the system codepage, mangling non-codepage characters. Python 3 has the opposite problem reading byte envvars using UTF-8, which won't be how Apache put them there.)

If wsgi.uri_encoding is obligatory then in reality it will often be wrong, leaving us in the same pathetic predicament as with WSGI 1.0, where non-ASCII URIs don't work reliably at all.

-- And Clover mailto:a...@doxdesk.com http://www.doxdesk.com/
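A sketch of how an application might consume `wsgi.uri_encoding` under the semantics argued for above: present only when certain, otherwise missing or None. The helper name path_info_as and the choice to return None for "unknown" are assumptions of this example, not anything the PEP draft specifies.

```python
import codecs


def path_info_as(environ, wanted='utf-8'):
    """Return PATH_INFO transcoded to `wanted`, or None if the decoding is unknown."""
    declared = environ.get('wsgi.uri_encoding')
    if declared is None:
        return None             # unknown decoding: caller must use a fallback
    path = environ['PATH_INFO']
    # Compare canonical codec names so 'utf8' and 'UTF-8' count as equal.
    if codecs.lookup(declared).name == codecs.lookup(wanted).name:
        return path
    # Undo the server's decoding, then decode the way the app wants.
    # This raises UnicodeDecodeError if the bytes are not valid in `wanted`.
    return path.encode(declared).decode(wanted)
```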
Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets
ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...) Hmm... it turns out: no. IIS appears to be mangling characters that are not in mbcs even *before* it puts the decoded value into the envvars. The same is true with isapi_wsgi, which is the only other WSGI adapter I know of for IIS. This gets the same mangled byte string from GetServerVariable as Python gets from the envvars, so it looks like this is a mistake IIS is making further up before it even hits the CGI handler. Maybe someone more familiar with ISAPI knows a better way to read PATH_INFO than GetServerVariable, but I can't see anything promising in MSDN. So it would seem to be impossible at the moment to have Unicode paths work under IIS at all. The ctypes approach could rescue bytes for the Apache/nt/Py2 combination (perhaps also from libc.getenv for Apache/posix/Py3), but then Apache already gives us REQUEST_URI which is a much easier workaround. There might be CGI servers for Windows where ctypes could serve some purpose, but I can't think of any currently in use other than the Big Two. 
In summary, to get the original submitted byte strings for PATH_INFO:

Apache/nt/Py2
    process REQUEST_URI
Apache/posix/Py2
    use PATH_INFO directly (or process REQUEST_URI)
Apache/nt/Py3
    encode PATH_INFO to ISO-8859-1 (or process REQUEST_URI)
Apache/posix/Py3
    process REQUEST_URI
IIS/nt/Py2
    decode PATH_INFO from mbcs, then encode to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
IIS/nt/Py3
    encode PATH_INFO to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
wsgiref.simple_server/Py2
    use PATH_INFO directly
wsgiref.simple_server/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input
cherrypy.wsgiserver/Py2
    use PATH_INFO directly
cherrypy.wsgiserver/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
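The Python 3 rows that say "encode PATH_INFO to ..." can be sketched as a lookup table. The server labels used as dictionary keys are invented for this example, and real code would still need some way of sniffing which server it is running under.

```python
# Encoding each server is described above as having used to turn the path
# bytes into a str. The keys are labels made up for this sketch.
SERVER_DECODING = {
    'apache-nt': 'iso-8859-1',
    'iis-nt': 'utf-8',
    'wsgiref': 'utf-8',
    'cherrypy': 'utf-8',
}


def recover_path_bytes(path_info, server):
    """Re-encode PATH_INFO with the server's decoding to get the bytes back."""
    # Raises KeyError for unlisted servers and UnicodeEncodeError when the
    # string contains characters the server's encoding could not have produced.
    return path_info.encode(SERVER_DECODING[server])
```

Apache/posix/Py3 is deliberately absent, since the summary says to process REQUEST_URI there instead.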
Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets
Mark Hammond wrote:
> I don't think Python explicitly converts it - the CRT's ANSI version of environ is used

Yes, it would be the CRT on Python 2.x. (Python 3.0 on non-NT does a conversion always using UTF-8, if I'm reading convertenviron right.)

> so the resulting strings should be encoded using the 'mbcs' encoding. What mangling do you see?

Correct, it's characters unencodable in mbcs that are lost*. mbcs is never equivalent to UTF-8 (which would allow us to recover characters on IIS) or ISO-8859-1 (which would allow us to recover characters on Apache-for-Windows) so there's always heavy lossage.

(* - replaced with ? or Windows's attempt to substitute something that looks vaguely like the original character.)

> win32api and ctypes would both let you call the Windows API.

Ah! I had considered the win32 extensions but it's a bit of a dependency... I'd forgotten that we get ctypes for free in 2.5. So we'd be looking at:

    ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)

when CPython 2.5+/NT is detected, right?

That increases the number of situations in which we can feasibly recover URIs that are valid UTF-8 sequences (modulo the slash anyway). Doing the actual recovery still requires some server-sniffing though.

> What is IIS doing wrong here?

It's not wrong as such. There are three reasonable choices for decoding header values before putting them in a Unicode environment, and the CGI spec, as it knows nothing about Unicode environment variables, fails to specify which:

1. ISO-8859-1 (which ensures bytes can be recovered)
2. UTF-8 (since most URIs are effectively UTF-8 today)
3. Configured system codepage (mbcs)

Apache [with mod_cgi or mod_wsgi] decides on (1). IIS tries for (2), falling back to (3) on invalid sequences.

The text concerning Python 3.0 in the WSGI Amendments page could be read as blessing Apache's behaviour. However wsgiref.simple_server currently also goes for (2), although that probably can't be considered canonical.

I'd be interested to know what other WSGI servers do.

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
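The difference between choice (1) and the IIS combination of (2) then (3) can be sketched as follows. The function names are illustrative, and cp1252 stands in for whatever the system codepage happens to be.

```python
def apache_style(raw):
    """Choice (1): ISO-8859-1 maps every byte to a code point, so nothing is lost."""
    return raw.decode('iso-8859-1')


def iis_style(raw, codepage='cp1252'):
    """Choice (2) with fallback to (3): try UTF-8, else use the system codepage."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode(codepage, 'replace')
```

With apache_style the original bytes come back via .encode('iso-8859-1'). With iis_style the application cannot tell afterwards which of the two decodings was applied.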
Re: [Web-SIG] Revising environ['wsgi.input'].readline in the WSGI specification
Ian Bicking wrote:
> To resolve this, let's just not pass it over this time?

+1

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets
Ian Bicking wrote: As it is (in Python 2), you should do something like environ['PATH_INFO'].decode('utf8') and it should work. See the test cases in my original post: this doesn't work universally. On WinNT platforms PATH_INFO has already gone through a decode/encode cycle which almost always irretrievably mangles the value. My understanding of this suggestion is that latin-1 is a way of representing bytes as unicode. In other words, the values will be unicode, but that will simply be a lie. Yes, that would be a sensible approach, but it is not what is actually happening in any WSGI environment I have tested. For example wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if it were working. (It is currently broken in 3.0rc2; I put a hack in to get it running but I'm not really sure what the current status of simple_server in 3.0 is.) A lot of what you write about has to do with CGI, which is the only place WSGI interacts with os.environ. CGI is really an aspect of the CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI spec itself. Indeed, but we naturally have to take into account implementability on CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 8859-1 decoding — or UTF-8, which is the other sensible option given that most URIs today are UTF-8 — then there cannot be a fully-compliant CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was first getting off the ground, but IMO it's still important. Personally I'm more inclined to set up a policy on the WSGI server itself with respect to the encoding, and then use real unicode characters. I think we are stuck with Unicode environ at this point, given the CGI issue. But applications do need to know about the encoding in use, because they will (typically) be generating their own links. So an optional way to get that information to the application would be advantageous. 
I'm now of the opinion that the best way to do this is to standardise Apache's ‘REQUEST_URI’ as an optional environ item. This header is pre-URI-decoding, containing only %-sequences and not real high bytes, so it can be decoded to Unicode using any old charset without worry. An application wanting to support Unicode URIs (or encoded slashes in URIs*) could then sniff for REQUEST_URI and use it in preference to PATH_INFO where available. This is a bit more work for the application, but it should generally be handled transparently by a library/framework and supporting PATH_INFO in a portable fashion already has warts thanks to IIS's bugs, so the situation is not much worse than it already is. And of course we get support through mod_cgi and mod_wsgi automatically, so Graham doesn't have to do anything. :-) Graham Dumpleton wrote: I can't really remember what the outcome of the discussion was. Not too much outcome really, unfortunately! You concluded: there possibly still is an open question there on how encoding of non ascii characters works in practice. We just need to do some actual tests to see what happens and whether there is a problem. ...to which the answer is — judging by the results posted — probably “yes”, I'm afraid! -- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/ ___ Web-SIG mailing list Web-SIG@python.org Web SIG: http://www.python.org/sigs/web-sig Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
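A minimal sketch of sniffing for REQUEST_URI and preferring it over PATH_INFO. The function name and the boolean "verbatim" result are assumptions of this example. The fallback branch also assumes the environ strings were decoded as ISO-8859-1, which not every server discussed here actually does.

```python
from urllib.parse import quote


def encoded_path(environ):
    """Return (percent-encoded path, verbatim?) for the current request."""
    uri = environ.get('REQUEST_URI')
    if uri:
        # Still %-encoded as the client sent it; just drop the query string.
        return uri.split('?', 1)[0], True
    # Fallback: re-quote the decoded variables. '%2F' and '/' can no longer
    # be told apart, so the result is flagged as not verbatim.
    full = environ.get('SCRIPT_NAME', '') + environ.get('PATH_INFO', '')
    return quote(full, encoding='iso-8859-1'), False
```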
Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets
Ian Bicking wrote:
> This is something messed up with CGI on NT, and whatever server you are using, and perhaps the CGI adapter (maybe there's a way to get the raw environment without any encoding, for example?)

Python decodes the environ to its own copy (wrapped in os.environ) at interpreter startup time; there's no way to query the real ‘live’ environment that I know of. It'd require a C extension.

> Honestly I don't know if anyone is doing anything with WSGI and Python 3.

I know Graham has done some work on mod_wsgi for 3.0, but no, I don't know anyone using it in anger.

Is it worth submitting patches to simple_server to make it run on 3.0? Is it too late to include at this stage anyway? Shipping 3.0 with a non-functional wsgiref is a bit embarrassing.

> I assume there is some way to get at the bytes in the environment, if not then that is a Python 3 bug.

There is not, and this appears to be deliberate.

> I think it might be feasible to support an encoded version of SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, and I don't know of any particular standard to base those names on), moving from the two keys to a single REQUEST_URI is not feasible.

That's certainly a possibility, but I feel it's easier to hitch a ride on the existing header, which despite being non-standard is still quite widely used.

> I guess you'd probably count segments, try to catch %2f (where the segments won't match up), and then double check that the decoded REQUEST_URI matches SCRIPT_NAME+PATH_INFO.

I'm currently testing with just the segment counting. It's only necessary that the segments from SCRIPT_NAME are matched and stripped, and those are extremely unlikely to contain ‘%2F’ because:

- there aren't many filesystems that can accept ‘/’ as a filename character. RISC OS is the only one I can think of, and it by convention swaps ‘/’ and ‘.’ to compensate as it is, so even there you couldn't use ‘%2F’;
- there aren't many webservers that can map a file or alias to a path containing ‘%2F’;
- no-one wants to mount a webapp alias at such a weird name: it's only in the section corresponding to PATH_INFO that ‘%2F’ might ever be of use in practice.

In the worst case, many applications already know and can strip the URL at which they're mounted, but unless there's a legitimate ‘%2F’ in their SCRIPT_NAME it doesn't actually matter.

> frankly IIS is probably less relevant to most developers than CGI.

Er... really? You and I may not favour it, but it's ≈35% of the world out there, not something we can afford to ignore IMO.

> So if IIS has problems with PATH_INFO, the WSGI adapter (be it CGI or otherwise) should be configured to fix those problems up front.

What I'm saying is that neither Apache's nor IIS's behaviour can be considered clearly correct or wrong at this point, and there is no way a WSGI adapter living underneath them *can* fix up the differences.

(There is a problem with PATH_INFO that a WSGI adapter *could* clear up, which is that IIS makes PATH_INFO the entire path including SCRIPT_NAME. I'm not sure whether it's worth fixing that up in the adapter layer though... it's possible some frameworks are already dealing with it, and might even be relying on it!)

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
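The segment-counting idea can be sketched roughly as below. This is not the code being tested in the message, just one way to do it, and it assumes SCRIPT_NAME is either empty or starts with '/' and contains no encoded slashes.

```python
def split_request_uri(request_uri, script_name):
    """Split REQUEST_URI into still-encoded (script_name, path_info) parts."""
    path = request_uri.split('?', 1)[0]     # drop the query string
    n = script_name.count('/')              # number of SCRIPT_NAME segments
    segments = path.split('/')              # segments[0] is '' before the first '/'
    script = '/'.join(segments[:n + 1])
    if len(segments) > n + 1:
        path_info = '/' + '/'.join(segments[n + 1:])
    else:
        path_info = ''
    return script, path_info
```

Because the split is on literal '/' only, a '%2F' inside the PATH_INFO part stays intact: split_request_uri('/app/a%2Fb/c', '/app') keeps 'a%2Fb' as one segment.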
[Web-SIG] WSGI Amendments thoughts: the horror of charsets
Apache on Windows always uses ISO-8859-1 to decode the request path and put it in the Unicode envvars. This is OK so far, we have Unicode characters with the same codepoints as the original bytes.

However, Python2 needs to make the envvars available as bytes. It uses the system default encoding; if that were ISO-8859-1, we'd be OK. But it never is. Western European on NT is actually cp1252, whose characters in the range 0x80 to 0x9F differ from ISO-8859-1. And if the app wants UTF-8, chances are those characters are going to come up a lot.

There is as far as I know no user-selectable Windows codepage that can map all the Unicode characters up to U+00FF.

*** Apache/NT/Python3

Wrong, but always recoverable.

Python retrieves the bytes-encoded-into-Unicode-codepoints string directly from the envvars. If the encoding should have been UTF-8 or something else other than ISO-8859-1, we can recover the original bytes by re-encoding to 8859-1, then decoding using the real charset.

*** IIS/NT/Python2

Mostly unrecoverable data loss.

IIS decodes submitted bytes to Unicode using UTF-8 when it can. But if there is an invalid UTF-8 sequence in the bytes it will try again using the system codepage.

Python will then re-encode the Unicode envvar using the system codepage. If the app is expecting UTF-8 we can decode what Python gives us using the system codepage (ie. 'mbcs') and get back any of the submitted characters that happened to be in this server's system codepage. Other characters may be replaced by question marks or Windows's best attempts to give us something useful, which at best may be a character shorn of diacriticals and at worst something just completely wrong.

NT's system codepage is never UTF-8; it is not a user-selectable option, never mind the default. We can improve our chances of getting more characters through by using a character set with a wide repertoire, such as cp932 (Shift-JIS). But it's still not really proper Unicode support.

If the app is expecting something non-UTF-8 there's not much hope. Even if it wanted the same character set as the system codepage, it can't be sure that the submitted bytes didn't happen to also be a valid UTF-8 sequence, and thus get mangled by IIS decoding them that way.

*** IIS/NT/Python3

OK, as long as the app wants UTF-8.

Incoming UTF-8 bytes are reliably converted to Unicode strings by IIS, and directly read by Python from the envvars. If the application didn't want UTF-8 the situation is about as hopeless as with Python2.

*** wsgiref.simple_server/(any)/Python2

OK. Bytes all the way through.

*** wsgiref.simple_server/(any)/Python3

Probably will be OK, as long as the app wants UTF-8.

simple_server is currently broken in rc2. However judging by the code, it is using urllib.parse.unquote, which assumes UTF-8, so it'll be fine for apps that want UTF-8 and hopeless for those that don't.

I'd be very interested to hear what other servers are doing in this situation - nginx? cherrypy's one? - and wonder if any particular behaviour should be 'blessed'.

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
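The Apache-on-Windows-with-Python-2 round trip described above can be approximated in a few lines. This is only a simulation: Python's own cp1252 codec with errors='replace' stands in for the Windows conversion, which may substitute best-fit characters rather than '?'.

```python
def survives_apache_nt_py2(char, system_codepage='cp1252'):
    """Simulate the round trip and report whether the UTF-8 bytes come back intact."""
    submitted = char.encode('utf-8')                    # bytes sent by the browser
    envvar = submitted.decode('iso-8859-1')             # Apache fills the Unicode envvar
    seen = envvar.encode(system_codepage, 'replace')    # Python 2 reads it as bytes
    return seen == submitted
```

Characters whose UTF-8 form includes a byte in 0x80-0x9F land on code points that Python's cp1252 codec cannot encode, so they are the ones that fail in this simulation.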
Re: [Web-SIG] problem with wsgiref.util.request_uri and decoded uri
Manlio Perillo wrote:
> On the other hand, if the WSGI gateway *do* decode the uri, I can no more handle '/' in uri.

Correct. CGI requires that '%2F' is decoded, and hence indistinguishable from '/' when it gets to the application. And WSGI inherits CGI's flaws for compatibility. request_uri is doing the right thing in assuming that if you got a '%40' in your PATH_INFO, it must originally have been a '%2540'.

It is an irritating limitation, but so far not irritating enough for an optional workaround to have made its way into non-CGI-based WSGI servers.

It may become a bigger irritation as we move to Py3K, and get stuck with decoded top-bit-set characters being turned into Unicode using the system encoding (which is likely to be wrong). Windows already suffers from similar problems as its environment variables are natively Unicode, and its system encoding is never UTF-8 (which is the most likely encoding for path parts).

> Where can I find informations about alternate encoding scheme?

It's easy enough to roll your own. For example htmlform uses a scheme of encoding path parts to '+XX' instead of '%XX'.

    import re

    # Characters allowed through unescaped; everything else becomes '+XX'.
    encode_re = re.compile('[^-_.!~*()\'0-9a-zA-Z]')
    # A '+' followed by exactly two hex digits.
    decode_re = re.compile(r'\+([0-9a-fA-F][0-9a-fA-F])')

    def encode(s):
        return encode_re.sub(lambda m: '+%02X' % ord(m.group()), s)

    def decode(s):
        return decode_re.sub(lambda m: chr(int(m.group(1), 16)), s)

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
Re: [Web-SIG] WSGI, Python 3 and Unicode
James Y Knight wrote: In addition, I know of nobody who actually implements RFC 2047 decoding of http header values...nothing really uses it. (of course I don't know of all implementations out there.) Certainly no browser supports it, which makes the point moot for WSGI. Most browsers, when quoting a header parameter, simply encode using the previous page's charset and put quotes around it... even if the parameter has a quote or control codes in it. Ian wrote: Is this all compatible with os.environ in py3k? In 3.0a2 os.environ has Unicode strings for both keys and values. This is correct for Windows where environment variables are explicitly Unicode, but questionable (IMO) for Unix where they're really bytes that may or may not represent decodeable Unicode strings. SCRIPT_NAME/PATH_INFO This already causes problems in Windows CGI applications! Because these are passed in environment variables, IIS* has to decode the submitted bytes to Unicode first. It seems always to choose UTF-8 for this job, which I suppose is the least bad guess, but hardly infallible. (* - haven't tested this with Apache for Windows yet.) In Python 2.x, os.environ being byte strings, Python/the C library then has to encode them back to bytes, which I believe ends up using the system codepage. Since the system codepage is never UTF-8 on Windows this means not only that the bytes read back from eg. PATH_INFO are not the same as the original bytes submitted to the web server, but that if there are characters outside the system codepage submitted, they'll be unrecoverable. If os.environ remains Unicode in Unix and WSGI follows it (as it must if CGI-invoked WSGI is to continue working smoothly), webapps that try to allow for non-ASCII characters in URLs are likely to get some nasty deployment problems that depend on the system encoding setting, something that will be particularly troublesome for end-users to debug and fix. 
OTOH making the dictionaries reflect the underlying OS's conception of environment variables means users of os.environ and WSGI will have to be able to cope with both bytes and unicode, which would also be a big annoyance. In summary: urgh, this is all messy and 'orrible. -- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/
Re: [Web-SIG] WSGI, Python 3 and Unicode
Adam Atlas [EMAIL PROTECTED] wrote:
> I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and unicode characters should be done at a higher level, by whatever web framework is living above WSGI.

-- And Clover mailto:[EMAIL PROTECTED] http://www.doxdesk.com/