Re: [Web-SIG] resources for porting wsgi apps from python 2 to 3

2012-10-02 Thread And Clover

On 01/10/12 18:07, chris.d...@gmail.com wrote:
 * Use bytes or str for environ keys?
 * Use bytes or str for environ values?

str, decoded from the request bytes using ISO-8859-1.

   * Are all environ values created equal or would, for example,
 QUERY_STRING's value (prior to any parameter to decoding)
 be handled differently from HTTP_COOKIE

All environ values are created equal (other than the CGI-mandated odd 
decoding behaviour of SCRIPT_NAME and PATH_INFO).


   * If str, I see that ISO-8859-1 is the assumed encoding. How much
 hurt occurs in the world if I just assume utf-8 when decoding to
 str[4]?

Immediately, all non-ASCII characters in the path would be interpreted 
incorrectly.


The more general hurt to the world would be that we would continue the 
sad pre-PEP situation where every web server handles non-ASCII 
characters differently, and so no WSGI application can reliably use 
Unicode in path segments.


There is little impact to any header other than the path, because 
non-ASCII characters almost never appear in them. The query string 
remains %-encoded so any non-ASCII characters are safe. The other places 
users can put non-ASCII characters are in cookies and HTTP Basic 
Authorisation headers, but browser support here is so variable/broken 
that Python's handling would be the least of your worries.


 [4] Which is what it should have been all along?

Not necessarily. Even if you decide that all web apps must use UTF-8 for 
text encoding, it's valid to have URL-encoded, non-text binary data in a 
path segment. This would be unrecoverable using straight UTF-8.


(They would be recoverable if surrogateescape were used, but PEP  
has to encompass language versions that don't have surrogateescape, and 
also it's questionable whether it should be possible to smuggle 
non-UTF-8 data into strings that applications assume are safe.)
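
A minimal sketch of the point about recoverability, assuming Python 3.1 or later (where the surrogateescape error handler exists). The byte string is an invented example of URL-decoded binary data in a path segment.

```python
# URL-decoded, non-text binary data in a path segment (invented example).
raw = b"/files/\xff\xfe"

# Straight UTF-8 decoding rejects it outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with errors="replace" succeeds but destroys the original bytes.
lossy = raw.decode("utf-8", "replace")

# surrogateescape smuggles each undecodable byte through as a lone surrogate,
# so the original bytes can be recovered by encoding the same way.
smuggled = raw.decode("utf-8", "surrogateescape")
recovered = smuggled.encode("utf-8", "surrogateescape")
```

The lone surrogates in `smuggled` are exactly the "non-UTF-8 data in strings that applications assume are safe" concern: encoding that str with plain UTF-8 raises UnicodeEncodeError.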


Plus header values are less likely to be UTF-8, and HTTP specifies that 
they're ISO-8859-1 (even if that is not well-observed by browsers).


Ideally, the interfaces should all be bytes, because HTTP is defined in 
terms of bytes. But that plays poorly with Python 3's default Unicode 
strs (for environ et al). So ISO-8859-1 was chosen as a str interface 
for which the original bytes can at least be recovered.
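
As an illustration of recovering the bytes, here is a small sketch. The helper name `recode_environ_value` is made up for this example, and it assumes the application has decided that its paths are UTF-8.

```python
def recode_environ_value(value, encoding="utf-8"):
    """Recover the request bytes from a WSGI str and decode them again.

    Hypothetical helper: ISO-8859-1 maps code points U+0000..U+00FF
    one-to-one onto bytes, so the encode step is lossless.
    """
    raw = value.encode("iso-8859-1")
    return raw.decode(encoding)


# What a compliant server does with the UTF-8 bytes of "/caf\u00e9":
wire_bytes = "/caf\u00e9".encode("utf-8")
path_info = wire_bytes.decode("iso-8859-1")   # looks mangled, but is lossless

# What the application does to get real text back:
text = recode_environ_value(path_info)
```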


 * Should start_response only accept bytes (and error if not), or
   should it also accept str and encode appropriately?

status and response_headers are, like the request headers, native str 
(to be ISO-8859-1 encoded). It's only the HTTP entity body that is 
always bytestring.
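
A minimal application sketch showing that split, assuming Python 3. The body text and header values are invented for illustration.

```python
def app(environ, start_response):
    # The entity body is always bytes; the application chooses its encoding.
    body = "caf\u00e9\n".encode("utf-8")

    # Status and headers are native str; the server encodes them
    # as ISO-8859-1 when writing them to the wire.
    status = "200 OK"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```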


 * Should the returned iterable be rejected or encoded if not bytes?

I don't think it's specified by the PEP, but wsgiref looks like it'll 
chuck TypeError when it tries to write str to the buffer/socket.


cheers,

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobi...@gmail.com
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] wsgiref 0.2 dev in svn w/PEP 3333 support

2010-10-09 Thread And Clover

On 10/06/2010 07:21 PM, P.J. Eby wrote:


How would these relate to the Python 3.2 release? Can you make 3.x and
2.x versions?


Yes, I have separate fixup code paths for 2.x and 3.x. 3.x faces the 
reverse situation to that previously described, in that os.environ is 
accurate on Windows but needs reverse-decoding on POSIX. Currently I use 
utf-8 and surrogateescape, but for Python 3.2 presumably os.environb 
will be the safer bet.
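
A rough sketch of the POSIX side of that fixup. The function names are invented, and the fallback assumes the interpreter decoded the environment as UTF-8 with surrogateescape, which depends on the locale. It is not meant for Windows, where os.environ is already accurate.

```python
import os


def environ_bytes(name, default=b""):
    """Return the raw bytes of an environment variable (POSIX sketch)."""
    environb = getattr(os, "environb", None)   # only present on POSIX, 3.2+
    if environb is not None:
        return environb.get(name.encode("ascii"), default)
    value = os.environ.get(name)
    if value is None:
        return default
    # Assumption: the environment was decoded as UTF-8 + surrogateescape.
    return value.encode("utf-8", "surrogateescape")


def wsgi_value(name):
    """Re-decode the raw bytes the way the WSGI str interface expects."""
    return environ_bytes(name).decode("iso-8859-1")
```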


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] Most WSGI servers close connections to early.

2010-09-27 Thread And Clover

On 09/22/2010 02:46 PM, Marcel Hellkamp wrote:

An application should read all available data from
`environ['wsgi.input']` on POST or PUT requests, even if it does not
process that data. Otherwise, the client might fail to complete the
request and not display the response.


Oh, it's worse than that. In practice the application needs to read all 
available data from the request body before producing output.


If you send too much response without reading the whole request body in 
some environments, you can deadlock. The web server is buffering the 
input stream for the request body and also the output stream from the 
app. This needs to be done[1] to avoid sending an HTTP response before 
the request is complete.


If those are limited-size buffers[2] and you fill the output buffer with 
response without clearing enough of the input buffer that the browser 
can finish sending the request, you'll be blocking indefinitely on write.


[1] possibly unless HTTP pipelining is in effect? not sure, haven't tested.

[2] and certainly in IIS they are. The output buffer is 8K IIRC. It's 
easy to overflow that and get a mysterious non-responsive script because 
an error happens and spits out a debugging page before the form-reading 
library has had a chance to consume the input.
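
One defensive pattern is to drain the request body before writing anything, for example at the top of an error handler. The sketch below assumes the server supplies CONTENT_LENGTH; the helper name and chunk size are arbitrary.

```python
import io


def drain_input(environ, chunk_size=8192):
    """Read and discard the rest of the request body; return bytes dropped."""
    try:
        remaining = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        remaining = 0
    stream = environ["wsgi.input"]
    discarded = 0
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:          # client sent less than it promised
            break
        remaining -= len(chunk)
        discarded += len(chunk)
    return discarded


# Usage with a fake environ: only CONTENT_LENGTH bytes are consumed.
demo_environ = {
    "CONTENT_LENGTH": "10",
    "wsgi.input": io.BytesIO(b"x" * 10 + b"next request"),
}
dropped = drain_input(demo_environ, chunk_size=4)
```

Reading no further than CONTENT_LENGTH matters because WSGI 1.0 input streams are not guaranteed to signal end of input by themselves.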


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] PEP 444 (aka Web3)

2010-09-17 Thread And Clover

On 09/17/2010 04:21 AM, Ian Bicking wrote:


Yes, if we get rid of SCRIPT_NAME/PATH_INFO then the problem goes away.  For
servers without access to the unencoded value, reencoding those values
doesn't actually lose any information over what we have now, and avoids any
encoding issues.


It doesn't lose any information, but it also makes script_name/path_info 
inherently unreliable. My fear is that if gateways are allowed to create 
a reconstructed script_name/path_info without clearly signalling they 
have done so, those values will continue to be unreliable at all times 
and server authors won't feel the need to get it right since it's broken 
everywhere anyway: the unhappy status quo.


This is why I am continuing to plead for a 'script_name/path_info are 
authoritative' flag in environ that applications can use to detect 
situations where it is safe to go ahead and rely on them. I want to say 
"Unicode paths are supported if your server/gateway does", not "Unicode 
paths might sometimes work, depending on how you configure your server 
and application".


It is not just CGI that is affected here! IIS does not provide the 
original undecoded path at all, even through ISAPI.


At the moment I am using a 'fixPathInfo' method in my form-reading layer 
to try to compensate as much as possible for the problems of CGI:


  - on Python 2 on Windows, re-read the environment variables using
    ctypes if available, to avoid the mangling caused by reading
    os.environ using mbcs. (This didn't use to work, as old versions
    of IIS deliberately mbcs-filtered values before putting them in the
    environment, but it does now.)

  - on Python 3 on POSIX, re-read the environment variables using
    environb if available. Otherwise try to reverse the faulty decoding
    of environ using surrogateescape, where available.

  - on Windows, encode the Unicode environment to bytes using
    ISO-8859-1 if the server is Apache, or UTF-8 if the server is
    IIS. (IIS tries to decode path bytes using UTF-8, falling back
    to mbcs where the input is not valid UTF-8. Unfortunately there
    is no way to tell this has happened.)

  - when the server is Microsoft-IIS, remove the erroneously repeated
    SCRIPT_NAME components from the front of PATH_INFO. (This is a
    long-standing bug that can be configured away using the
    allowPathInfo/AllowPathInfoForScriptMappings configs, but no-one
    does as it breaks ASP.)

However, the form layer is not really the right place to be doing these 
hacks. It would be better done in the stdlib CGI handler.
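
The last fixup in the list above could look roughly like the following. This is a simplified sketch, not the actual fixPathInfo code: the function name is invented, it compares case-sensitively, and it cannot tell a genuine PATH_INFO that happens to begin with the script name from the IIS bug.

```python
def strip_repeated_script_name(environ):
    """Return PATH_INFO with the IIS-duplicated SCRIPT_NAME prefix removed."""
    server = environ.get("SERVER_SOFTWARE", "")
    script_name = environ.get("SCRIPT_NAME", "")
    path_info = environ.get("PATH_INFO", "")
    if (
        server.startswith("Microsoft-IIS")
        and script_name
        and path_info.startswith(script_name)
    ):
        return path_info[len(script_name):]
    return path_info


iis_environ = {
    "SERVER_SOFTWARE": "Microsoft-IIS/7.0",
    "SCRIPT_NAME": "/app.py",
    "PATH_INFO": "/app.py/some/page",
}
fixed = strip_repeated_script_name(iis_environ)
```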



Servers with REQUEST_URI can at least attempt to
reconstruct the encoded values.


This is slightly unsafe. It's something an application might want to do 
(or at least provide as an option), but a gateway probably couldn't get 
away with it for the general case because REQUEST_URI doesn't reflect 
the redirections done by a RewriteRule or an ErrorDocument.



Cookie is also the one header that can't be safely folded.


There are others, eg. Authorization. Anyway: folding doesn't happen in 
the HTTP world. It can be forgotten about.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] PEP 444 (aka Web3)

2010-09-17 Thread And Clover

On 09/17/2010 02:03 PM, Armin Ronacher wrote:


In case we change the spec as Ian mentioned above, I am all for
a wsgi.guessed_encoding = True flag or something like that.


Yes, I'd like to see that. I believe going with *only* a 
raw-or-reconstructed path_info, rather than having both path_info and 
PATH_INFO, is probably best, for the middleware-duplication reasons PJE 
mentioned.


A more in-depth possibility might be:

wsgi.path_accuracy =

0: script_name/path_info have been crudely reconstructed from
SCRIPT_NAME/PATH_INFO from an unknown source. Beware!
If there is to be backwards compatibility with WSGI1, this
would be seen as the 'default value' given a missing path_accuracy.

1: script_name/path_info have been reconstructed, but it is known
that path_info is accurate, other than %2F and non-ASCII issues.
That is, it's known that the path doesn't come from IIS's broken
PATH_INFO, or the IIS error has been detected and compensated for.

2: script_name/path_info have been reconstructed using known-good
encodings for the env. The only way in which they may differ from
the original request path is that a slash might originally have
been a %2F. (This is good enough for the vast majority of
applications.)

3: script_name/path_info come directly from the request path
without any intervening mangling.
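
From the application side, using such a flag might look like this sketch. `wsgi.path_accuracy` is only the name floated in this message, not part of any published spec, and the helper name is invented.

```python
def path_is_trustworthy(environ, required=2):
    """True if the gateway claims at least the required path accuracy.

    A missing key is treated as level 0, matching the 'default value'
    suggested for WSGI 1 backwards compatibility.
    """
    return environ.get("wsgi.path_accuracy", 0) >= required


def may_emit_unicode_urls(environ):
    # Level 2 is described as good enough for most applications:
    # only the "/" versus "%2F" distinction may have been lost.
    return path_is_trustworthy(environ, required=2)
```

An application that routes on encoded slashes would instead ask for level 3.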


Unless I am mistaken, the same is true for CGI scripts running on
Apache2 on Windows.


Yes, it's true of *all* CGI scripts, but also for non-CGI scripts on IIS.


I did some tests a while ago and was pretty sure that Apache2 on Windows
did the same.


Apache-on-Windows puts the bytes of the decoded path into the 
environment variables as one code unit per byte: that is, as if encoded 
by ISO-8859-1. You still have to read the environ using ctypes because 
mbcs is never ISO-8859-1, but at least the original bytes are 
recoverable, which isn't the case with IIS.



The correct place for these hacks would be the appropriate WSGI/Web3
handler of the webserver.


The IIS PATH_INFO-prefix hack would be appropriate to put in an 
IIS-specific handler; indeed, I believe isapi_wsgi does just that. But 
the other hacks are specific to CGI.


For CGI, there is no 'handler of the webserver', there is only the 
standard CGI-to-WSGI adapter, so this is the only component it is 
reasonable to burden with the hacks. Frameworks and libraries further up 
the stack cannot reliably do the fixups, because they don't know whether 
the WSGI environ they have been given comes from os.environ or somewhere 
else, or whether middleware has played with it.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] PEP 444 (aka Web3)

2010-09-16 Thread And Clover

On 09/16/2010 02:05 AM, P.J. Eby wrote:


note that the spec's sample CGI
implementation does not itself provide the new variables


It can't: "This is the original URL-encoded value derived from the 
request URI. If the server cannot provide this value, it must omit it 
from the environ." A CGI gateway doesn't have access to the original 
URL-encoded value.



middleware must be explicitly written to handle the case where there is
duplication.


The alternative to duplication would be to allow a gateway to try to 
'reconstruct' `path_info` from CGI `PATH_INFO`.


If this is done there really needs to be a flag somewhere to say that it 
has been done, ie. that `/` and non-ASCII characters in the path are 
unreliable. Otherwise we're just going to end up in the same sorry 
situation we have today where all sorts of different encodings and 
corruptions lurk inside PATH_INFO and apps simply cannot rely on it.


chr...@plope.com wrote:

 The most sensible thing to me would be to put it in
PATH_INFO.

Please don't have a field with encoded semantics that re-uses the name 
of a field that has always had decoded semantics.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread And Clover

On 07/14/2010 06:43 AM, Ian Bicking wrote:


There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
and HTTP_COOKIE.


(And of those, PATH_INFO is the only one that really matters, in that 
no-one really uses non-ASCII script filenames, and non-ASCII characters 
in Cookie/Set-Cookie are still handled so differently/brokenly across 
browsers that you can't rely on them at all.)



* I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
exclusively with encoded versions


For compatibility with existing apps, how about keeping the existing 
SCRIPT_NAME and PATH_INFO as-is (with all their problems), and 
specifying that the new 'raw' versions (whatever they are called) are 
added only if they really are raw, not reconstructed.


Then existing scripts that don't care about non-ASCII and slashes can 
carry on as before, and for apps that do care about them, they'll be 
able to be *sure* the input is correct. Or they can fall back to 
PATH_INFO when not present, and avoid producing these kinds of URLs in 
response.


(Or an app might have enough special knowledge to try other fallback 
mechanisms when the raw versions are unavailable, such as REQUEST_URI or 
Windows ctypes envvar hacking. But if the server/gateway has good raw 
paths it shouldn't bother using these.)
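
A sketch of that fallback from the application's point of view. The key name `example.raw_path_info` is purely hypothetical, since the message leaves the name open, and the function name is invented as well.

```python
from urllib.parse import unquote_to_bytes

RAW_KEY = "example.raw_path_info"   # hypothetical name for the 'raw' variable


def path_segments(environ):
    """Return (segments_as_bytes, is_reliable) for the request path."""
    raw = environ.get(RAW_KEY)
    if raw is not None:
        # Still %-encoded, so split first: "%2F" stays inside its segment.
        return [unquote_to_bytes(part) for part in raw.split("/")], True
    # Old-style PATH_INFO: already decoded, slashes are ambiguous.
    decoded = environ.get("PATH_INFO", "")
    return [part.encode("iso-8859-1") for part in decoded.split("/")], False
```

When the second element is False, the application knows to avoid generating URLs that contain encoded slashes or non-ASCII characters.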


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread And Clover

On 07/16/2010 12:07 PM, Graham Dumpleton wrote:


If you do that, one has to ask the question, given it is more convention than
anything, why it isn't just a x-wsgiorg extension specification


Yes, fine by me either way.

I just want to be able to say "this application can use Unicode paths 
when run on a server/gateway that supports standardised feature X", 
rather than the current mess of "you can have Unicode paths if you use 
one of the dozen different server-and-platform combinations we've 
specifically coded workarounds for".


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] Draft PEP: WSGI 1.1

2010-04-15 Thread And Clover

Dirkjan Ochtman wrote:


1. The application is passed an instance of a Python dictionary
   containing what is referred to as the WSGI environment. All keys
   in this dictionary are native strings. For CGI variables, all names
   are going to be ISO-8859-1 and so where native strings are
   unicode strings, that encoding is used for the names of CGI
   variables.


Perhaps explain where those ISO-8859-1 bytes might come from:

...are native strings. Where native strings are Unicode, any
keys derived from byte-oriented sources (such as custom headers
in the HTTP request reflected in the CGI environment variables)
should be decoded using the ISO-8859-1 encoding.


3. For the CGI variables contained in the WSGI environment, the values
   of the variables are native strings. Where native strings are
   unicode strings, ISO-8859-1 encoding would be used such that the
   original character data is preserved and as necessary the unicode
   string can be converted back to bytes and thence decoded to unicode
   again using a different encoding.


Good. The only problem that remains with this is that in certain 
environments (notably: all IIS use, not just CGI) a WSGI gateway cannot 
fully comply with this requirement.


a. disallow environments that cannot be sure they are preserving the 
original byte data from declaring that they support wsgi.version 1.1?


b. add an extra wsgi.something flag for a WSGI server to add, to specify 
that it is sure that the original bytes have been preserved? (ie. so 
wsgiref's CGI handler would have to declare it wasn't sure when running 
under Windows.)


c. just let WSGI gateways silently ignore the ISO-8859-1 requirement if 
they can't honour it and let the application spend its time trying to 
unravel the mess (status quo).


(Can wsgiref be fixed to use ISO-8859-1 in time for Python 3.2?)


7. The iterable returned by the application and from which response
   content is derived, should yield byte strings. Where native strings
   are unicode strings, the native string type can also be returned in
   which case it would be encoded as ISO-8859-1.



8. The value passed to the 'write()' callback returned by
   'start_response()' should be a byte string. Where native strings
   are unicode strings, a native string type can also be supplied, in
   which case it would be encoded as ISO-8859-1.


Weren't we going to only allow US-ASCII for the output? (These threads 
are always so far apart I can never remember what conclusion we 
reached... if any.)


Whilst ISO-8859-1 is in the HTTP standard for headers, and required to 
preserve bytes in input, it's much, much less likely that the response 
body is going to be ISO-8859-1. It could maybe be cp1252, but more 
likely the author wanted UTF-8.


If we must support Unicode strings for response body output at all, I'd 
prefer to be conservative here and spit a UnicodeEncodeError straight 
away, rather than quietly mangle characters U+0080 to U+00FF.
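
To make the difference concrete, here is a sketch of the conservative behaviour. The helper name is invented; it stands in for whatever the gateway does with each chunk of the response iterable.

```python
def encode_body_chunk(chunk):
    """Pass bytes through; accept str only if it is pure US-ASCII."""
    if isinstance(chunk, bytes):
        return chunk
    # Raises UnicodeEncodeError for anything at or above U+0080.
    return chunk.encode("us-ascii")


# The quiet mangling being warned about: the author almost certainly
# wanted the UTF-8 bytes, but ISO-8859-1 encoding succeeds silently.
wanted = "\u00e9".encode("utf-8")        # b'\xc3\xa9'
mangled = "\u00e9".encode("iso-8859-1")  # b'\xe9'
```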


Manlio Perillo wrote:


The run_with_cgi sample function should be changed, since it probably
does not work correctly, on Python 3.x.


Yes, the 'URL Reconstruction' fragment will be wrong too, since it uses 
urllib.quote() to encode the path part. quote() defaults to UTF-8 rather 
than the ISO-8859-1 WSGI 1.1 requires.
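
The effect is easy to reproduce with Python 3's urllib.parse.quote, which takes an encoding argument. The path below is an invented example of a WSGI str that carries the UTF-8 bytes C3 A9.

```python
from urllib.parse import quote

# WSGI str for a request path whose bytes were UTF-8 "/caf\u00e9".
path_info = "/caf\u00c3\u00a9"

# Default (UTF-8) quoting encodes each of the two characters again,
# producing a double-encoded URL.
default_quoted = quote(path_info)

# Quoting as ISO-8859-1 reproduces the original request bytes.
latin1_quoted = quote(path_info, encoding="iso-8859-1")
```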


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] CGI WSGI and Unicode

2009-12-08 Thread And Clover

Manlio Perillo wrote:


In a CGI application, HTTP headers are Unicode strings, and are decoded
using system default encoding.



In a future WSGI application, HTTP headers are Unicode strings, and are
decoded using latin-1 encoding.


Yes. As proposed, WSGI 1.1 would require the CGI-to-WSGI handler to undo the 
decode stage caused by reading environ using the default encoding. At 
least this is now reliably possible thanks to surrogateescape.


PATH_INFO is the only really important HTTP-related environment variable 
for Unicode. Potentially SCRIPT_NAME could also be significant in 
relation to PATH_INFO. The HTTP headers don't massively matter because 
there are almost never any non-ASCII characters in them.


Previously the job of undoing an unwanted decode step was dumped on 
whatever read the PATH_INFO; usually a routing component, which would 
have to make guesses with typically poor results. The CGI adapter is in 
a much better place to do it, being closer to the server.


 The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode 
strings, not ready-to-use Unicode strings. It is up to the application 
to decide how they want to handle those bytes; maybe they want Latin-1 
and can do nothing, maybe they want to recode to UTF-8, maybe something 
else completely. No solution satisfies every app so there is always 
going to have to be a recode step somewhere.


An application that doesn't want to think about this will use a 
framework that does it for them.


 What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right' 
Unicode string for more browsers than any other potential encoding.


In any case as previously discussed, non-ASCII cookies are already 
totally broken everywhere and hence used by no-one.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread And Clover

Manlio Perillo wrote:


However what about URI (that is, for PATH_INFO and the like)?
For URI (if I remember correctly) the suggested encoding is UTF-8, so
URLS should be decoded using



  url.decode('utf-8', 'surrogateescape')



Is this correct?


The currently-discussed proposal is ISO-8859-1, allowing the real bytes 
to be trivially extracted. This is consistent with the other headers and 
would be my preferred approach.


Python 3.1's wsgiref.simple_server, on the other hand, blindly uses 
urllib.unquote, which defaults to UTF-8 without surrogateescape, 
mangling any non-UTF-8 input.
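
A short demonstration of the mangling, using current urllib.parse functions. The request path is an invented example containing a single ISO-8859-1 byte.

```python
from urllib.parse import unquote, unquote_to_bytes

request_path = "/caf%E9"   # %E9 is not valid UTF-8 on its own

# Default behaviour: decode as UTF-8 with errors="replace".
# The byte is replaced by U+FFFD and cannot be recovered.
mangled = unquote(request_path)

# Byte-preserving alternative: unquote to bytes, then wrap in ISO-8859-1.
as_bytes = unquote_to_bytes(request_path)
wsgi_path = as_bytes.decode("iso-8859-1")
```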


I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding 
is blessed. But *something* needs to be blessed. An encoding, an 
alternative undecoded path_info, both, something else... just *something*.



Let's consider the `wsgiref.util.application_uri` function
There is a potential problem, here, with the quote function.


Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 
3.0, but still broken. Until we can come to a Pronouncement on what WSGI 
*is* in Python 3, it is meaningless anyway.



Cookie data SHOULD be transparent to the server/gateway; however WSGI is
going to assume that data is encoded in latin-1.


Yeah. This is no big deal because non-ASCII characters in cookies are 
already broken everywhere(*). Given this and other limitations on what 
characters can go in cookies, they are habitually encoded using ad-hoc 
mechanisms handled by the application (typically a round of URL-encoding).


*: in particular:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
  mangling any characters that don't fit in the codepage through the
  traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
  gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.


I don't know what the HTTP/Cookie spec says about this.


The traditional interpretation of RFC2616 is that headers are ISO-8859-1.

You will notice that no browser correctly follows this.

...sigh.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/





Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread And Clover

Manlio Perillo wrote:


I have written a simple WSGI application that asks authentication
credentials


Ho ho! This is another area that is Completely Broken Everywhere. It's 
actually a similar situation to the cookies:


- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
  mangling any characters that don't fit in the codepage through the
  traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
  gets through but everything else is mangled)
- Safari uses ISO-8859-1, and refuses to send any cookie containing
  characters outside the 8859-1 repertoire.
- Konqueror uses ISO-8859-1, and replaces any non-8859-1 character
  with a question mark.

The HTTP standard has nothing to say about the encoding in use *inside* 
the base64-encoded Authorization byte-string token. It's anyone's guess, 
and every browser has guessed differently. (Safari here is at least 
slightly better than its behaviour with the cookies.)
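
Given that, about the only safe thing a library can do is hand back bytes and let the application guess. A sketch, with an invented helper name:

```python
import base64


def basic_credentials_bytes(header_value):
    """Split a Basic Authorization header into (user, password) bytes.

    Returns None if the header is not well-formed Basic auth. No text
    decoding is attempted, since the encoding is anyone's guess.
    """
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(token.strip(), validate=True)
    except ValueError:          # binascii.Error is a ValueError subclass
        return None
    user, sep, password = decoded.partition(b":")
    if not sep:
        return None
    return user, password


# A browser that sends UTF-8 (as Opera and Chrome are described doing):
header = "Basic " + base64.b64encode("us\u00e9r:pw".encode("utf-8")).decode("ascii")
credentials = basic_credentials_bytes(header)
```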


 (and I suspect that [IE] always use this encoding, instead of
 iso-8859-1).

It will certainly never send ISO-8859-1, but what it does send is locale 
dependent. Type an e-acute in your username on a Western machine and 
it'll send one byte sequence; type the same thing on an Eastern European 
Windows install and you'll get something quite different.



Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac'



I don't know where \xac come from


It's the low byte of UCS-2 codepoint U+20AC (EURO SIGN). Firefox simply 
discards the top 8 bits of each codepoint.
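
The arithmetic can be checked in a couple of lines. This sketch only models the truncation described above for characters in the Basic Multilingual Plane; the function name is invented.

```python
def low_bytes(text):
    """Keep only the low 8 bits of each code point (BMP characters only)."""
    return bytes(ord(ch) & 0xFF for ch in text)


euro = low_bytes("\u20ac")      # U+20AC -> 0xAC
e_acute = low_bytes("\u00e9")   # U+00E9 -> 0xE9, same as ISO-8859-1
```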



Unfortunately I can not test with IE 7 and 8.


The behaviour has not changed.

 This is really a mess.

Isn't it.

 How is authorization username handled in common WSGI frameworks?

No-one supports non-ASCII characters in Authentication. Most web authors 
simply move to cookies instead.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread And Clover

Manlio Perillo wrote:


Words of *TEXT MAY contain characters from character sets other than
ISO-8859-1 [22] only when encoded according to the rules of RFC 2047


Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to 
RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself 
specifically denies that an encoded-word can go in a quoted-string.


RFC2047 encoded-words are not on-topic in an HTTP header(*); this has 
been confirmed by newer development work on HTTPbis by Reschke et al. 
(http://tools.ietf.org/wg/httpbis/).


The correct way of escaping header parameters in an RFC*822-family 
protocol would be RFC2231's complex encoding scheme, but HTTP is 
explicitly not an 822-family protocol despite sharing many of the same 
constructs. See 
http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a 
strategy for how 2231 should interact with HTTP, but note that for now 
RFC2231-in-HTTP simply does not exist in any deployed tools.


So for now there is basically nothing useful WSGI can do other than 
provide direct, byte-oriented (even if wrapped in 8859-1 unicode 
strings) access to headers.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-11-30 Thread And Clover

Graham Dumpleton wrote:


Answering my own question, it is actually obvious that it has to be
called (1, 0). This is because wsgiref in Python 3.X already calls it
(1, 0) and don't have much choice to be in agreement with that.


wsgiref.simple_server in Python 3 to date is not something that anyone 
should worry about being compatible with. It is a 2to3 hack that cannot 
meaningfully claim to represent wsgi version anything.


Careless use of urllib.parse.unquote causes 3.0's simple_server not to 
work at all, and 3.1's to mangle the path by treating it as UTF-8 
instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even 
mod_cgi via wsgiref.CGIHandler) delivered.


Yes, I'm always going on about Unicode paths. I'm fed up of shipping 
apps with a page-long deployment note about fixing them. It pains me 
that in so many years both this and "What do we do about Python 3?" 
still haven't been addressed.


mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would 
prefer not to see more farcical reverse-progress at this point.


For what it's worth, here are my responses on the issues of this thread. But at 
this point I really just want a BDFL to just come and do it, whatever it 
is. A new WSGI, whatever the version number, is massively overdue.


 1. The 'readline()' function of 'wsgi.input' may optionally take a 
size hint.


Yes. Obviously. Bad practice but unavoidable now. Should have been a 1.0 
amendment a long time ago.


 2. The 'wsgi.input' must provide an empty string as end of input 
stream marker.
 3. The size argument to 'read()' function of 'wsgi.input' would be 
optional and if not supplied the function would return all available 
request content.
 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour 
the Content-Length response header and must only return from the file 
that amount of content.


+0. Seems reasonable but don't massively care. Presumably an application 
must refuse to run on 1.0 if it requires these behaviours?


 5. Any WSGI application or middleware should not return more data 
than specified by the Content-Length response header if defined.
 6. The WSGI adapter must not pass on to the server any data above 
what the Content-Length response header defines if supplied.


Yes.
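
Points 5 and 6 amount to truncating the response iterable at the declared length. A sketch of how an adapter might do that, with an invented function name:

```python
def limit_body(chunks, content_length):
    """Yield at most content_length bytes from an iterable of byte strings."""
    remaining = content_length
    for chunk in chunks:
        if remaining <= 0:
            break
        piece = chunk[:remaining]
        remaining -= len(piece)
        yield piece


sent = b"".join(limit_body([b"hello", b"world"], 7))
```

A real adapter would also need to call close() on the application's iterable, which this sketch leaves out.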

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] Future of WSGI

2009-11-26 Thread And Clover

Ian Bicking wrote:


The proposal that seemed to work best was to keep the environ as str (i.e.,
unicode in Python 3), and eliminate the problematic SCRIPT_NAME and
PATH_INFO, replacing them with url-encoded values.


Ah, OK, if that's where we got to I'm happy with that - as long as the 
application/framework can tell the difference between (a) old-school 
WSGI 1.0 decoded PATH_INFO, (b) new verbatim PATH_INFO, and (c) a new 
verbatim PATH_INFO that has been created from an old PATH_INFO by a WSGI 
handler unfortunate enough to be running under CGI or IIS, potentially 
including mangled characters. I would prefer to avoid the latter completely.


This could be achieved by giving the new variables a different name and 
only including them if they're safe (leaving the application to fall 
back to the old variables where unavailable), or by having a flag to 
specify they're verbatim and leaving it unset when unmangled verbatim is 
unavailable.



Also I think everyone is okay with removing start_response.


+0.5: very much happy to see it gone, but if it causes any more delay to 
a WSGI update I'm also not unhappy if it stays. My primary concern is 
that a Python-3-compatible WSGI is available as soon as possible; every 
long argument in here seems to lead to no resolution. I want to release 
Python 3 web code, and cannot whilst WSGI remains in flux.


Whilst in principle I kind of agree with Malthe that keeping the 
CGI-derived environ separate from items like wsgi.input would be 
appropriate, in practice I don't give a stuff about it: the merged 
dictionary causes no practical problems, and changing it would be an 
enormous upheaval for all WSGI users.


WSGI doesn't need to be pretty, it needs to be widely-compatible. 
Authors who want pretty can use frameworks, which will be happy to 
deliver elegant Request and Response objects.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO

2009-09-24 Thread And Clover

Ian wrote:


some environments provide only the unquoted path.  I think it's not
terribly horrible if they fake it by re-quoting the path.


If they are faking it, there should IMO be a way to flag that it's 
faked. Then an application that uses IRIs may choose to


a. generate an error, or
b. carry on, don't care about %2F and just hope the encodings match, or
c. fall back to outputting only ASCII URLs.


I also believe you can safely reconstruct the real SCRIPT_NAME/PATH_INFO
from REQUEST_URI, which is usually available


I wouldn't say 'usually', REQUEST_URI is Apache-specific. I haven't 
checked other servers to see if they copy it, but IIS certainly doesn't.


SCRIPT_NAME/PATH_INFO can differ completely from REQUEST_URI when Apache 
has done some internal redirection, for example via mod_rewrite or 
ErrorDocument. It's certainly useful as a fixup possibility (several of 
my apps optionally use it), but not something that can really be relied 
upon.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO

2009-09-23 Thread And Clover

Ian Bicking wrote:


I propose we switch primarily to native strings: str on both Python 2 and
3.


Fine.


wsgi.input remains byte-oriented, as does the response app_iter.


Good.


These both form the original path.  It is not URL decoded, so it should be
ASCII.


Great! BUT.

Undecoded script_name/path_info *cannot* be provided by some gateways: 
primarily, but not only, CGI.


Such a gateway can reconstruct what it thinks the undecoded versions 
should have been, but this is not reliably accurate. I would like a way 
(eg. a flag) for the gateway or server to specify to the application 
that script_name/path_info are potentially inaccurate. Then the 
application can react by avoiding IRI (and %2F, though arguably that 
should be avoided anyway).



This sends different text, but is highly preferable.


Yes. All schemes to send non-ASCII in cookies require sending different 
text; URL-encoding is a common choice of ad-hoc wrapping. I don't think 
WSGI has to worry too much about explaining this, it's a known hazard of 
the web in general. It doesn't work in any other environment, so nobody 
should be expecting it to work in WSGI.



What happens if you give unicode text in the response headers that cannot be
encoded as Latin1?


UnicodeEncodeError.
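
For illustration, this is roughly what a server that encodes header values as ISO-8859-1 would run into on Python 3. A small sketch; `encode_header_value` is a made-up name, not an API of any server:

```python
def encode_header_value(value):
    """Encode a native-string header value the way a Latin-1 based server would."""
    return value.encode("iso-8859-1")


ok = encode_header_value("caf\u00e9")  # U+00E9 exists in Latin-1

try:
    encode_header_value("\u65e5\u672c")  # CJK characters do not
    failed = False
except UnicodeEncodeError:
    failed = True
```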


Should some things be unicode on Python 2?


Don't think so.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/





Re: [Web-SIG] Getting back to WSGI grass roots.

2009-09-23 Thread And Clover

Graham wrote:


So, rather than throw away completely the idea of bytes everywhere,
and rewrite the WSGI specification, we could instead say that the
existing conceptual idea of WSGI 1.0 is still valid, and just build on
top of it a translation interface to present that as unicode.


I don't think we really need to. Almost nothing in WSGI itself actually 
touches Unicode. HTTP headers may in theory be ISO-8859-1 (and certainly 
should be handled as such), but in the real world they are exclusively 
ASCII (anything else breaks browsers).


SCRIPT_NAME/PATH_INFO is the only part of the spec that potentially 
needs more than ASCII, and even then the majority of apps don't put any 
Unicode characters in those (especially SCRIPT_NAME). I don't think it's 
worth adding the complication of a two-layer interface just for this one 
case.


If we can hack around SCRIPT_NAME/PATH_INFO separately as per the other 
thread there's no longer any need for anything but ASCII, so WSGI's 
strings can be bytes or unicode depending on your taste/Python-version, 
without it hurting anyone. The important job of mapping


* query parameters,
* POSTed request bodies, and
* response bodies

between bytes and unicode remains firmly in the application/framework's 
area of concern and needs no assistance from WSGI.
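
As a rough Python 3 illustration of the application owning that mapping: QUERY_STRING stays %-encoded ASCII, so the application (or its framework) can decode it with whatever charset it expects. `query_params` is a hypothetical helper; only `urllib.parse.parse_qs` and `quote` are real library calls here.

```python
from urllib.parse import parse_qs, quote


def query_params(environ, encoding="utf-8"):
    """Parse QUERY_STRING using the charset the application has chosen."""
    return parse_qs(environ.get("QUERY_STRING", ""), encoding=encoding)


# Build a query the way a Shift-JIS form submission would have encoded it.
text = "\u65e5\u672c"
environ = {"QUERY_STRING": "q=" + quote(text, encoding="shift_jis")}

as_sjis = query_params(environ, "shift_jis")  # the application knows its charset
as_utf8 = query_params(environ)               # a wrong guess yields replacement characters
```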


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread And Clover

Alan Kennedy wrote:


Why can't we just do the same as the java servlet spec?


Because Servlet is a walking, stinking demonstration of how *not* to 
handle encodings.


Every servlet container has its own different method of selecting input 
character sets, and the default encoding is almost never right. Most 
deployed JSP applications out there are using the wrong charset and do 
the wrong thing with any non-ASCII character. This is not something to 
aim for.


Pushing the choice of encodings out to a 'deployment issue' where the 
application doesn't get to decide is a Wrong Thing. I hate dealing with 
this nonsense in Java and I do not want the same approach to become 
common in Python.


 I see this as being the same as Graham's suggested approach of a
 per-server configurable charset

This is absolutely the opposite of what I want as an application author. 
I want to hand out my WSGI application that uses UTF-8 and know that 
wherever it is deployed the non-ASCII characters will go through without 
getting mangled.


The application (perhaps via a framework it is using) is the party that 
is in the best place to know what character encoding it wants to deal 
with. Give the application a consistent way to handle that encoding 
itself, because the poor sod deploying it isn't going to know any better.


 Those frameworks obviously have solved all of the problems of decoding
 incoming request components from miscellaneous unknown character sets
 into unicode, with out any mistakes

Er, no. That's the point. It cannot currently be done in all deployment 
environments. When they're not running via their own built-in servers, 
the frameworks have to do the same as the rest of us: guess.


That guess may not be as troublesome as it is in Java (mainly because 
for us it doesn't affect QUERY_STRING parameters), but it's still not 
reliable, which is why today you can't have a WSGI application with 
pretty non-ASCII URLs that will deploy consistently. I want this fixed.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-22 Thread And Clover

Graham wrote:

 Armin has fast asleep now, so my shift.

Heh. It's a multiple-man job keeping up with this monster thread!


The URLs don't break.


Not in themselves. Just the language of the PEP implies that to fix them 
up would contravene the spec:


 The application MUST use [the encoding guess for PATH_INFO] to decode
 the ``'QUERY_STRING'`` as well.

This isn't appropriate even as a SHOULD: the guessed encoding for 
PATH_INFO is very likely to be wrong, in particular for cases where the 
path was purely ASCII.


The application (or a library/framework acting on its behalf) should be 
allowed to decode QUERY_STRING using whatever encoding it is expecting. 
Disallowing using anything other than utf-8 (and iso-8859-1 in a very 
unreliable way) makes it impossible to have queries in any other 
encoding at all and still comply with the spec, which is undesirable.


If this sentence is removed, and `wsgi.uri_encoding` is guaranteed to be 
one of:


  a. definitive and reliable, or
  b. missing/None

I'm pretty much happy. What I don't want is that half the future-WSGI 
servers/gateways decide they have to provide *some* value for 
`wsgi.uri_encoding` even if they're not quite sure if it's the right 
one. Then we're back to square one.
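
A sketch of how an application might consume such a key, assuming the semantics argued for above (either reliable or absent). `wsgi.uri_encoding` is the draft name used in this thread; `decode_path_info` is a hypothetical helper, not anything the PEP defines.

```python
def decode_path_info(environ, wanted="utf-8"):
    """Return PATH_INFO in the encoding the application expects,
    or None when the server gave no reliable encoding."""
    path = environ.get("PATH_INFO", "")
    server_encoding = environ.get("wsgi.uri_encoding")  # draft key from this thread
    if server_encoding is None:
        return None  # unknown: the caller must fall back, e.g. to ASCII-only URLs
    try:
        # Undo the server's decode to get the bytes back, then decode as wanted.
        return path.encode(server_encoding).decode(wanted)
    except UnicodeError:
        return None


# The server decoded UTF-8 bytes as ISO-8859-1 and said so reliably.
fixed = decode_path_info(
    {"PATH_INFO": "/caf\u00c3\u00a9", "wsgi.uri_encoding": "iso-8859-1"}
)
# A CGI-style gateway that cannot know the encoding omits the key.
unknown = decode_path_info({"PATH_INFO": "/caf?"})
```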



if it is known that an application or some subset of
URLs will always be receiving a request as non UTF-8, then it should
employ code in those cases to always transcode it to the required
encoding.


Yep, agreed. I think the PEP should clarify that; at the moment it is 
saying that a transcode is something you should only do for the 
iso-8859-1 case, but if you actually followed that advice you'd get 
highly inconsistent results. Perhaps we're at cross-purposes as to what 
exactly constitutes 'middleware'...



The other fallback is that a specific WSGI server could elect to
provide an option to not use 'UTF-8' as the first choice for decoding


I really, *really* hope this does not happen. That just brings us more 
deployment heartaches.



Whether surrogateescape gives a better solution I have no idea at this
point


Yeah... I'm highly suspicious of surrogateescape in a web context and 
personally my code will be deliberately filtering all such characters 
out. I can see it being a possible way to smuggle unwanted sequences 
(such as overlongs) through filters, potentially causing endless 
security problems. But we'll see...
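
To make the concern concrete, here is a sketch (Python 3.1+, where the `surrogateescape` error handler exists) of an overlong sequence surviving a decode/encode round trip, and of the kind of check a filter could apply. `has_escaped_bytes` is a hypothetical name.

```python
def has_escaped_bytes(text):
    """True if text contains lone surrogates produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


overlong_slash = b"\xc0\xaf"  # invalid (overlong) UTF-8 encoding of '/'

# Strict UTF-8 would reject this; surrogateescape lets it through as text.
smuggled = overlong_slash.decode("utf-8", "surrogateescape")

# The original bytes come back unchanged on re-encoding.
round_trip = smuggled.encode("utf-8", "surrogateescape")

suspicious = has_escaped_bytes(smuggled)
clean = has_escaped_bytes("/caf\u00e9")
```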


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Request for Comments on upcoming WSGI Changes

2009-09-21 Thread And Clover

Armin Ronacher wrote:


The middleware can never know.


It's much more likely to know than the server though!

 WSGI will demand UTF-8 URLs and only
 provide iso-XXX support for backwards compatibility.

It doesn't sound much like backwards compatibility to me if non-UTF-8 
URLs break as soon as they coincidentally happen to be UTF-8 byte 
sequences. I'm as much an advocate of "UTF-8 for everything everywhere!" 
as anyone else, but unfortunately today there are still dark places 
where you need non-UTF-8 URLs.


Incidentally, if wsgi.uri_encoding is going to be the way to signal that 
the server has decoded bytes to characters using a known encoding, it 
should be stressed that this should only be set when that encoding is 
certain.


That is, wsgi.uri_encoding should be omitted (or None?) in cases where 
another party has already decoded (and maybe mangled) the bytes using an 
unknown encoding. In particular, CGI.


(In the case of Windows CGI the server will have decoded URI bytes into 
Unicode characters, using a charset which it is impossible to find out. 
In Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid 
UTF-8 sequence, otherwise it's the system codepage. This problem affects 
the non-CGI implementation isapi_wsgi, too. Then the variables are read 
as environment variables, which for Python 2 means another encode/decode 
step on Windows using the system codepage, mangling non-codepage 
characters. Python 3 has the opposite problem reading byte envvars using 
UTF-8, which won't be how Apache put them there.)


If wsgi.uri_encoding is obligatory then in reality it will often be wrong, 
leaving us in the same pathetic predicament as with WSGI 1.0, where 
non-ASCII URIs don't work reliably at all.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-18 Thread Andrew Clover

ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)


Hmm... it turns out: no. IIS appears to be mangling characters that are 
not in mbcs even *before* it puts the decoded value into the envvars.


The same is true with isapi_wsgi, which is the only other WSGI adapter I 
know of for IIS. This gets the same mangled byte string from 
GetServerVariable as Python gets from the envvars, so it looks like this 
is a mistake IIS is making further up before it even hits the CGI 
handler. Maybe someone more familiar with ISAPI knows a better way to 
read PATH_INFO than GetServerVariable, but I can't see anything 
promising in MSDN.


So it would seem to be impossible at the moment to have Unicode paths 
work under IIS at all.


The ctypes approach could rescue bytes for the Apache/nt/Py2 combination 
(perhaps also from libc.getenv for Apache/posix/Py3), but then Apache 
already gives us REQUEST_URI which is a much easier workaround. There 
might be CGI servers for Windows where ctypes could serve some purpose, 
but I can't think of any currently in use other than the Big Two.


In summary, to get the original submitted byte strings for PATH_INFO:

Apache/nt/Py2
    process REQUEST_URI
Apache/posix/Py2
    use PATH_INFO directly
    (or process REQUEST_URI)
Apache/nt/Py3
    encode PATH_INFO to ISO-8859-1
    (or process REQUEST_URI)
Apache/posix/Py3
    process REQUEST_URI
IIS/nt/Py2
    decode PATH_INFO from mbcs, then encode to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
IIS/nt/Py3
    encode PATH_INFO to UTF-8
    FAIL for characters not in current mbcs
    FAIL for non-UTF-8 input
wsgiref.simple_server/Py2
    use PATH_INFO directly
wsgiref.simple_server/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input
cherrypy.wsgiserver/Py2
    use PATH_INFO directly
cherrypy.wsgiserver/Py3
    remains to be seen, but at the moment encode PATH_INFO to UTF-8
    FAIL for non-UTF-8 input

--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-17 Thread Andrew Clover

Mark Hammond wrote:


I don't think Python explicitly converts it - the CRT's ANSI version
of environ is used


Yes, it would be the CRT on Python 2.x. (Python 3.0 on non-NT does a 
conversion always using UTF-8, if I'm reading convertenviron right.)



so the resulting strings should be encoded using the 'mbcs' encoding.
What mangling do you see?


Correct, it's characters unencodable in mbcs that are lost*. mbcs is 
never equivalent to UTF-8 (which would allow us to recover characters on 
IIS) or ISO-8859-1 (which would allow us to recover characters on 
Apache-for-Windows) so there's always heavy lossage.


(* - replaced with ? or Windows's attempt to substitute something that 
looks vaguely like the original character.)



win32api and ctypes would both let you call the Windows API.


Ah! I had considered the win32 extensions but it's a bit of a 
dependency... I'd forgotten that we get ctypes for free in 2.5.


So we'd be looking at:

ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)

when CPython 2.5+/NT is detected, right? That increases the number of 
situations in which we can feasibly recover URIs that are valid UTF-8 
sequences (modulo the slash anyway). Doing the actual recovery still 
requires some server-sniffing though.



What is IIS doing wrong here?


It's not wrong as such. There are three reasonable choices for decoding 
header values before putting them in a Unicode environment, and the CGI 
spec, as it knows nothing about Unicode environment variables, fails to 
specify which:


1. ISO-8859-1 (which ensures bytes can be recovered)
2. UTF-8 (since most URIs are effectively UTF-8 today)
3. Configured system codepage (mbcs)

Apache [with mod_cgi or mod_wsgi] decides on (1). IIS tries for (2), 
falling back to (3) on invalid sequences. The text concerning Python 3.0 
in the WSGI Amendments page could be read as blessing Apache's behaviour.


However wsgiref.simple_server currently also goes for (2), although that 
probably can't be considered canonical. I'd be interested to know what 
other WSGI servers do.


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] Revising environ['wsgi.input'].readline in the WSGI specification

2008-11-17 Thread Andrew Clover

Ian Bicking wrote:


To resolve this, let's just not pass it over this time?


+1

--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.


See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.


My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode. In other words, the values will be 
unicode, but that will simply be a lie.


Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example 
wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
it were working. (It is currently broken in 3.0rc2; I put a hack in to 
get it running but I'm not really sure what the current status of 
simple_server in 3.0 is.)


A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  CGI is really an aspect of the 
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
spec itself.


Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.


Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.


I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.


I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
pre-URI-decoding, containing only %-sequences and not real high bytes, 
so it can be decoded to Unicode using any old charset without worry.


An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework 
and supporting PATH_INFO in a portable fashion already has warts thanks 
to IIS's bugs, so the situation is not much worse than it already is.


And of course we get support through mod_cgi and mod_wsgi automatically, 
so Graham doesn't have to do anything. :-)


Graham Dumpleton wrote:


I can't really remember what the outcome of the discussion was.


Not too much outcome really, unfortunately! You concluded:


there possibly still is an open question there on how
encoding of non ascii characters works in practice. We just need to
do some actual tests to see what happens and whether there is a problem. 


...to which the answer is — judging by the results posted — probably 
“yes”, I'm afraid!


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-14 Thread Andrew Clover

Ian Bicking wrote:

This is something messed up with CGI on NT, and whatever server you are 
using, and perhaps the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?)


Python decodes the environ to its own copy (wrapped in os.environ) at 
interpreter startup time; there's no way to query the real ‘live’ 
environment that I know of. It'd require a C extension.


Honestly I don't know if anyone is doing anything with 
WSGI and Python 3.


I know Graham has done some work on mod_wsgi for 3.0, but no, I don't 
know anyone using it in anger.


Is it worth submitting patches to simple_server to make it run on 3.0? 
Is it too late to include at this stage anyway? Shipping 3.0 with a 
non-functional wsgiref is a bit embarrassing.


I assume there is some way to get at the bytes in the environment, if not 
then that is a Python 3 bug.


There is not, and this appears to be deliberate.

I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on),

moving from the two keys to a single REQUEST_URI is not feasible.


That's certainly a possibility, but I feel it's easier to hitch a ride 
on the existing header, which despite being non-standard is still quite 
widely used.



I guess you'd probably count segments, try to catch %2f (where the
segments won't match up), and then double check that the decoded
REQUEST_URI matches SCRIPT_NAME+PATH_INFO.


I'm currently testing with just the segment counting. It's only 
necessary that the segments from SCRIPT_NAME are matched and stripped, 
and those are extremely unlikely to contain ‘%2F’ because:


  - there aren't many filesystems that can accept ‘/’ as a filename
    character. RISC OS is the only one I can think of, and it by
    convention swaps ‘/’ and ‘.’ to compensate as it is, so even
    there you couldn't use ‘%2F’;
  - there aren't many webservers that can map a file or alias to a
    path containing ‘%2F’;
  - no-one wants to mount a webapp alias at such a weird name — it's
    only in the section corresponding to PATH_INFO that ‘%2F’ might
    ever be of use in practice.

In the worst case, many applications already know and can strip the URL 
at which they're mounted, but unless there's a legitimate ‘%2F’ in their 
SCRIPT_NAME it doesn't actually matter.
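
The segment-counting approach could look something like the sketch below (Python 3). It assumes REQUEST_URI is a path with an optional query string, as Apache supplies it, and that SCRIPT_NAME has no trailing slash. `raw_path_info` is a hypothetical helper, and the prefix comparison is an added sanity check rather than part of the approach described above.

```python
from urllib.parse import unquote


def raw_path_info(environ):
    """Return the still %-encoded PATH_INFO taken from REQUEST_URI, or None."""
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        return None  # server does not provide it (IIS, for one)
    raw_path = request_uri.split("?", 1)[0]
    script_name = environ.get("SCRIPT_NAME", "")
    n = script_name.count("/")  # number of segments to strip
    parts = raw_path.split("/")  # parts[0] is '' before the leading slash
    prefix = "/".join(parts[: n + 1])
    # After an internal redirect the URI may not correspond to SCRIPT_NAME at all.
    if unquote(prefix, encoding="iso-8859-1") != script_name:
        return None
    return "/".join([""] + parts[n + 1 :])


environ = {"REQUEST_URI": "/app/a%2Fb/caf%C3%A9?x=1", "SCRIPT_NAME": "/app"}
path = raw_path_info(environ)
```

The encoded slash and the UTF-8 bytes both survive, so the application can decode each segment itself.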


frankly IIS is probably less relevant to most developers than CGI. 


Er... really?

You and I may not favour it, but it's ≈35% of the world out there, not 
something we can afford to ignore IMO.


So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.


What I'm saying is that neither Apache's nor IIS's behaviour can be 
considered clearly correct or wrong at this point, and there is no way a 
WSGI adapter living underneath them *can* fix up the differences.


(There is a problem with PATH_INFO that a WSGI adapter *could* clear 
up, which is that IIS makes PATH_INFO the entire path including 
SCRIPT_NAME. I'm not sure whether it's worth fixing that up in the 
adapter layer though... it's possible some frameworks are already 
dealing with it, and might even be relying on it!)


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


[Web-SIG] WSGI Amendments thoughts: the horror of charsets

2008-11-12 Thread Andrew Clover
.

Apache on Windows always uses ISO-8859-1 to decode the request path and 
put it in the Unicode envvars. This is OK so far, we have Unicode 
characters with the same codepoints as the original bytes. However, 
Python2 needs to make the envvars available as bytes. It uses the system 
default encoding; if that were ISO-8859-1, we'd be OK.


But it never is. Western European on NT is actually cp1252, whose 
characters in the range 0x80 to 0x9F differ from ISO-8859-1. And if the 
app wants UTF-8, chances are those characters are going to come up a 
lot. There is as far as I know no user-selectable Windows codepage that 
can map all the Unicode characters up to U+00FF.
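
The cp1252 problem can be reproduced on any platform with Python 3's codecs. This is a sketch only: Python's strict `cp1252` codec stands in for the Windows conversion here, and Windows itself may substitute a look-alike character rather than fail.

```python
# UTF-8 bytes for the euro sign, as a browser would submit them.
utf8_bytes = "\u20ac".encode("utf-8")  # b'\xe2\x82\xac'

# Step 1: Apache decodes the bytes as ISO-8859-1 into the Unicode envvars.
as_text = utf8_bytes.decode("iso-8859-1")

# Step 2: the text is encoded back to bytes with the system codepage.
try:
    as_text.encode("cp1252")
    lossless = True
except UnicodeEncodeError:
    lossless = False  # U+0082 has no cp1252 representation

mangled = as_text.encode("cp1252", "replace")
```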



*** Apache/NT/Python3
Wrong, but always recoverable.

Python retrieves the bytes-encoded-into-Unicode-codepoints string 
directly from the envvars. If the encoding should have been UTF-8 or 
something else other than ISO-8859-1, we can recover the original bytes 
by re-encoding to 8859-1, then decoding using the real charset.
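
That recovery step is a one-liner in Python 3. A minimal sketch, with `recover_path` as a hypothetical name and UTF-8 assumed to be the charset the application really wanted:

```python
def recover_path(path_info, real_charset="utf-8"):
    """Undo an ISO-8859-1 decode and re-decode with the intended charset."""
    original_bytes = path_info.encode("iso-8859-1")  # lossless: code points 0-255 map to bytes
    return original_bytes.decode(real_charset)


# What the envvar holds when the request path was /caf%C3%A9
from_environ = "/caf\u00c3\u00a9"
recovered = recover_path(from_environ)
```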



*** IIS/NT/Python2
Mostly unrecoverable data loss.

IIS decodes submitted bytes to Unicode using UTF-8 when it can. But if 
there is an invalid UTF-8 sequence in the bytes it will try again using 
the system codepage. Python will then re-encode the Unicode envvar using 
the system codepage.


If the app is expecting UTF-8 we can decode what Python gives us using 
the system codepage (ie. 'mbcs') and get back any of the submitted 
characters that happened to be in this server's system codepage. Other 
characters may be replaced by question marks or Windows's best attempts 
to give us something useful, which at best may be a character shorn of 
diacriticals and at worst something just completely wrong.


NT's system codepage is never UTF-8: it is not even a user-selectable 
option, never mind the default. We can improve our chances of getting more 
characters through by using a character set with a wide repertoire, such 
as cp932 (Shift-JIS). But it's still not really proper Unicode support.


If the app is expecting something non-UTF-8 there's not much hope. Even 
if it wanted the same character set as the system codepage, it can't be 
sure that the submitted bytes didn't happen to also be a valid UTF-8 
sequence, and thus get mangled by IIS decoding them that way.



*** IIS/NT/Python3
OK, as long as the app wants UTF-8.

Incoming UTF-8 bytes are reliably converted to Unicode strings by IIS, 
and directly read by Python from the envvars.


If the application didn't want UTF-8 the situation is about as hopeless 
as with Python2.



*** wsgiref.simple_server/(any)/Python2
OK.

Bytes all the way through.


*** wsgiref.simple_server/(any)/Python3:
Probably will be OK, as long as the app wants UTF-8.

simple_server is currently broken in rc2. However judging by the code, 
it is using urllib.parse.unquote, which assumes UTF-8, so it'll be fine 
for apps that want UTF-8 and hopeless for those that don't.
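
The UTF-8 assumption in `urllib.parse.unquote` is easy to see in isolation (Python 3). The paths below are made-up examples:

```python
from urllib.parse import unquote, unquote_to_bytes

# A path that really was UTF-8 decodes as expected.
utf8_path = unquote("/caf%C3%A9")

# A Latin-1 encoded path is not valid UTF-8; the default errors='replace'
# silently turns the byte into U+FFFD.
wrong_guess = unquote("/caf%E9")

# The application can only get this right if it supplies the charset itself...
right_guess = unquote("/caf%E9", encoding="iso-8859-1")

# ...or is handed the bytes before any decoding happens.
raw = unquote_to_bytes("/caf%E9")
```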



I'd be very interested to hear what other servers are doing in this 
situation - nginx? cherrypy's one? - and wonder if any particular 
behaviour should be 'blessed'.


--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] problem with wsgiref.util.request_uri and decoded uri

2008-09-10 Thread Andrew Clover

Manlio Perillo wrote:


On the other hand, if the WSGI gateway *do* decode the uri,
I can no more handle '/' in uri. 


Correct. CGI requires that '%2F' is decoded, and hence indistinguishable 
from '/' when it gets to the application. And WSGI inherits CGI's flaws 
for compatibility.


request_uri is doing the right thing in assuming that if you got a '%40' 
in your PATH_INFO, it must originally have been a '%2540'.
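
Both halves of this can be shown with `urllib.parse` (Python 3). This is a sketch of the general behaviour, not of `wsgiref.util.request_uri` itself, and the paths are invented:

```python
from urllib.parse import quote, unquote

# What a CGI-style gateway does: '%2F' becomes an ordinary slash.
decoded = unquote("/files/a%2Fb")
segments = decoded.split("/")  # the encoded slash now splits the segment in two

# Rebuilding a URI from a decoded path: a literal '%' must have been sent as '%25'.
requoted = quote("/user%40example")
```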


It is an irritating limitation, but so far not irritating enough for an 
optional workaround to have made its way into non-CGI-based WSGI servers.


It may become a bigger irritation as we move to Py3K, and get stuck with 
decoded top-bit-set characters being turned into Unicode using the 
system encoding (which is likely to be wrong). Windows already suffers 
from similar problems as its environment variables are natively Unicode, 
and its system encoding is never UTF-8 (which is the most likely 
encoding for path parts).



Where can I find informations about alternate encoding scheme?


It's easy enough to roll your own. For example htmlform uses a scheme of 
encoding path parts to '+XX' instead of '%XX'.


import re

# Characters outside this safe set are encoded as '+XX' (two hex digits).
encode_re = re.compile('[^-_.!~*()\'0-9a-zA-Z]')
decode_re = re.compile(r'\+([0-9a-fA-F]{2})')

def encode(s):
    return encode_re.sub(lambda m: '+%02X' % ord(m.group()), s)

def decode(s):
    return decode_re.sub(lambda m: chr(int(m.group(1), 16)), s)

--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
James Y Knight wrote:

 In addition, I know of nobody who actually implements RFC 2047
 decoding of http header values...nothing really uses it. (of
 course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. 
Most browsers, when quoting a header parameter, simply encode using the 
previous page's charset and put quotes around it... even if the 
parameter has a quote or control codes in it.

Ian wrote:

  Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This 
is correct for Windows where environment variables are explicitly 
Unicode, but questionable (IMO) for Unix where they're really bytes that 
may or may not represent decodeable Unicode strings.

 SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these 
are passed in environment variables, IIS* has to decode the submitted 
bytes to Unicode first. It seems always to choose UTF-8 for this job, 
which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then 
has to encode them back to bytes, which I believe ends up using the 
system codepage. Since the system codepage is never UTF-8 on Windows 
this means not only that the bytes read back from eg. PATH_INFO are not 
the same as the original bytes submitted to the web server, but that if 
there are characters outside the system codepage submitted, they'll be 
unrecoverable.

If os.environ remains Unicode in Unix and WSGI follows it (as it must if 
CGI-invoked WSGI is to continue working smoothly), webapps that try to 
allow for non-ASCII characters in URLs are likely to get some nasty 
deployment problems that depend on the system encoding setting, 
something that will be particularly troublesome for end-users to debug 
and fix.

OTOH making the dictionaries reflect the underlying OS's conception of 
environment variables means users of os.environ and WSGI will have to be 
able to cope with both bytes and unicode, which would also be a big 
annoyance.

In summary: urgh, this is all messy and 'orrible.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/


Re: [Web-SIG] WSGI, Python 3 and Unicode

2007-12-07 Thread Andrew Clover
Adam Atlas [EMAIL PROTECTED] wrote:

 I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and 
unicode characters should be done at a higher level, by whatever web 
framework is living above WSGI.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/