Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-20 Thread Henry Precheur
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
 However, we quite often use only a portion of the URI when attempting
 to locate an appropriate handler; sometimes just the leading /
 character! The remaining characters are often passed as function
 arguments to the handler, or stuck in some parameter list/dict. In
 many cases, the charset used to decode these values either: is
 unimportant; follows complex rules from one resource to another; or is
 merely reencoded, since the application really does care about bytes
 and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
 environ entry to declare the charset which was used to decode) can
 handle all of these cases. Server configuration options cannot, at
 least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI
here, but all request headers)


Decoding everything using ISO-8859-1 has the nice property of keeping
the information intact. It would be a good heuristic if everything, with a
few exceptions, was encoded using ISO-8859-1: just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.
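
A minimal sketch of that round-trip property (not from the original
message; the byte string is a made-up example): decoding as ISO-8859-1
never fails and never loses bytes, so the application can still transcode
afterwards if it knows the real charset.

```python
# Hypothetical raw value: the UTF-8 bytes for "/fran\u00e7ois".
raw = b"/fran\xc3\xa7ois"

# Decoding as ISO-8859-1 cannot fail: every byte maps to one code point.
as_latin1 = raw.decode("iso-8859-1")

# The original bytes are fully recoverable from the str.
recovered = as_latin1.encode("iso-8859-1")

# An application that knows the real charset can then transcode.
as_text = recovered.decode("utf-8")

print(ascii(as_latin1))  # mojibake, but lossless
print(ascii(as_text))
```

The intermediate string is not meaningful text, which is exactly the
concern raised in the following paragraphs.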


But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same, so much the same
that it was possible to mix them without too many problems:

  >>> 'foo' == u'foo'
  True

Python 3 made bytes and string 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

  >>> b'foo' == 'foo'
  False

By passing `str` objects to the application, the application author could
believe that the encoding problem has been handled. But in most cases it
hasn't been handled at all: the application author still has to
transcode every incorrectly decoded string. We are back to Python 2's
bad old days, where we can't be sure that what we got is properly
encoded:

  Was that string encoded using latin-1? Maybe a middleware transcoded
  it to UTF-8 before the application was called. Maybe the application
  itself transcoded it at some point, but then we need to keep track of
  what was transcoded. Maybe the application should transcode everything
  when it is called.

Also EVERY application author will have to read the PEP, especially the
paragraph saying:

   Everything we give you are strings, but you still have to deal
   with the encoding mess.

Otherwise they will run into the same weird problems as with Python 2,
because the interface is not clear. Strings are supposed to be text and
only text. Decoding everything as ISO-8859-1 means strings are not text
anymore, they are 'encoded data' [1].


bytes are supposed to be 'encoded data' and binary blobs. By giving
applications bytes, the author knows right away he should decode them.
No need to read the PEP.


`bytes` can do everything `str` can do with the notable exception of
'format'.

  >>> import re

  >>> b'foo bar'.title()
  b'Foo Bar'

  >>> b'/foo/bar/fran\xc3ois'.split(b'/')
  [b'', b'foo', b'bar', b'fran\xc3ois']

  >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
  (b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half baked solution.


Using bytes also has its own set of problems. The standard library doesn't
support bytes very well. For example urllib.parse.unquote() doesn't
work with bytes, and the rest of urllib.parse has issues too.
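
For what it's worth, a sketch of staying in bytes while percent-decoding.
`urllib.parse.unquote_to_bytes()` is a standard-library function that
accepts `str` or `bytes` and returns `bytes`; the sample path below is
made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical percent-encoded path, as bytes.
quoted = b"/fran%C3%A7ois/caf%E9"

# The result stays bytes, so no charset has to be guessed at this point.
raw = unquote_to_bytes(quoted)

# The application decides how (and whether) to decode each segment.
segments = raw.split(b"/")
print(segments)
```

Note that the two segments here use different encodings (UTF-8 and
latin-1), which bytes can carry without complaint.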

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

-- 
  Henry Prêcheur
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread Robert Brewer
I wrote:
 Applications do produce URI's (and IRI's, etc. that need to be
 converted into URI's) and do transfer them in media types like
 HTML, which define how to encode a.href's and form.action's
 before %-encoding them [4]. But these are not the only vectors
 by which clients obtain or generate Request-URI's.
 ...
 As someone (Alan Kennedy?) noted at PyCon, static resources may
 depend upon a filename encoding defined by the OS which is
 different than that of the rest of the URI's generated/understood
 by even the most coherent application.
 ...
 In practical terms, character-by-character comparisons should be
 done codepoint-by-codepoint after conversion to a common character
 encoding. In other words, the URI spec seems to imply that the
 two URI's /a%c3%bf and /a%ff may be equivalent, if the former
 is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF"
 encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about
 this, since all environ values must be byte strings. IMO WSGI
 2 should do better in this regard.
 ...
 For the three reasons above, I don't think we can assume that the
 application will always receive equivalent URI's encoded in a
 single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread P.J. Eby

At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:

Did I say 3 reasons? I meant 4: Accept-Charset.


Chief amongst the reasons...  amongst our reasonry...  Right, we'll 
come in again.  ;-)




[Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-16 Thread Robert Brewer
I wrote:
 PATH_INFO and QUERY_STRING are ... decoded via a configurable
 charset, defaulting to UTF-8. If the path cannot be decoded
 with that charset, ISO-8859-1 is tried. Whichever is successful
 is stored at environ['REQUEST_URI_ENCODING'] so middleware and
 apps can transcode if needed.

and Ian replied:
 My understanding is that PATH_INFO *should* be UTF-8 regardless of
 what encoding a page might be in.  At least that's what I got when
 testing Firefox.  It might not be valid UTF-8 if it was manually
 constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the 
encoding of the document [1] or Windows-1252 [2] for the querystring. But the 
vast majority of HTTP user agents are not browsers [3]. Even if that were not 
so, we should not define WSGI to only interoperate with the most current 
browsers.

and Graham added:
 Thinking about it for a while, I get the feel that having a fallback
 to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
 URLs wouldn't consistently use the same encoding all the time just
 seems wrong. I would see it as returning a bad request status. If an
 application coder knows they are actually going to be dealing with
 latin-1, as that is how the application is written, then they should
 be specifying it should be latin-1 always instead of utf-8. Thus, the
 WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into 
URI's) and do transfer them in media types like HTML, which define how to 
encode a.href's and form.action's before %-encoding them [4]. But these are not 
the only vectors by which clients obtain or generate Request-URI's.

 For simple WSGI adapters which only service one WSGI application, then
 it would apply to whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a 
filename encoding defined by the OS which is different than that of the rest of 
the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI 
comparison. Comparison is at the heart of handler dispatch, static resource 
identification, and proper HTTP cache operation. It is for these reasons that 
RFC 3986 has an extensive section on the matter [5], including a ladder of 
approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com ==
   http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous
   dereferencing showed it to be)
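
As a rough illustration of the middle rungs of that ladder, here is a
simplified sketch of case, percent-encoding, and path segment
normalization for an absolute path. The function names are hypothetical,
and the dot-segment handling is a simplification of RFC 3986 section
5.2.4, not the exact algorithm.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_escapes(path):
    """Uppercase hex digits in escapes; decode escapes of unreserved chars."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, path)


def remove_dot_segments(path):
    """Simplified removal of '.' and '..' segments for absolute paths."""
    output = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(output) > 1:  # never pop the leading empty segment
                output.pop()
            continue
        output.append(segment)
    return "/".join(output) or "/"


def normalize_path(path):
    """Apply escape normalization first, then dot-segment removal."""
    return remove_dot_segments(normalize_escapes(path))


print(normalize_path("/a%3d"), normalize_path("/a%62c"),
      normalize_path("/abc/../def"))
```

None of these steps needs to know the charset of the path, which matches
the point made below that only the first rung requires decoding.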

I think it would be beneficial to those who develop WSGI application interfaces 
to be able to assume that at least case-, percent-, path-, and 
scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by 
all WSGI 2 origin servers.

All of those except for the first one can be accomplished without decoding the
target URI. But that first section specifically states: "In practical terms,
character-by-character comparisons should be done codepoint-by-codepoint after
conversion to a common character encoding." In other words, the URI spec seems
to imply that the two URI's /a%c3%bf and /a%ff may be equivalent, if the
former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in
ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ
values must be byte strings. IMO WSGI 2 should do better in this regard.
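
The equivalence of those two URIs can be checked with a short sketch. The
helper name `to_text` is hypothetical; `unquote_to_bytes` is the
standard-library function.

```python
from urllib.parse import unquote_to_bytes


def to_text(percent_encoded, charset):
    """Percent-decode to bytes, then decode the bytes with the given charset."""
    return unquote_to_bytes(percent_encoded).decode(charset)


utf8_form = to_text("/a%c3%bf", "utf-8")
latin1_form = to_text("/a%ff", "iso-8859-1")

# Both forms denote the same code points, although the raw bytes differ.
print(utf8_form == latin1_form)
```

The comparison only succeeds because each URI was decoded with the right
charset, which is information the server does not generally have.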

 For something like Apache, where it could map to multiple WSGI
 applications, then it may want to provide a means of overriding the
 encoding for specific subsets of URLs, i.e., using the Location
 directive for example.

For the three reasons above, I don't think we can assume that the application 
will always receive equivalent URI's encoded in a single, foreseen encoding. 
Yet we still haven't answered the question of how to handle unforeseen 
encodings. You're right that, if the server-side stack as a whole cannot map a 
particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 
over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate 
an appropriate handler; sometimes just the leading / character! The remaining 
characters are often passed as function arguments to the handler, or stuck in 
some parameter list/dict. In many cases, the charset used to decode these 
values either: is unimportant; follows complex rules from one resource to 
another; or is merely reencoded, since the application really does care about 
bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI 
environ entry to declare the charset which was used to decode) can handle all 
of these cases. Server configuration options cannot, at least 

Re: [Web-SIG] WSGI 2

2009-08-13 Thread Henry Precheur
On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
 Correct -- you can write any set of % encodings, and I don't think it even
 has to be able to validly url-decode (e.g., /foo%zzz will work).  It
 definitely doesn't have to be a valid encoding.  However, if you actually
 include unicode characters, they will always be encoded as UTF-8 (as goes
 with the IRI standard).  This is in a case like <a href="/some page">, the
 browser will request /some%20page, because it escapes unsafe characters.
  Similarly if you request <a href="/français"> it will encode that ç in
 UTF-8, then url-encode it, even if the page itself is ISO-8859-1.  Well, at
 least on Firefox.  I used this to test:
 http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

I have run some tests regarding the encoding issue:

curl doesn't 'url-encode' its URLs:

  curl 'http://hostname/français'
                            ^
                            e7 latin-1 character

The latin-1 character is sent to the server. Lighttpd accepts the URL
and even returns a file if it exists. Of course if I try with the same
characters in UTF-8 it doesn't work.

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is that
libcurl is quite popular (it used to be the transport library of
Webkit/GTK+ for example). It's hard to dismiss it as an utterly broken,
obscure tool. Many 'simplistic' HTTP clients may have the same problem.
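
For comparison, a small sketch of what the two percent-encoded forms of
that path would look like, using the standard `urllib.parse.quote()`. The
path is the same example as above; curl sent the raw byte rather than
either escaped form.

```python
from urllib.parse import quote

path = "/fran\u00e7ais"

# What a browser following the IRI rules would request (UTF-8 escapes).
utf8_escaped = quote(path)

# The percent-encoded equivalent of the raw latin-1 byte that curl sent.
latin1_escaped = quote(path, encoding="iso-8859-1")

print(utf8_escaped)    # /fran%C3%A7ais
print(latin1_escaped)  # /fran%E7ais
```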


Now let's talk a little bit about cookies...

Cookies can contain whatever 'binary junk' the server sends. RFC 2965
says (http://tools.ietf.org/html/rfc2965#page-5):

 The VALUE is opaque to the user agent and may be anything the origin
 server chooses to send, possibly in a server-selected printable ASCII
 encoding.

Also, cookies can contain 'comments' which contain UTF-8 strings
(http://tools.ietf.org/html/rfc2965#page-6):

 Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It
looks like it assumes cookies are encoded using latin-1, since latin-1
characters are displayed correctly in Firebug, but not UTF-8 ones.


Cheers,

-- 
  Henry Prêcheur


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Henry Precheur he...@precheur.org:
 Using bytes for all `environ` values is easy to understand on the
 application side as long as you are aware of the encoding problem. The
 cost is inconvenience, but that's probably OK. It's also simpler to
 implement on the gateway/server side.

Use of bytes everywhere can be inconvenient on the gateway/server
side, at least as far as end result for user.

The specific problem is that WSGI environment is used to hold
information about the original request, as CGI variables, but also can
hold user specified custom variables.

In the case of anything hosted via Apache, such as through mod_wsgi,
mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
custom variables using the SetEnv directive. Thus one might say:

  SetEnv trac.env_path /usr/local/trac/site-1

If the rule is that everything in WSGI environment coming from WSGI
adapter must be bytes then you have a potential for mismatch in
expectations of how values will be passed. That is, if set using
SetEnv then would be bytes, but if set using WSGI middleware wrapper
for configuration, more likely going to be string. It would seem
overly onerous to expect WSGI middleware to use bytes for
configuration variables as well and so force all consumers to always
be converting to string using appropriate encoding, where required
encoding potentially unknown.

The underlying problem here is in part, albeit maybe from convention,
that there is a single dictionary for both request information and
user configuration. It isn't though a simple matter of splitting them
either so that request information is always separate. This is because
for FASTCGI, SCGI, CGI, you can't split them as only one grouping in
those cases.

This is why I specifically asked previously, and which no one has
answered, if bytes is to be used, which variables in WSGI environment
should be passed as bytes. If there is a known specified list of
variables which it is known will always be bytes, may be more
manageable. If someone is going to suggest that only CGI variables
should be bytes, then what does that actually mean? Remember that for
FASTCGI, SCGI, CGI there isn't really a distinction and so where the
boundary is as to what is a CGI variable is fuzzy, although you could
reverse the transformation and get back bytes if you know which ones
to do it for.

One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
QUERY_STRING and maybe that will suffice. It may not though, because
what about headers such as HTTP_REFERER? Also, what about additional
SSL_? variables that an SSL module for the web server may add?

Graham

 By choosing bytes, WSGI passes the encoding problem to the application,
 which is good. Let the application deal with that. It's more likely to
 know what it needs, and what problem it can ignore. I think that 99% of
 the time, applications will just decode bytes to string using UTF-8,
 ignoring invalid values.

 However it's likely that we'll see middlewares converting ALL
 environment values to UTF-8, because it's more convenient than using
 bytes. And some middlewares might depend on `environ` values being
 string instead of bytes, because it's convenient too.


 This issue was already raised by Graham. And I think it's important to
 make it clear. I believe that 'server/CGI' values in the environment
 shouldn't be modified--Of course it should still be possible to add new
 values. This way the stack will always remain in a 'sane' state.

 For example if a middleware wants to convert environ values to UTF-8, it
 shouldn't do that:

   for key, value in environ.items():
       environ[key] = str(value)

 But something like this--assuming there's only bytes in `environ`:

   environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
                                     for key, value in environ.items())
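
 A slightly more defensive variant of the snippet above, as a sketch only:
 it tolerates values that are already `str` (for example configuration set
 by other middleware, as Graham describes). The function name and the
 choice of `errors="replace"` are assumptions, not something proposed in
 the thread.

```python
def add_unicode_environ(environ, charset="utf-8"):
    """Store decoded copies under a new key; leave the original values alone."""
    decoded = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            # errors="replace" is an assumption; a stricter policy may fit better.
            decoded[key] = value.decode(charset, "replace")
        else:
            decoded[key] = value
    environ["unicode.environ"] = decoded
    return environ


# Hypothetical environ mixing server-provided bytes and a str config value.
environ = add_unicode_environ({
    "PATH_INFO": b"/caf\xc3\xa9",
    "trac.env_path": "/usr/local/trac/site-1",
})
print(ascii(environ["unicode.environ"]["PATH_INFO"]))
```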

 I'm in favor of using bytes everywhere. But it's important to document
 why bytes are used and how to use them. I'm not sure this should be
 included in a PEP, maybe a WSGI best practices?


 Cheers,

 --
  Henry Prêcheur



Re: [Web-SIG] WSGI 2

2009-08-11 Thread Robert Brewer
Graham Dumpleton wrote:
 So, for WSGI 1.0 style of interface and Python 3.0, the following is
 what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

 1. When running under Python 3, applications SHOULD produce bytes
 output, status line and headers.

Yup.

 2. When running under Python 3, servers and gateways MUST accept
 strings for output, status line and headers. Such strings must be
 converted to bytes output using 'latin-1'. If string cannot be
 converted then is treated as an error.

Yes.
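
A minimal sketch of the rule in point 2, assuming a hypothetical helper
name (this is not CherryPy's or mod_wsgi's actual code): bytes pass
through untouched, strings are encoded as latin-1, and anything that
cannot be encoded surfaces as an error.

```python
def to_wire_bytes(value):
    """Return bytes for output, the status line, or a header name/value."""
    if isinstance(value, bytes):
        return value
    # Raises UnicodeEncodeError for characters outside latin-1,
    # which the server would then treat as an error.
    return value.encode("latin-1")


print(to_wire_bytes("200 OK"))
print(to_wire_bytes(b"text/plain"))
```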

 3. When running under Python 3, servers MUST provide wsgi.input as a
 binary (byte) input stream.

Boy howdy.

 4. When running under Python 3, servers MUST provide a text stream for
 wsgi.errors. In converting this to a byte stream for writing to a
 file, the default encoding would be applied.

I'll look into it.

 5. When running under Python 3, servers MUST provide CGI HTTP and
 server variables as strings. Where such values are sourced from a byte
 string, be that a Python byte string or C string, they should be
 converted as 'UTF-8'. If a specific web server infrastructure is able
 to support different encodings, then the WSGI adapter MAY provide a
 way for a user of the WSGI adapter to customise on a global basis, or
 on a per value basis what encoding is used, but this is entirely
 optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must 
be ascii-decodable. So are SERVER_PROTOCOL and our custom 
ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, 
PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable 
charset, defaulting to UTF-8. If the path cannot be decoded with that charset, 
ISO-8859-1 is tried. Whichever is successful is stored at 
environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. 
Our origin server always sets SCRIPT_NAME to '', but if we populated it, we 
would make it decoded by the same charset.
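
Roughly, the fallback described above could be sketched like this. The
function names are hypothetical and this is not the actual CherryPy
source; only the environ key names come from the description.

```python
def decode_request_path(raw_path, charset="utf-8"):
    """Try the configured charset, then ISO-8859-1 (which cannot fail)."""
    for candidate in (charset, "iso-8859-1"):
        try:
            return raw_path.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: ISO-8859-1 decodes any byte string")


def set_path_info(environ, raw_path):
    """Store the decoded path and record which charset succeeded."""
    path, used = decode_request_path(raw_path)
    environ["PATH_INFO"] = path
    environ["REQUEST_URI_ENCODING"] = used
    return environ


print(set_path_info({}, b"/fran\xe7ais")["REQUEST_URI_ENCODING"])
```

An application that disagrees with the recorded charset can re-encode
PATH_INFO with it to recover the bytes, then decode them as it sees fit.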

All request headers are decoded via ISO-8859-1, which can't fail. Applications 
are expected to transcode these values if they believe them to be in another 
encoding.

 This is where I am going to diverge from what has been discussed before.
 
 The reason I am going to pass as UTF-8 and not latin-1 is that it
 looks like Apache effectively only supports use of UTF-8. Since this
 means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
 even CGI likely cannot handle anything besides UTF-8 then I really
 can't see the point of trying to cater for a theoretical possibility
 that some HTTP client could use something besides UTF-8. In other
 words, the predominant case will be UTF-8, so let us target that.

That is predominant for the Request-URI, and we are defaulting to utf-8 for 
that as I mentioned above. I believe I demonstrated in 
http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8 
cannot be the predominant encoding for request headers, which are instead 
mostly ASCII with a few ISO-8859-1's, which is why we are defaulting to 
ISO-8859-1.

 So, rather than burden every WSGI application with the need to convert
 from latin-1 back to bytes and then to UTF-8, let the server deal with
 it, with server using sensible default, and where server
 infrastructure can handle a different encoding, then it can provide
 option to use that encoding and WSGI application doesn't need to
 change.

If there are indeed more headers which are ISO-8859-1, then that same argument 
cuts both ways.

I have no problem doing the same thing here as we do for PATH_INFO: a 
configurable charset, or better yet a list of charsets to try in order, with a 
sensible default, even UTF-8 would be fine. Regardless of the default, if it is 
configurable, then the successful encoding should be put in a canonical environ 
entry so apps can transcode it if the server got it wrong.

Re:bytes. We really do not want the server to set any of the above environ 
entries (except REQUEST_URI) to bytes. I'm surprised those of you who have 
substantial numbers of WSGI middleware aren't fighting this; it would mean 
decoding the same environ entries every time you switched middleware providers. 
Some of you said as much at PyCon: 
http://mail.python.org/pipermail/web-sig/2009-March/003701.html


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Ian Bicking
On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer fuman...@aminus.org wrote:

   5. When running under Python 3, servers MUST provide CGI HTTP and
  server variables as strings. Where such values are sourced from a byte
  string, be that a Python byte string or C string, they should be
  converted as 'UTF-8'. If a specific web server infrastructure is able
  to support different encodings, then the WSGI adapter MAY provide a
  way for a user of the WSGI adapter to customise on a global basis, or
  on a per value basis what encoding is used, but this is entirely
  optional. Note that there is no requirement to deal with RFC 2047.

 We're passing unicode for almost everything.

 REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
 must be ascii-decodable. So are SERVER_PROTOCOL and our custom
 ACTUAL_SERVER_PROTOCOL entries.

 The original bytes of the Request-URI are stored in REQUEST_URI. However,
 PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
 configurable charset, defaulting to UTF-8. If the path cannot be decoded
 with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
 environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
 needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
 it, we would make it decoded by the same charset.


My understanding is that PATH_INFO *should* be UTF-8 regardless of what
encoding a page might be in.  At least that's what I got when testing
Firefox.  It might not be valid UTF-8 if it was manually constructed, but
then there's little reason to think it is valid anything; only the bytes or
REQUEST_URI are likely to be an accurate representation.  (Frankly I wish
PATH_INFO was not url-decoded, which would remove this issue entirely --
REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
know of reasonable cases where this wouldn't be true.)

I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
used to kind of reconstruct the original request path (the surrogateescape
or whatever it is called would serve the same purpose, but is only available
in Python 3).
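
For reference, a short sketch of the `surrogateescape` error handler
mentioned above (a standard Python 3 codec error handler). The byte string
is a made-up example containing a latin-1 byte that is not valid UTF-8.

```python
raw = b"/fran\xe7ais"

# Undecodable bytes become lone surrogates instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

Like the ISO-8859-1 fallback, this keeps the original request path
recoverable, but valid UTF-8 input still decodes to proper text.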

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Ian Bicking i...@colorstudy.com:
 On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer fuman...@aminus.org wrote:

  5. When running under Python 3, servers MUST provide CGI HTTP and
  server variables as strings. Where such values are sourced from a byte
  string, be that a Python byte string or C string, they should be
  converted as 'UTF-8'. If a specific web server infrastructure is able
  to support different encodings, then the WSGI adapter MAY provide a
  way for a user of the WSGI adapter to customise on a global basis, or
  on a per value basis what encoding is used, but this is entirely
  optional. Note that there is no requirement to deal with RFC 2047.

 We're passing unicode for almost everything.

 REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
 must be ascii-decodable. So are SERVER_PROTOCOL and our custom
 ACTUAL_SERVER_PROTOCOL entries.

 The original bytes of the Request-URI are stored in REQUEST_URI. However,
 PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
 configurable charset, defaulting to UTF-8. If the path cannot be decoded
 with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
 environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
 needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
 it, we would make it decoded by the same charset.

 My understanding is that PATH_INFO *should* be UTF-8 regardless of what
 encoding a page might be in. At least that's what I got when testing
 Firefox.  It might not be valid UTF-8 if it was manually constructed, but
 then there's little reason to think it is valid anything; only the bytes or
 REQUEST_URI are likely to be an accurate representation.

As I understood it, PJE was suggesting that wasn't the case.

For example, what about the case where a URL appears as the target of a
form POST and the encoding of that form page wasn't UTF-8? What is the
browser going to send in that case?

Or is this the sort of case you have tested and qualify as saying if
manually constructed anything could happen?

 (Frankly I wish
 PATH_INFO was not url-decoded, which would remove this issue entirely --
 REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
 know of reasonable cases where this wouldn't be true.)
 I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
 used to kind of reconstruct the original request path (the surrogateescape
 or whatever it is called would serve the same purpose, but is only available
 in Python 3).

Graham


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Graham Dumpleton
Ian, know you have seen this before, but didn't realise you hadn't
cc'd the list. I have added a new response to part 4 of what you
originally sent that wasn't in first reply that went direct to you.

2009/8/4 Ian Bicking i...@colorstudy.com:
 On Mon, Aug 3, 2009 at 7:38 PM, Graham
 Dumpleton graham.dumple...@gmail.com wrote:
 So, for WSGI 1.0 style of interface and Python 3.0, the following is
 what I was going to implement.

 1. When running under Python 3, applications SHOULD produce bytes
 output, status line and headers.

 Sure.

 This is effectively what we had before. The only difference is that
 clarify that the 'status line' values should also be bytes. This
 wasn't noted before. I had already updated the proposed WSGI 1.0
 amendments page to mention this.

 2. When running under Python 3, servers and gateways MUST accept
 strings for output, status line and headers. Such strings must be
 converted to bytes output using 'latin-1'. If string cannot be
 converted then is treated as an error.

 This is again what we had before except that mention 'status line' value.

 Sure.  ASCII for the status would be acceptable, as I believe that is
 an HTTP constraint.

 3. When running under Python 3, servers MUST provide wsgi.input as a
 binary (byte) input stream.

 No change here.

 Yep.

 4. When running under Python 3, servers MUST provide a text stream for
 wsgi.errors. In converting this to a byte stream for writing to a
 file, the default encoding would be applied.

 No real change here except to clarify that default encoding would
 apply. Use of default encoding though could be problematic if
 combining different WSGI components. This is because each WSGI
 component may have been developed on system with different default
 encoding and so one may expect to log characters that can't be written
 on a different setup. Not sure how you could solve that except to say
 people have default encoding be UTF-8 for portability.

 Sure.  We might specify that the server should never give an encoding
 error; it should use 'replace' or something to make sure it won't
 fail.  Maybe it should be specified what should happen when bytes are
 received.  I generally believe that error handling code should try to
 be as robust as possible, so it shouldn't fail regardless of what it
 is given.

Not that it matters, but looks like that for Apache/mod_wsgi
wsgi.errors should be an instance of io.TextIOWrapper wrapping
internal mod_wsgi specific buffer object providing interface
compatible with io.BufferedIOBase. If someone uses write() on wrapper
with bytes it will fail:

  TypeError: write() argument 1 must be str, not bytes

If someone uses print() to output data, then bytes would be converted
okay, since print() writes the str() of its argument. That is:

  print(b'1234', file=environ['wsgi.errors'])

yields:

  b'1234'

If 'replace' is used for errors, you do end up with data loss. Use of
'backslashreplace' at least preserves values as numbers, although for
Apache at least, if you use 'ascii' encoding, you get a bit of a mess as
the backslashes get escaped again.

\\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10

instead of original:

\u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10

That is because Apache logging functions escape anything which isn't
printable ASCII and in turn escapes backslash denoting escaped
character.

If you use an encoding of utf-8 instead, then byte values get passed and
Apache logging functions then just escape the non printable bytes
instead, so all up it looks nicer.

\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
\xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90

So for Apache/mod_wsgi at least, the best thing to do seems to be to use
'replace' and 'utf-8' due to the way that Apache error logging functions
work.

I guess the point from this is that possibly should specify that
wsgi.errors should be an instance of io.TextIOWrapper. A specific
implementation should not use 'strict', but use 'replace' or
'backslashreplace' as makes sense, dependent on what encoding it needs
to use and how any underlying logging system it overlays works. The
intent overall being to preserve as much of raw information as
possible.
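
To make the trade-off between error handlers concrete, here is a minimal comparison on the first three characters of the Thai sample. It only exercises str.encode; how a particular logging backend re-escapes the result afterwards is outside the sketch, and the variable names are illustrative.

```python
# First three characters of the Thai sample text quoted above.
text = "\u0e40\u0e2d\u0e01"

# 'replace' substitutes '?' and loses the original values.
lossy = text.encode("ascii", "replace")

# 'backslashreplace' keeps the code points as \uXXXX escapes.
escaped = text.encode("ascii", "backslashreplace")

# 'xmlcharrefreplace' keeps them as decimal character references.
numeric = text.encode("ascii", "xmlcharrefreplace")

# With UTF-8 nothing needs replacing; the raw bytes carry everything.
utf8 = text.encode("utf-8", "replace")
```

Only the 'replace' variant discards information; the other three can be mapped back to the original characters.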

 5. When running under Python 3, servers MUST provide CGI HTTP and
 server variables as strings. Where such values are sourced from a byte
 string, be that a Python byte string or C string, they should be
 converted as 'UTF-8'. If a specific web server infrastructure is able
 to support different encodings, then the WSGI adapter MAY provide a
 way for a user of the WSGI adapter to customise on a global basis, or
 on a per value basis what encoding is used, but this is entirely
 optional. Note that there 

Re: [Web-SIG] WSGI 2

2009-08-04 Thread Etienne Robillard
Graham Dumpleton wrote:
 Ian, know you have seen this before, but didn't realise you hadn't
 cc'd the list. I have added a new response to part 4 of what you
 originally sent that wasn't in first reply that went direct to you.
 
 2009/8/4 Ian Bicking i...@colorstudy.com:
 On Mon, Aug 3, 2009 at 7:38 PM, Graham
 Dumpleton graham.dumple...@gmail.com wrote:
 So, for WSGI 1.0 style of interface and Python 3.0, the following is
 what I was going to implement.

 1. When running under Python 3, applications SHOULD produce bytes
 output, status line and headers.
 Sure.

 This is effectively what we had before. The only difference is that
 it clarifies that the 'status line' values should also be bytes. This
 wasn't noted before. I had already updated the proposed WSGI 1.0
 amendments page to mention this.

 2. When running under Python 3, servers and gateways MUST accept
 strings for output, status line and headers. Such strings must be
 converted to bytes output using 'latin-1'. If string cannot be
 converted then is treated as an error.

 This is again what we had before except that mention 'status line' value.
 Sure.  ASCII for the status would be acceptable, as I believe that is
 an HTTP constraint.

 3. When running under Python 3, servers MUST provide wsgi.input as a
 binary (byte) input stream.

 No change here.
 Yep.

 4. When running under Python 3, servers MUST provide a text stream for
 wsgi.errors. In converting this to a byte stream for writing to a
 file, the default encoding would be applied.

 No real change here except to clarify that default encoding would
 apply. Use of default encoding though could be problematic if
 combining different WSGI components. This is because each WSGI
 component may have been developed on system with different default
 encoding and so one may expect to log characters that can't be written
 on a different setup. Not sure how you could solve that except to say
 people have default encoding be UTF-8 for portability.
 Sure.  We might specify that the server should never give an encoding
 error; it should use 'replace' or something to make sure it won't
 fail.  Maybe it should be specified what should happen when bytes are
 received.  I generally believe that error handling code should try to
 be as robust as possible, so it shouldn't fail regardless of what it
 is given.
 
 Not that it matters, but looks like that for Apache/mod_wsgi
 wsgi.errors should be an instance of io.TextIOWrapper wrapping
 internal mod_wsgi specific buffer object providing interface
 compatible with io.BufferedIOBase. If someone uses write() on wrapper
 with bytes it will fail:
 
   TypeError: write() argument 1 must be str, not bytes
 
 If someone uses print() to output data, then bytes are converted
 okay. That is:
 
   print(b'1234', file=environ['wsgi.errors'])
 
 yields:
 
   b'1234'.
 
 If 'replace' is used for errors, you do end up with data loss. Use of
 'xmlcharrefreplace' at least preserves values as numbers, although for
 Apache at least, if you use 'ascii' encoding, you get a bit of a mess as
 the backslashes get escaped again.
 
 \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10
 
 instead of original:
 
 \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10
 
 That is because Apache logging functions escape anything which isn't
 printable ASCII and in turn escape the backslash denoting an escaped
 character.
 
 If you use an encoding of utf-8 instead, then byte values get passed and
 Apache logging functions then just escape the non-printable bytes
 instead, so all up it looks nicer.
 
 \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
 \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90
 
 So for Apache/mod_wsgi at least, the best thing to do seems to be to use
 'replace' and 'utf-8' due to the way that Apache error logging functions
 work.
 
 I guess the point from this is that the spec should possibly state that
 wsgi.errors should be an instance of io.TextIOWrapper. A specific
 implementation should not use 'strict', but use 'replace' or
 'backslashreplace' as makes sense, dependent on what encoding it needs
 to use and how any underlying logging system it overlays works. The
 intent overall being to preserve as much of raw information as
 possible.
 
 5. When running under Python 3, servers MUST provide CGI HTTP and
 server variables as strings. Where such values are sourced from a byte
 string, be that a Python byte string or C string, they should be
 converted as 'UTF-8'. If a specific web server infrastructure is able
 to support different encodings, then the WSGI adapter MAY provide a
 way for a user of the WSGI adapter to customise on a global basis, or
 on a per value 

Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote:

2009/8/4 P.J. Eby p...@telecommunity.com:
 I'm not clear on your logic here.  If I request foo/bar/baz (where baz
 actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
 script, then the (accented) baz is legitimate for pass-through to the
 application, no?

Technically, but what I am pointing out is that Apache pretty well
says that foo/bar needs to be UTF-8.


Which doesn't change the fact that you haven't yet proposed what a 
WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and 
QUERY_STRING.  Apache can and does pass through such bytes, so the 
spec needs to say what we do with them.
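
One candidate answer to the open question, sketched under explicit assumptions: try UTF-8 first and fall back to ISO-8859-1, reporting which charset was used (the fallback-plus-declared-charset idea was floated elsewhere in this discussion). The function name decode_path and the returned tuple are hypothetical; nothing here is agreed spec behaviour.

```python
from urllib.parse import unquote_to_bytes

def decode_path(raw_path):
    """Percent-decode a request path and decode it, reporting the charset used.

    The UTF-8-then-ISO-8859-1 policy is only one possible behaviour,
    not something the thread settled on.
    """
    data = unquote_to_bytes(raw_path)
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte maps to a code point in ISO-8859-1, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"

# 'baz' with an accented 'a', sent as UTF-8 and as Latin-1 respectively.
utf8_result = decode_path("/foo/bar/b%C3%A1z")
latin1_result = decode_path("/foo/bar/b%E1z")
```

Both requests end up as the same text, but only the second one needed the fallback, which is the information an application would need to re-encode the value correctly.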




 If you are going to have
different parts of the one URL needing a different encoding to be
understood, personally I would say you are asking for trouble. So, I am
saying that UTF-8 really needs to apply, more for the sake of sanity and
portability.


So what, precisely, are you proposing should happen when such bytes 
are present?




So I guess the problem is more where URLs are already % encoded when
coming back as href or form action because they may be in an encoding
incompatible with UTF-8 if it were to be clicked on.


Yep, that's the case with standard browsers and servers; 
less-standard situations such as spiders and scripts generating or 
following URLs are also relevant, as are deliberate hack 
attempts.  So having the result of this behavior be undefined is a bad thing.




The Apache server at least will decode those % escape sequences and I
believe it is the result of that which is used in stuff like rewrite
rule matches, not the raw URL. The only exception would be if a rewrite
rule explicitly matched against the REQUEST_URI variable, which still
contains % escape sequences. So if they are not in UTF-8, that effectively
means that you can't match them with Apache rewrite rules.


That's got nothing to do with what you propose for WSGI to do with 
the rest of it, though.


(However, your belief may be incorrect in any event, as this page:

   http://www.dracos.co.uk/code/apache-rewrite-problem/

claims that mod_rewrite can RewriteCond on THE_REQUEST in order to 
match still-encoded paths.)


___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:

In summary, what are the practical use cases that would make passing
bytes over UTF-8 or even latin-1 worthwhile?


My concern at this point is a nagging feeling that we are abandoning 
WSGI-HTTP equivalence for convenience in the face of changes in 
Python's defaults.  Had Python 3 been the standard version in 
existence when WSGI 1 was created, I would've argued for making 
*everything* bytes, in order to:


1. Force all encodings to be explicit, and
2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects)

And this is why the original spec said that Unicode strings should be 
treated as bytes -- because byte strings were always the original 
target of the spec.


Please remember that WSGI is not primarily intended to provide 
application developers with a convenient API; its first and most 
important job is to ship the data around without mangling it in the process.


HTTP moves bytes, therefore WSGI should move bytes.  For practical 
reasons, it would be good to *also* support strings on the 
application side, especially for application migration.  However, I 
see no reason to make *servers* provide decoded strings instead of bytes.


So I would ask, what is the practical use case for having the server 
decode bytes into strings, instead of leaving them as bytes?




Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Bennett
On Tue, Aug 4, 2009 at 11:05 AM, P.J. Eby p...@telecommunity.com wrote:
 1. Force all encodings to be explicit, and

This can be handled without forcing application authors to work with
bytestrings (or forcing them to remember to coerce to bytestrings
before returning responses).

 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python
 objects)

TBH, WSGI doesn't expose enough of HTTP's functionality to convince me
that this is a good argument. When I can use advanced HTTP features
(chunked transfer and friends) from a WSGI app, maybe I'll feel
differently.

 Please remember that WSGI is not primarily intended to provide application
 developers with a convenient API; its first and most important job is to
 ship the data around without mangling it in the process.

Which it should try very hard to do without forcing *in*convenient
APIs onto developers.

 So I would ask, what is the practical use case for having the server decode
 bytes into strings, instead of leaving them as bytes?

Well, Django (for one example) already does some gymnastics to ensure
that character encoding issues are kept at the request/response
boundary, largely because it's an utter pain for an application
developer to have an API dump a bunch of bytestrings in your lap and
say here, *you* figure it out. I suspect we're going to keep on
doing that, since it's a big win in terms of usability for application
developers (who end up having to deal with only a drastically-reduced
subset of character-encoding problems).


-- 
Bureaucrat Conrad, you are technically correct -- the best kind of correct.


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Jim Fulton
On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby p...@telecommunity.com wrote:
 At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:

 In summary, what are the practical use cases that would make passing
 bytes over UTF-8 or even latin-1 worthwhile?

 My concern at this point is a nagging feeling that we are abandoning
 WSGI-HTTP equivalence for convenience in the face of changes in Python's
 defaults.  Had Python 3 been the standard version in existence when WSGI 1
 was created, I would've argued for making *everything* bytes, in order to:

 1. Force all encodings to be explicit, and
 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python
 objects)

 And this is why the original spec said that Unicode strings should be
 treated as bytes -- because byte strings were always the original target of
 the spec.

 Please remember that WSGI is not primarily intended to provide application
 developers with a convenient API; its first and most important job is to
 ship the data around without mangling it in the process.

 HTTP moves bytes, therefore WSGI should move bytes.  For practical reasons,
 it would be good to *also* support strings on the application side,
 especially for application migration.  However, I see no reason to make
 *servers* provide decoded strings instead of bytes.

+1

I haven't had enough time to follow this and earlier encoding
discussions and so haven't commented up to now, but I've always been
uncomfortable with WSGI using anything but bytes or assuming any
encoding.  I agree that application frameworks should deal with
conversion between bytes and unicode.

Jim

-- 
Jim Fulton


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Y Knight


On Aug 4, 2009, at 12:38 PM, James Bennett wrote:


TBH, WSGI doesn't expose enough of HTTP's functionality to convince me
that this is a good argument. When I can use advanced HTTP features
(chunked transfer and friends) from a WSGI app, maybe I'll feel
differently.


But that works just fine today. Your WSGI app sends streaming data  
back using the iterator functionality, and the server automatically  
turns it into chunks if it's talking to an HTTP 1.1 client. What's the  
problem?
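
For reference, a minimal sketch of the kind of application being described: it returns its body through an iterator and sets no Content-Length, leaving the framing to the server. The names streaming_app and fake_start_response are made up for the example, and whether a given server actually uses chunked transfer-coding here is server-specific.

```python
def streaming_app(environ, start_response):
    """WSGI 1 application that streams its body piece by piece."""
    # No Content-Length header: the server has to choose the framing itself.
    start_response("200 OK", [("Content-Type", "text/plain")])
    for i in range(3):
        yield ("line %d\n" % i).encode("ascii")

# Drive the app by hand, the way a very small server would.
captured = {}

def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers

body = list(streaming_app({}, fake_start_response))
```

Nothing in the application says "chunk this"; the decision is inferred by the server from the missing header, which is the point under dispute in this exchange.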


James


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Tres Seaver
Jim Fulton wrote:
 On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby p...@telecommunity.com wrote:
 At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:
 In summary, what are the practical use cases that would make passing
 bytes over UTF-8 or even latin-1 worthwhile?
 My concern at this point is a nagging feeling that we are abandoning
 WSGI-HTTP equivalence for convenience in the face of changes in Python's
 defaults.  Had Python 3 been the standard version in existence when WSGI 1
 was created, I would've argued for making *everything* bytes, in order to:

 1. Force all encodings to be explicit, and
 2. Ensure WSGI-HTTP equivalence (i.e., WSGI==HTTP encoded in Python
 objects)

 And this is why the original spec said that Unicode strings should be
 treated as bytes -- because byte strings were always the original target of
 the spec.

 Please remember that WSGI is not primarily intended to provide application
 developers with a convenient API; its first and most important job is to
 ship the data around without mangling it in the process.

 HTTP moves bytes, therefore WSGI should move bytes.  For practical reasons,
 it would be good to *also* support strings on the application side,
 especially for application migration.  However, I see no reason to make
 *servers* provide decoded strings instead of bytes.
 
 +1
 
 I haven't had enough time to follow this and earlier encoding
 discussions and so haven't commented up to now, but I've always been
 uncomfortable with WSGI using anything but bytes or assuming any
 encoding.  I agree that application frameworks should deal with
 conversion between bytes and unicode.

+1 from me as well.  The fact that Python3 now calls 'string' what used
to be 'unicode' doesn't change the fact that transport-level
operations have to be done in bytes.  It should be the framework /
application's job to handle conversion of byte inputs from the request
onto strings, and string response fields onto bytes:  ideally, the
framework will do this in a way which keeps the application writer
blissfully ignorant of the distinction.

Note that I think Python3 gets the os.environ bit wrong for exactly the
same reasons:  I think anybody wanting to use the
environment-as-provided-by-the-OS should deal in bytes (or whatever the
OS provides), with a convenience wrapper for those who don't care about
the difference.  I lost that argument, but that doesn't mean I was wrong. :)


Tres.
-- 
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Design   http://palladion.com


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Bennett
On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight f...@fuhm.net wrote:
 But that works just fine today. Your WSGI app sends streaming data back
 using the iterator functionality, and the server automatically turns it into
 chunks if it's talking to an HTTP 1.1 client. What's the problem?

No, it doesn't work just fine today. Either the server has to assume
that every response from that application should be chunked (which is
wrong), or the application needs a way to tell the server to chunk.
Turns out HTTP has a way to indicate that, but WSGI outright forbids
its use. So instead you have to invent out-of-band mechanisms for the
application to tell the server what to do, and in the process reinvent
part of HTTP.


-- 
Bureaucrat Conrad, you are technically correct -- the best kind of correct.


Re: [Web-SIG] WSGI 2

2009-08-04 Thread André Malo
* Graham Dumpleton wrote: 


 Now, the reason why Apache can't really handle anything besides UTF-8
 relates to how filenames are encoded in the file system.

 Taking Windows first as it is the more obvious case. What Apache does
 there is take whatever path it has mapping to a script file, be it
 constructed partially from what is in Apache configuration and
 partially from what was supplied in URL from client, and converts it
 to UCS2 for passing to Windows file system routines. In converting to
 UCS2, Apache assumes that the path will be UTF-8. This means that the
 Apache configuration file has to be UTF-8 and that the URL as supplied
 by the client is UTF-8 as well after any URL character encoding is
 decoded. End result, can only handle UTF-8.

This is the only platform where Apache does that, actually, because it
doesn't work any other way on Windows (everything is passed to the system
as UCS-2). So I wouldn't say that Apache requires UTF-8 everywhere. If I
cared, I would even make it configurable on Windows, but I don't ;)

[...]

nd


Re: [Web-SIG] WSGI 2

2009-08-04 Thread André Malo
* Jim Fulton wrote: 


 On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby p...@telecommunity.com wrote:
 
  HTTP moves bytes, therefore WSGI should move bytes.  For practical
  reasons, it would be good to *also* support strings on the application
  side, especially for application migration.  However, I see no reason
  to make *servers* provide decoded strings instead of bytes.

 +1

 I haven't had enough time to follow this and earlier encoding
 discussions and so haven't commented up to now, but I've always been
 uncomfortable with WSGI using anything but bytes or assuming any
 encoding.  I agree that application frameworks should deal with
 conversion between bytes and unicode.

Another +1 from the peanut gallery.

nd


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Robert Brewer
James Bennett wrote:
 On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight f...@fuhm.net wrote:
 But that works just fine today. Your WSGI app sends streaming data back
 using the iterator functionality, and the server automatically turns it into
 chunks if it's talking to an HTTP 1.1 client. What's the problem?
 
 No, it doesn't work just fine today. Either the server has to assume
 that every response from that application should be chunked (which is
 wrong), or the application needs a way to tell the server to chunk.
 Turns out HTTP has a way to indicate that, but WSGI outright forbids
 its use. So instead you have to invent out-of-band mechanisms for the
 application to tell the server what to do, and in the process reinvent
 part of HTTP.

It doesn't have to be out of band; CherryPy's wsgiserver will send a response 
chunked if the application provides no Content-Length response header.

    if status == 413:
        # Request Entity Too Large. Close conn to avoid garbage.
        self.close_connection = True
    elif "content-length" not in hkeys:
        # All 1xx (informational), 204 (no content),
        # and 304 (not modified) responses MUST NOT
        # include a message-body. So no point chunking.
        if status < 200 or status in (204, 205, 304):
            pass
        else:
            if self.response_protocol == 'HTTP/1.1':
                # Use the chunked transfer-coding
                self.chunked_write = True
                self.outheaders.append(("Transfer-Encoding", "chunked"))
            else:
                # Closing the conn is the only way to determine len.
                self.close_connection = True


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Y Knight


On Aug 4, 2009, at 8:53 PM, Graham Dumpleton wrote:


2. How would use of bytes work for a CGI-WSGI bridge given that
os.environ is not bytes? Where does one get what encoding was used for
os.environ values so it can be converted back to bytes?


On Unix it's simple enough:
On py2.X on Unix: environ is bytes already.
On py3.0: you're screwed, because some env vars were discarded already.
On py3.1+: 'string'.encode(sys.getfilesystemencoding(),  
'surrogateescape') should do it.
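
A short sketch of the py3.1+ case, assuming a Unix-like system. The helper name environ_value_to_bytes is hypothetical; the round trip uses an explicit 'utf-8' codec so the result does not depend on the machine's filesystem encoding.

```python
import sys

def environ_value_to_bytes(value):
    """Recover the original bytes of an environment value (Python 3.1+, Unix)."""
    return value.encode(sys.getfilesystemencoding(), "surrogateescape")

# What happens to a byte that is not valid in the decoding charset:
raw = b"/caf\xe9"                                   # Latin-1 bytes, invalid as UTF-8
as_text = raw.decode("utf-8", "surrogateescape")    # undecodable byte becomes a lone surrogate
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

Later Python versions also offer os.fsencode() and, on Unix, os.environb, which cover the same need without spelling out the error handler.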


On Windows, I guess the OS environment is unicode, so I don't know
precisely what to do to reversibly obtain the bytes sent from the
end-user's browser. It looks to me from the source code as if Apache will
encode the bytes from the client (utf-8 or otherwise!) as the Unicode
values 0x00 to 0xFF in the Windows environment, that is, as if
decoding the client input in latin-1. But it does that for the
following keys only:

HTTP_*
SERVER_*
REQUEST_*
QUERY_STRING
PATH_INFO
PATH_TRANSLATED
(from 
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/arch/win32/mod_win32.c)

Other values are decoded from utf-8 (or, if passed through from an  
enclosing environment, passed through untouched -- via encoding into  
utf-8 for internal use and then decoding back from utf-8 to put back  
in the Windows environment.)


I'll note that while it's important to get this transformation correct  
for a CGI-WSGI bridge to work right in Windows, and thus is  
definitely a useful discussion to have here, it doesn't actually need  
to be part of the WSGI spec.


James


Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 10:53 AM 8/5/2009 +1000, Graham Dumpleton wrote:

Now, the main reason why I am throwing around alternate suggestions in
the first place is that last time although people seem to be
comfortable moving along with the idea of latin-1 everywhere, I knew
of some who weren't happy with that, some not on the list, and who
believed it should be bytes, but they weren't speaking up.


I suspect that this was all a confusion to begin with; the primary 
function of Latin-1 in WSGI has been a way to represent bytes when 
all you have to represent them with is unicode strings.  So, even 
when we've been talking Latin-1, what we really mean is bytes.  ;-)
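
The "Latin-1 really means bytes" reading can be checked directly: every byte value survives a Latin-1 decode and encode unchanged, so an application can always get the wire bytes back and apply the charset it actually wants. The variable names below are illustrative only.

```python
# Latin-1 maps byte N to code point N, so the round trip is lossless.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
restored = as_text.encode("latin-1")

# An application recovering a UTF-8 value that a server handed over as Latin-1:
wire = "caf\u00e9".encode("utf-8")        # bytes as sent by the client
from_server = wire.decode("latin-1")      # what the application sees (mojibake)
fixed = from_server.encode("latin-1").decode("utf-8")
```

The intermediate string is not meaningful text, which is the objection raised against this scheme; it is only a carrier for the bytes.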


In general, I think we want to require that servers must provide 
bytes, and accept both bytes and Latin-1 (maybe just ASCII?) 
strings.  (I don't see a problem with environ keys being strings, 
though, since all the WSGI or CGI-defined keys are pure ASCII 
anyway.  But I could just as easily go with bytes everywhere; I 
assume Py3 treats all-ascii byte strings and the equivalent unicode 
as being equal and hashing alike.)
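
A quick check suggests the assumption in the parenthesis does not hold on Python 3: bytes and str never compare equal, so a dict keyed by one cannot be looked up with the other. The sample environ dict is made up for the illustration.

```python
# A str-keyed environ, as a Python 3 server might build it.
environ = {"PATH_INFO": "/index"}

# All-ASCII bytes and the equivalent str are still different objects.
same = (b"PATH_INFO" == "PATH_INFO")

# Dict lookup relies on equality, so the bytes key is simply absent.
found = environ.get(b"PATH_INFO")
```

So the choice between bytes keys and string keys would have to be made explicitly; the two cannot be mixed transparently.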




[Web-SIG] WSGI 2

2009-08-03 Thread Ian Bicking
So... what about WSGI 2?  Let's not completely drop the ball on this.
I *think* we were largely in agreement; debate got distracted by some
async stuff, but I don't think we particularly have to deal with that
for WSGI 2.  I think we do more than enough if we figure out: WSGI in
Python 3, i.e., with unicode; some basic errata kind of stuff, like
readline signature; change the callable signature to remove
start_response.

Would this be a new PEP or a revision?  I think it should be a new
PEP, as WSGI 1 remains valid and the same as it always was, and PEP
333 describes that.  Is there anyone willing to make the revisions?

-- 
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker


Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:

Would this be a new PEP or a revision?  I think it should be a new
PEP, as WSGI 1 remains valid and the same as it always was, and PEP
333 describes that.


+1 for a new PEP, since we'd be able to drop a lot of crufty examples 
and explanations about the cruddy bits.  wsgiref should add 1-2 and 
2-1 adapters.  (Although technically, running a WSGI 1 application 
in a WSGI 2 server requires either threads or greenlets.)


IMO, the main benefit of implementing WSGI 2 is to applications, not 
servers, with the possible exception of async servers (e.g. Twisted) 
that would prefer an iterator-only communications mode.  Such servers 
could refactor their WSGI 1 support into a (thread or greenlet-based) 
WSGI 2-1 adapter.


Synchronous servers, OTOH, might as well stay WSGI 1, and simply use 
a standard 1-2 adapter to support WSGI 2.
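
A rough sketch of the easy adapter direction, letting a WSGI 1 server call an application written in the proposed start_response-free style. WSGI 2 was not finalised in this thread, so the (status, headers, body) return triple and the names wsgi2_to_wsgi1 and hello2 are assumptions made for the example.

```python
def wsgi2_to_wsgi1(app2):
    """Wrap a hypothetical WSGI 2 style app so a WSGI 1 server can call it."""
    def app1(environ, start_response):
        # Assumed WSGI 2 convention: the app returns everything in one go.
        status, headers, body = app2(environ)
        start_response(status, headers)
        return body
    return app1

def hello2(environ):
    """Example app in the assumed WSGI 2 style (no start_response)."""
    return "200 OK", [("Content-Type", "text/plain")], [b"hello\n"]

# Exercise the adapter with a stand-in for the server's start_response.
seen = []
result = wsgi2_to_wsgi1(hello2)({}, lambda status, headers: seen.append((status, headers)))
```

The reverse direction is the hard one: a WSGI 1 application may call start_response or write() at any point, which is why threads or greenlets come up above.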





Re: [Web-SIG] WSGI 2

2009-08-03 Thread Graham Dumpleton
2009/8/4 Mark Ramm mark.mchristen...@gmail.com:
 In summary, just seems more sane to have stuff in WSGI environment be
 dealt with as UTF-8.

 This sounds good to me.   Rack, Jack, and even java servlets seem to
 make this assumption without significant trouble, and if nearly all
 existing web servers do it internally, that seems like an even
 better argument.

What do they do for the response side though? Do they have the
bytes/string distinction that we are talking about, with bytes expected
but strings accepted only if representable as latin-1?

Graham


Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 10:48 AM 8/4/2009 +1000, Graham Dumpleton wrote:

2009/8/4 P.J. Eby p...@telecommunity.com:
 At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:

 Would this be a new PEP or a revision?  I think it should be a new
 PEP, as WSGI 1 remains valid and the same as it always was, and PEP
 333 describes that.

 +1 for a new PEP, since we'd be able to drop a lot of crufty examples and
 explanations about the cruddy bits.  wsgiref should add 1-2 and 2-1
 adapters.  (Although technically, running a WSGI 1 application in a WSGI 2
 server requires either threads or greenlets.)

 IMO, the main benefit of implementing WSGI 2 is to applications, not
 servers, with the possible exception of async servers (e.g. Twisted) that
 would prefer an iterator-only communications mode.  Such servers could
 refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2-1
 adapter.

 Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a
 standard 1-2 adapter to support WSGI 2.

Personally I don't believe we should be trying to support async
servers in the WSGI specification.


I'm not suggesting adding anything for async servers; I'm just saying 
that they will likely prefer to use WSGI 2 and use a 2-1 adapter to 
do WSGI 1 support, whereas synchronous servers will likely prefer the reverse.


The WSGI spec doesn't currently require streaming upload support, so 
if an async server wants to buffer the input (e.g. to a temp file) 
rather than trusting the application to handle reads, it's free to do 
so.  (And that's independent of whether it's WSGI 1 or 2 being used.)




Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote:

1. When running under Python 3, applications SHOULD produce bytes
output, status line and headers.

This is effectively what we had before. The only difference is that
it clarifies that the 'status line' values should also be bytes. This
wasn't noted before. I had already updated the proposed WSGI 1.0
amendments page to mention this.


+1



2. When running under Python 3, servers and gateways MUST accept
strings for output, status line and headers. Such strings must be
converted to bytes output using 'latin-1'. If string cannot be
converted then is treated as an error.

This is again what we had before except that mention 'status line' value.
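
Point 2 could be implemented on the server side roughly as follows. The helper name to_wire and the choice of ValueError for the failure case are assumptions; the proposal only says that an unconvertible string is treated as an error.

```python
def to_wire(value):
    """Convert an application-supplied status, header or body item to bytes."""
    if isinstance(value, bytes):
        return value                      # already bytes: pass through untouched
    try:
        return value.encode("latin-1")    # strings must fit in latin-1
    except UnicodeEncodeError:
        # Characters above U+00FF cannot be represented; report an error.
        raise ValueError("not representable in latin-1: %r" % (value,))

status = to_wire("200 OK")
header = to_wire("caf\u00e9")
```

A string containing, say, Thai characters would raise here, which pushes the application towards encoding its output itself and handing over bytes.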

3. When running under Python 3, servers MUST provide wsgi.input as a
binary (byte) input stream.

No change here.

4. When running under Python 3, servers MUST provide a text stream for
wsgi.errors. In converting this to a byte stream for writing to a
file, the default encoding would be applied.

No real change here except to clarify that default encoding would
apply. Use of default encoding though could be problematic if
combining different WSGI components. This is because each WSGI
component may have been developed on system with different default
encoding and so one may expect to log characters that can't be written
on a different setup. Not sure how you could solve that except to say
people have default encoding be UTF-8 for portability.


Also +1.



5. When running under Python 3, servers MUST provide CGI HTTP and
server variables as strings. Where such values are sourced from a byte
string, be that a Python byte string or C string, they should be
converted as 'UTF-8'. If a specific web server infrastructure is able
to support different encodings, then the WSGI adapter MAY provide a
way for a user of the WSGI adapter to customise on a global basis, or
on a per value basis what encoding is used, but this is entirely
optional. Note that there is no requirement to deal with RFC 2047.

This is where I am going to diverge from what has been discussed before.

The reason I am going to pass as UTF-8 and not latin-1 is that it
looks like Apache effectively only supports use of UTF-8. Since this
means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
even CGI likely cannot handle anything besides UTF-8 then I really
can't see the point of trying to cater for a theoretical possibility
that some HTTP client could use something besides UTF-8. In other
words, the predominant case will be UTF-8, so let us target that.

So, rather than burden every WSGI application with the need to convert
from latin-1 back to bytes and then to UTF-8, let the server deal with
it, with the server using a sensible default. Where the server
infrastructure can handle a different encoding, it can provide an
option to use that encoding and the WSGI application doesn't need to
change.
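
For reference, the round trip that the paragraph above wants to spare
applications looks roughly like this. The sample value is made up for
illustration; the point is only that latin-1 decoding is reversible, so
the original bytes can be recovered and decoded again.

```python
raw = 'caf\u00e9'.encode('utf-8')        # bytes as sent by a UTF-8 client

as_latin1 = raw.decode('latin-1')        # what a latin-1 server would pass on
recovered = as_latin1.encode('latin-1')  # application gets the bytes back
text = recovered.decode('utf-8')         # and decodes them as it sees fit
```

Here as_latin1 is a five-character string that looks like mojibake;
only after the second step does the application hold the intended text.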


Maybe I'm missing something here, but what if Apache receives 
something encoded in Latin-1?  AFAIR, form POST encoding is 
determined by the encoding of the page containing the form; that's of 
course something that only happens in the input body, but what about URLs?


Mainly I'm wondering, what should the server do in the event they 
receive a byte string which is not valid UTF-8?  (Latin-1 doesn't 
have this problem, since there's no such thing as an invalid Latin-1 
string, at least not at the encoding level.)
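
The difference is easy to see at the interpreter. This is a minimal
sketch with a made-up byte string, using Python's own codecs rather
than anything a particular server does.

```python
raw = b'caf\xe9'   # 'e' with an acute accent, encoded as latin-1

# latin-1 maps every byte value to a code point, so this cannot fail.
as_latin1 = raw.decode('latin-1')

# The same bytes are not well-formed UTF-8, so a strict decode raises.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```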




Also shown though that SCRIPT_NAME part has to be UTF-8
and we would really be entering fantasy land if you were somehow going
to cope with some different encoding for PATH_INFO and QUERY_STRING.
Instead it is like the GPL, viral in nature. Use of UTF-8 in one
particular area means you are effectively bound to use UTF-8
everywhere else.


I'm not clear on your logic here.  If I request foo/bar/baz (where 
baz actually has an accent over the 'a') in latin-1 encoding, and 
foo/bar is the script, then the (accented) baz is legitimate for 
pass-through to the application, no?


I just tried testing this with Firefox and Apache, and found that you 
can in fact pass such Latin-1 strings through to PATH_INFO, but at 
least in the case of Firefox, you have to %-escape them.  However, 
they are seen by Python (via os.environ) as latin-1 encoded byte strings.
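
The same effect can be reproduced without a browser. The sketch below
assumes the request path was sent as /foo/bar/b%E1z (my guess at the
escaped form of the example above) and uses urllib.parse from Python 3
to undo the %-escaping.

```python
from urllib.parse import unquote_to_bytes

requested = '/foo/bar/b%E1z'              # assumed %-escaped request path

path_bytes = unquote_to_bytes(requested)  # raw bytes after unescaping
path_latin1 = path_bytes.decode('latin-1')
```

The unescaped value contains the single byte 0xE1, which is the
latin-1 encoding of the accented 'a' and is not valid UTF-8 on its own.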




A further example of why UTF-8 reaches into everything is the
mod_rewrite module for Apache. This allows you to do stuff related to
the SCRIPT_NAME, PATH_INFO and QUERY_STRING parts of a URL. It has
already been shown that the Apache configuration file has to be UTF-8.
If the URL isn't, then it wouldn't be possible to perform matches
against non latin-1 characters in a rewrite condition or rule. This is
because your match string would be in a different encoded form from
that in the URL and so wouldn't match.
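
The mismatch described here can be sketched with Python's re module
standing in for mod_rewrite. This is an analogy under that assumption,
not a description of how Apache actually evaluates rewrite rules; the
pattern and URLs are made up.

```python
import re

# A rule as it would be stored in a UTF-8 encoded configuration file.
pattern = re.compile('/b\u00e1z$'.encode('utf-8'))

url_utf8 = '/foo/bar/b\u00e1z'.encode('utf-8')
url_latin1 = '/foo/bar/b\u00e1z'.encode('latin-1')

matches_utf8 = pattern.search(url_utf8) is not None
matches_latin1 = pattern.search(url_latin1) is not None
```

Only the UTF-8 encoded URL matches, because the accented character is
two bytes in the pattern but a single byte in the latin-1 URL.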


Note that this still doesn't have any impact on the bytes that 
actually reach the application, which can be non-UTF8.  At minimum, 
the proposal is underspecified as to how to handle this case, which 
is as trivial to generate as