Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-20 Thread Henry Precheur
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
> However, we quite often use only a portion of the URI when attempting
> to locate an appropriate handler; sometimes just the leading "/"
> character! The remaining characters are often passed as function
> arguments to the handler, or stuck in some parameter list/dict. In
> many cases, the charset used to decode these values either: is
> unimportant; follows complex rules from one resource to another; or is
> merely reencoded, since the application really does care about bytes
> and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
> environ entry to declare the charset which was used to decode) can
> handle all of these cases. Server configuration options cannot, at
> least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI
here, but all request headers)


Encoding everything using ISO-8859-1 has the nice property of keeping
information intact. It would be a good heuristic if everything, with a few
exceptions, was encoded using ISO-8859-1. Just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.
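
A minimal sketch of that round trip, assuming a made-up header value that
arrived as UTF-8 bytes (the variable names are illustrative only):

```python
# Hypothetical raw header value as it arrived on the wire (UTF-8 bytes).
raw = "fran\u00e7ois".encode("utf-8")        # b'fran\xc3\xa7ois'

# What a gateway would hand to the application under this scheme.
as_latin1 = raw.decode("iso-8859-1")

# Every byte maps to exactly one code point, so nothing is lost...
assert as_latin1.encode("iso-8859-1") == raw

# ...but the text is wrong until the application transcodes it.
assert as_latin1 != "fran\u00e7ois"
transcoded = as_latin1.encode("iso-8859-1").decode("utf-8")
assert transcoded == "fran\u00e7ois"
```

The string keeps every byte, but it only becomes real text once the
application re-encodes it and decodes it with the right charset.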


But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same, so much the same
that it was possible to mix them without too many problems:

  >>> 'foo' == u'foo'
  True

Python 3 made bytes and string 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

  >>> b'foo' == 'foo'
  False

By passing `str()` to the application, the application author could
believe that the encoding problem has been handled. But in most cases it
hasn't been handled at all. The application author should still
transcode all the strings incorrectly encoded. We are back to Python 2's
bad old days, where we can't be sure that what we got is properly
encoded:

  Was that string encoded using latin-1? Maybe a middleware transcoded
  it to UTF-8 before the application was called. Maybe the application
  itself transcoded it at some point, but then we need to keep track of
  what was transcoded. Maybe the application should transcode everything
  when it is called.

Also EVERY application author will have to read the PEP, especially the
paragraph saying:

  > Everything we give you are strings, but you still have to deal
  > with the encoding mess.

Otherwise he will have weird problems like when he was using Python 2,
because the interface is not clear. Strings are supposed to be text and
only text. Encoding everything to ISO-8859-1 means strings are not text
anymore, they are 'encoded data' [1].


bytes are supposed to be 'encoded data' and binary blobs. By giving
applications bytes, the author knows right away he should decode them.
No need to read the PEP.


`bytes` can do everything `str` can do with the notable exception of
'format'.

  >>> b'foo bar'.title()
  b'Foo Bar'

  >>> b'/foo/bar/fran\xc3ois'.split(b'/')
  [b'', b'foo', b'bar', b'fran\xc3ois']

  >>> import re
  >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
  (b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half baked solution.


Using bytes also has its set of problems. The standard library doesn't
support bytes very well. For example urllib.response.unquote() doesn't
work with bytes, and urllib.parse too has issues.

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

-- 
  Henry Prêcheur
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread P.J. Eby

At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:

> Did I say 3 reasons? I meant 4: Accept-Charset.


Chief amongst the reasons...  amongst our reasonry...  Right, we'll 
come in again.  ;-)




Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread Robert Brewer
I wrote:
> Applications do produce URI's (and IRI's, etc. that need to be
> converted into URI's) and do transfer them in media types like
> HTML, which define how to encode a.href's and form.action's
> before %-encoding them [4]. But these are not the only vectors
> by which clients obtain or generate Request-URI's.
> ...
> As someone (Alan Kennedy?) noted at PyCon, static resources may
> depend upon a filename encoding defined by the OS which is
> different than that of the rest of the URI's generated/understood
> by even the most coherent application.
> ...
> "In practical terms, character-by-character comparisons should be
> done codepoint-by-codepoint after conversion to a common character
> encoding." In other words, the URI spec seems to imply that the
> two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the former
> is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF"
> encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about
> this, since all environ values must be byte strings. IMO WSGI
> 2 should do better in this regard.
> ...
> For the three reasons above, I don't think we can assume that the
> application will always receive equivalent URI's encoded in a
> single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.


Robert Brewer
fuman...@aminus.org


[Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-16 Thread Robert Brewer
I wrote:
> PATH_INFO and QUERY_STRING are ... decoded via a configurable
> charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful
> is stored at environ['REQUEST_URI_ENCODING'] so middleware and
> apps can transcode if needed.

and Ian replied:
> My understanding is that PATH_INFO *should* be UTF-8 regardless of
> what encoding a page might be in.  At least that's what I got when
> testing Firefox.  It might not be valid UTF-8 if it was manually
> constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the 
encoding of the document [1] or Windows-1252 [2] for the querystring. But the 
vast majority of HTTP user agents are not browsers [3]. Even if that were not 
so, we should not define WSGI to only interoperate with the most current 
browsers.

and Graham added:
> Thinking about it for a while, I get the feel that having a fallback
> to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
> URLs wouldn't consistently use the same encoding all the time just
> seems wrong. I would see it as returning a bad request status. If an
> application coder knows they are actually going to be dealing with
> latin-1, as that is how the application is written, then they should
> be specifying it should be latin-1 always instead of utf-8. Thus, the
> WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into 
URI's) and do transfer them in media types like HTML, which define how to 
encode a.href's and form.action's before %-encoding them [4]. But these are not 
the only vectors by which clients obtain or generate Request-URI's.

> For simple WSGI adapters which only service one WSGI application, then
> it would apply to whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a 
filename encoding defined by the OS which is different than that of the rest of 
the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI 
comparison. Comparison is at the heart of handler dispatch, static resource 
identification, and proper HTTP cache operation. It is for these reasons that 
RFC 3986 has an extensive section on the matter [5], including a "ladder" of 
approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing showed it to be)
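
Three rungs of that ladder (case, percent-encoding, and path segment
normalization) can be sketched on raw bytes, without decoding to text. This
is a simplified illustration for absolute paths only; the function names are
made up here, and it is not a complete RFC 3986 implementation:

```python
import re

# Unreserved characters from RFC 3986, section 2.3.
UNRESERVED = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"

def _normalize_escape(match):
    hex_digits = match.group(1)
    byte = bytes([int(hex_digits, 16)])
    if byte in UNRESERVED:
        return byte                      # percent-encoding normalization
    return b"%" + hex_digits.upper()     # case normalization

def _remove_dot_segments(path):
    segments = path.split(b"/")
    output = []
    for segment in segments:
        if segment == b"..":
            if len(output) > 1:          # never pop the leading empty segment
                output.pop()
        elif segment != b".":
            output.append(segment)
    if segments[-1] in (b".", b".."):
        output.append(b"")               # keep the trailing slash
    return b"/".join(output)

def normalize_path(path):
    """Normalize an absolute, still percent-encoded path given as bytes."""
    path = re.sub(rb"%([0-9A-Fa-f]{2})", _normalize_escape, path)
    return _remove_dot_segments(path)

assert normalize_path(b"/a%3d") == normalize_path(b"/a%3D")
assert normalize_path(b"/a%62c") == b"/abc"
assert normalize_path(b"/abc/../def") == b"/def"
```

Escapes are normalized before dot segments are removed, so that an escaped
"." is treated like a literal one.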

I think it would be beneficial to those who develop WSGI application interfaces 
to be able to assume that at least case-, percent-, path-, and 
scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by 
all WSGI 2 origin servers.

All of those except for the first one can be accomplished without decoding the 
target URI. But that first section specifically states: "In practical terms, 
character-by-character comparisons should be done codepoint-by-codepoint after 
conversion to a common character encoding." In other words, the URI spec seems 
to imply that the two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the 
former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in 
ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ 
values must be byte strings. IMO WSGI 2 should do better in this regard.
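
The equivalence of those two URIs can be checked with a short sketch. It
assumes we already know which charset each client used, which is exactly the
information a server usually lacks:

```python
from urllib.parse import unquote_to_bytes

# Percent-decode only; the results are still bytes.
utf8_form = unquote_to_bytes("/a%c3%bf")      # b'/a\xc3\xbf'
latin1_form = unquote_to_bytes("/a%ff")       # b'/a\xff'

# As bytes the two paths differ...
assert utf8_form != latin1_form

# ...but they name the same characters once each is decoded
# with the charset that produced it.
assert utf8_form.decode("utf-8") == latin1_form.decode("iso-8859-1") == "/a\u00ff"
```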

> For something like Apache where it
> could map to multiple WSGI applications, then it may want to provide
> means of overriding encoding for specific subsets of URLs, i.e., using
> Location directive for example.

For the three reasons above, I don't think we can assume that the application 
will always receive equivalent URI's encoded in a single, foreseen encoding. 
Yet we still haven't answered the question of how to handle unforeseen 
encodings. You're right that, if the server-side stack as a whole cannot map a 
particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 
over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate 
an appropriate handler; sometimes just the leading "/" character! The remaining 
characters are often passed as function arguments to the handler, or stuck in 
some parameter list/dict. In many cases, the charset used to decode these 
values either: is unimportant; follows complex rules from one resource to 
another; or is merely reencoded, since the application really does care about 
bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI 
environ entry to declare the charset which was used to decode) can handle all 
of these cases. Server configuration options cannot, at least not without their 
specification becoming unwieldy.

Re: [Web-SIG] WSGI 2

2009-08-16 Thread Graham Dumpleton
2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer  wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in.  At least that's what I got when testing
> Firefox.  It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation.  (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).

Thinking about it for a while, I get the feel that having a fallback
to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
URLs wouldn't consistently use the same encoding all the time just
seems wrong. I would see it as returning a bad request status. If an
application coder knows they are actually going to be dealing with
latin-1, as that is how the application is written, then they should
be specifying it should be latin-1 always instead of utf-8. Thus, the
WSGI adapter should provide a means to override what encoding is used.
For simple WSGI adapters which only service one WSGI application, then
it would apply to whole URL namespace. For something like Apache where
it could map to multiple WSGI applications, then it may want to provide
means of overriding encoding for specific subsets of URLs, i.e., using
Location directive for example.

Graham


Re: [Web-SIG] WSGI 2

2009-08-13 Thread Henry Precheur
On Wed, Aug 12, 2009 at 12:05:40AM -0500, Ian Bicking wrote:
> Correct -- you can write any set of % encodings, and I don't think it even
> has to be able to validly url-decode (e.g., /foo%zzz will work).  It
> definitely doesn't have to be a valid encoding.  However, if you actually
> include unicode characters, they will always be encoded as UTF-8 (as goes
> with the IRI standard).  This is in a case like , the
> browser will request /some%20page, because it escapes unsafe characters.
>  Similarly if you request  it will encode that ç in
> UTF-8, then url-encode it, even if the page itself is ISO-8859-1.  Well, at
> least on Firefox.  I used this to test:
> http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

I have run some tests regarding the encoding issue:

curl doesn't 'url-encode' its URLs:

  curl 'http://hostname/français'
                            ^
                            latin-1 character

The latin-1 character is sent to the server. Lighttpd accepts the URL
and even returns a file if it exists. Of course if I try with the same
characters in UTF-8 it doesn't work.
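
For comparison, here is a small sketch of the raw bytes in each encoding next
to what a client that does url-encode would send. The path is made up for
illustration:

```python
from urllib.parse import quote

path = "/fran\u00e7ais"

# Raw bytes, as a client that does not url-encode would send them.
raw_latin1 = path.encode("iso-8859-1")   # b'/fran\xe7ais'
raw_utf8 = path.encode("utf-8")          # b'/fran\xc3\xa7ais'

# The same path percent-encoded; quote() uses UTF-8 unless told otherwise.
escaped_utf8 = quote(path)                            # '/fran%C3%A7ais'
escaped_latin1 = quote(path, encoding="iso-8859-1")   # '/fran%E7ais'
```

Either escaped form is pure ASCII, but nothing in the request says which
charset produced it.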

AFAIK RFC 2396 forbids non-ASCII characters in URLs. The problem is that
libcurl is quite popular (it used to be the transport library of
Webkit/GTK+ for example.) It's hard to discard it as an utterly broken &
obscure tool. Many 'simplistic' HTTP clients may have the same problem.


Now let's talk a little bit about cookies...

Cookies can contain whatever 'binary junk' the server sends. RFC 2965
says (http://tools.ietf.org/html/rfc2965#page-5):

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Also, cookies can contain 'comments' which contain UTF-8 strings.
(http://tools.ietf.org/html/rfc2965#page-6):

> Characters in value MUST be in UTF-8 encoding.

Firefox has no problem with cookies containing non-ASCII characters. It
looks like it assumes cookies are encoded using latin-1, since latin-1
characters are displayed correctly in Firebug, but not UTF-8 ones.


Cheers,

-- 
  Henry Prêcheur


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Ian Bicking
On Tue, Aug 11, 2009 at 11:58 PM, Graham Dumpleton <
graham.dumple...@gmail.com> wrote:

> 2009/8/12 Ian Bicking :
> > On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer wrote:
> >>
> >> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> >> > server variables as strings. Where such values are sourced from a byte
> >> > string, be that a Python byte string or C string, they should be
> >> > converted as 'UTF-8'. If a specific web server infrastructure is able
> >> > to support different encodings, then the WSGI adapter MAY provide a
> >> > way for a user of the WSGI adapter to customise on a global basis, or
> >> > on a per value basis what encoding is used, but this is entirely
> >> > optional. Note that there is no requirement to deal with RFC 2047.
> >>
> >> We're passing unicode for almost everything.
> >>
> >> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> >> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> >> ACTUAL_SERVER_PROTOCOL entries.
> >>
> >> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> >> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> >> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> >> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> >> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> >> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> >> it, we would make it decoded by the same charset.
> >
> > My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> > encoding a page might be in. At least that's what I got when testing
> > Firefox.  It might not be valid UTF-8 if it was manually constructed, but
> > then there's little reason to think it is valid anything; only the bytes or
> > REQUEST_URI are likely to be an accurate representation.
>
> As I understood it, PJE was suggesting that wasn't the case.
>
> For example, what about case where URL appears for target of form POST
> and the encoding of that form page wasn't UTF-8. What is the browser
> going to send in that case.
>
> Or is this the sort of case you have tested and qualify as saying if
> manually constructed anything could happen?
>

Correct -- you can write any set of % encodings, and I don't think it even
has to be able to validly url-decode (e.g., /foo%zzz will work).  It
definitely doesn't have to be a valid encoding.  However, if you actually
include unicode characters, they will always be encoded as UTF-8 (as goes
with the IRI standard).  This is in a case like , the
browser will request /some%20page, because it escapes unsafe characters.
 Similarly if you request  it will encode that ç in
UTF-8, then url-encode it, even if the page itself is ISO-8859-1.  Well, at
least on Firefox.  I used this to test:
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer  wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in. At least that's what I got when testing
> Firefox.  It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation.

As I understood it, PJE was suggesting that wasn't the case.

For example, what about case where URL appears for target of form POST
and the encoding of that form page wasn't UTF-8. What is the browser
going to send in that case.

Or is this the sort of case you have tested and qualify as saying if
manually constructed anything could happen?

> (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).

Graham


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Ian Bicking
On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer  wrote:

> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> > server variables as strings. Where such values are sourced from a byte
> > string, be that a Python byte string or C string, they should be
> > converted as 'UTF-8'. If a specific web server infrastructure is able
> > to support different encodings, then the WSGI adapter MAY provide a
> > way for a user of the WSGI adapter to customise on a global basis, or
> > on a per value basis what encoding is used, but this is entirely
> > optional. Note that there is no requirement to deal with RFC 2047.
>
> We're passing unicode for almost everything.
>
> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> ACTUAL_SERVER_PROTOCOL entries.
>
> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> it, we would make it decoded by the same charset.
>

My understanding is that PATH_INFO *should* be UTF-8 regardless of what
encoding a page might be in.  At least that's what I got when testing
Firefox.  It might not be valid UTF-8 if it was manually constructed, but
then there's little reason to think it is valid anything; only the bytes or
REQUEST_URI are likely to be an accurate representation.  (Frankly I wish
PATH_INFO was not url-decoded, which would remove this issue entirely --
REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
know of reasonable cases where this wouldn't be true.)

I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
used to kind of reconstruct the original request path (the surrogateescape
or whatever it is called would serve the same purpose, but is only available
in Python 3).
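
A minimal sketch of the two reconstruction options mentioned here, using a
made-up path whose bytes are ISO-8859-1 and therefore not valid UTF-8:

```python
raw_path = b"/fran\xe7ais"   # ISO-8859-1 bytes, not valid UTF-8

# Option 1: decode as UTF-8 but smuggle undecodable bytes through as
# lone surrogates, then reverse the process to get the bytes back.
as_text = raw_path.decode("utf-8", errors="surrogateescape")
assert as_text == "/fran\udce7ais"
assert as_text.encode("utf-8", errors="surrogateescape") == raw_path

# Option 2: decode as ISO-8859-1, which never fails and is also reversible.
via_latin1 = raw_path.decode("iso-8859-1")
assert via_latin1.encode("iso-8859-1") == raw_path
```

Both keep the original bytes recoverable. They differ in what the
application sees in the meantime: a lone surrogate that marks the bad byte,
or a plausible-looking Latin-1 character.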

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Robert Brewer
Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is
> what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If string cannot be
> converted then is treated as an error.

Yes.
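
A rough sketch of what rule 2 asks of a server. The helper name `to_wire` is
hypothetical and not taken from CherryPy or mod_wsgi:

```python
def to_wire(value):
    """Convert a status line or header value to bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value
    # Raises UnicodeEncodeError if a character is outside latin-1,
    # which the server would then treat as an application error.
    return value.encode("latin-1")

assert to_wire("200 OK") == b"200 OK"
assert to_wire("caf\u00e9") == b"caf\xe9"

try:
    to_wire("\u20ac5")               # EURO SIGN is not in latin-1
except UnicodeEncodeError:
    rejected = True
else:
    rejected = False
```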

> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and must 
be ascii-decodable. So are SERVER_PROTOCOL and our custom 
ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, 
PATH_INFO and QUERY_STRING are parsed from it, and decoded via a configurable 
charset, defaulting to UTF-8. If the path cannot be decoded with that charset, 
ISO-8859-1 is tried. Whichever is successful is stored at 
environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. 
Our origin server always sets SCRIPT_NAME to '', but if we populated it, we 
would make it decoded by the same charset.
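
The fallback described above could look roughly like the following. This is
an illustrative sketch, not CherryPy's actual code; `decode_path` and its
`charsets` parameter are names invented for the example:

```python
def decode_path(raw_path, charsets=("utf-8", "iso-8859-1")):
    """Try each charset in order; return the text and the charset that worked."""
    for charset in charsets:
        try:
            return raw_path.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("path could not be decoded with any configured charset")

environ = {}
# b'\xe7' is not valid UTF-8 here, so the ISO-8859-1 fallback is used.
environ["PATH_INFO"], environ["REQUEST_URI_ENCODING"] = decode_path(b"/fran\xe7ais")
```

An application that believes the server guessed wrong can reverse the step
with `environ['PATH_INFO'].encode(environ['REQUEST_URI_ENCODING'])` and
decode the bytes again itself.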

All request headers are decoded via ISO-8859-1, which can't fail. Applications 
are expected to transcode these values if they believe them to be in another 
encoding.

> This is where I am going to diverge from what has been discussed before.
> 
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.

That is predominant for the Request-URI, and we are defaulting to utf-8 for 
that as I mentioned above. I believe I demonstrated in 
http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8 
cannot be the predominant encoding for request headers, which are instead 
mostly ASCII with a few ISO-8859-1's, which is why we are defaulting to 
ISO-8859-1.

> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.

If there are indeed more headers which are ISO-8859-1, then that same argument 
cuts both ways.

I have no problem doing the same thing here as we do for PATH_INFO: a 
configurable charset, or better yet a list of charsets to try in order, with a 
sensible default, even UTF-8 would be fine. Regardless of the default, if it is 
configurable, then the successful encoding should be put in a canonical environ 
entry so apps can transcode it if the server got it wrong.

Re: bytes. We really do not want the server to set any of the above environ 
entries (except REQUEST_URI) to bytes. I'm surprised those of you who have 
substantial numbers of WSGI middleware aren't fighting this; it would mean 
decoding the same environ entries every time you switched middleware providers. 
Some of you said as much at PyCon: 
http://mail.python.org/pipermail/web-sig/2009-March/003701.html


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Henry Precheur :
> On Wed, Aug 12, 2009 at 09:25:21AM +1000, Graham Dumpleton wrote:
>> Use of bytes everywhere can be inconvenient on the gateway/server
>> side, at least as far as end result for user.
>
> Yes, but wouldn't it be simpler for mod_wsgi to only deal with bytes?
> unicode C strings -> bytes and char* -> bytes conversions seem
> straightforward.

Programming at the C code level, it doesn't really make any difference, as
it is pretty well the same amount of C API calls. All the code for this is
also already written in mod_wsgi and configurable to be done any which way,
so people could play with different alternatives. When a decision is
actually made, we just need to make that decision the default. The only
extra complexity comes where a subset of the WSGI environment should be
bytes. To make that at least somewhat easier, we need a simple, well
defined rule; one where, if the first character of the variable name is an
uppercase letter, bytes are used, might be reasonable. Anything more
complicated may be a pain.

> But char* -> string doesn't look easy to do, since you have to 'guess'
> the encoding.

Only for stuff that derives from the HTTP request, which is the argument
for using bytes and leaving it up to the application to decide. User
custom variables would be UTF-8, as that is what Apache
effectively treats the configuration file as being.
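
Taken together, the uppercase-name rule and the UTF-8 treatment of custom
variables might be sketched as follows. The function name and the sample
values are hypothetical, and this is not mod_wsgi's implementation:

```python
def split_environ(raw):
    """raw maps str names to bytes values as received from the server."""
    environ = {}
    for name, value in raw.items():
        if name[:1].isupper():
            environ[name] = value                  # CGI-style name: leave as bytes
        else:
            environ[name] = value.decode("utf-8")  # custom variable, e.g. from SetEnv
    return environ

environ = split_environ({
    "PATH_INFO": b"/fran\xe7ais",
    "trac.env_path": b"/usr/local/trac/site-1",
})
```

With this rule `PATH_INFO` stays as bytes for the application to decode,
while `trac.env_path` arrives as a string.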

> This is suppositions, I have never worked on WSGI server/gateway.

Which is the same for most people and perhaps why many don't want to
wade into this argument. That is, the attitude is that it is a problem for
those who want to write hosting adapters and not an issue for
application developers. The reality is that it needs to be guided by
application developers, as they are the ones who have to work with
whatever interface is defined.

Graham

> Correct me if I'm wrong.
>
>> The specific problem is that WSGI environment is used to hold
>> information about the original request, as CGI variables, but also can
>> hold user specified custom variables.
>>
>> In the case of anything hosted via Apache, such as through mod_wsgi,
>> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
>> custom variables using the SetEnv directive. Thus one might say:
>>
>>   SetEnv trac.env_path /usr/local/trac/site-1
>>
>> If the rule is that everything in WSGI environment coming from WSGI
>> adapter must be bytes then you have a potential for mismatch in
>> expectations of how values will be passed. That is, if set using
>> SetEnv then would be bytes, but if set using WSGI middleware wrapper
>> for configuration, more likely going to be string. It would seem
>> overly onerous to expect WSGI middleware to use bytes for
>> configuration variables as well and so force all consumers to always
>> be converting to string using appropriate encoding, where required
>> encoding potentially unknown.
>
> Is it reasonable to expect configuration variable to have a certain
> type? I am tempted to say 'no', but that's because I like the "everything
> is bytes" approach so much :) I don't have any experience with
> configuration variables passed via the WSGI environment though.
>
> But it could be quite a problem, for example 'Developer authentication'
> posted a month ago by Ian Bicking requires its configuration variable to
> be a string, but I don't think this spec applies to WSGI on Py3K or WSGI
> 2.
>
>> This is why I specifically asked previously, and which no one has
>> answered, if bytes is to be used, which variables in WSGI environment
>> should be passed as bytes. If there is a known specified list of
>> variables which it is known will always be bytes, may be more
>> manageable. If someone is going to suggest that only CGI variables
>> should be bytes, then what does that actually mean. Remember that for
>> FASTCGI, SCGI, CGI there isn't really a distinction and so where the
>> boundary is as to what is a CGI variable is fuzzy although you could
>> reverse transformation and get back bytes if know what to do it for.
>>
>> One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
>> QUERY_STRING and maybe that will suffice. It may not though, because
>> what about headers such as HTTP_REFERER? Also, what about additional
>> SSL_? variables that an SSL module for the web server may add?
>
> What you are proposing in 'black-listing' some variables known to cause
> problems.
>
> It will be difficult to come up with an exhaustive list of variables
> with different encoding. Even if we were able to come up with such a
> list, it creates 2 different cases and could end up complicate
> application developer's life. That's why the approach "everything coming
> from the server/gateway is bytes" makes sense, it is simpler to explain,
> it is simpler to understand, and it's, I think, more pythonic (There
> should be one-- and preferably only one --obvious way to do it.)
>
> Just consider the case of cookies, I don't know if you can use non-ASCII
> character in them, but it possible that it will mess up "ev

Re: [Web-SIG] WSGI 2

2009-08-11 Thread Henry Precheur
On Wed, Aug 12, 2009 at 09:25:21AM +1000, Graham Dumpleton wrote:
> Use of bytes everywhere can be inconvenient on the gateway/server
> side, at least as far as end result for user.

Yes, but wouldn't it be simpler for mod_wsgi to only deal with bytes?
unicode C strings -> bytes and char* -> bytes conversions seem
straightforward.

But char* -> string doesn't look easy to do, since you have to 'guess'
the encoding.

These are suppositions; I have never worked on a WSGI server/gateway.
Correct me if I'm wrong.

> The specific problem is that WSGI environment is used to hold
> information about the original request, as CGI variables, but also can
> hold user specified custom variables.
> 
> In the case of anything hosted via Apache, such as through mod_wsgi,
> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
> custom variables using the SetEnv directive. Thus one might say:
> 
>   SetEnv trac.env_path /usr/local/trac/site-1
> 
> If the rule is that everything in WSGI environment coming from WSGI
> adapter must be bytes then you have a potential for mismatch in
> expectations of how values will be passed. That is, if set using
> SetEnv then would be bytes, but if set using WSGI middleware wrapper
> for configuration, more likely going to be string. It would seem
> overly onerous to expect WSGI middleware to use bytes for
> configuration variables as well and so force all consumers to always
> be converting to string using appropriate encoding, where required
> encoding potentially unknown.

Is it reasonable to expect configuration variable to have a certain
type? I am tempted to say 'no', but that's because I like the "everything
is bytes" approach so much :) I don't have any experience with
configuration variables passed via the WSGI environment though.

But it could be quite a problem, for example 'Developer authentication'
posted a month ago by Ian Bicking requires its configuration variable to
be a string, but I don't think this spec applies to WSGI on Py3K or WSGI
2.

> This is why I specifically asked previously, and which no one has
> answered, if bytes is to be used, which variables in WSGI environment
> should be passed as bytes. If there is a known specified list of
> variables which it is known will always be bytes, may be more
> manageable. If someone is going to suggest that only CGI variables
> should be bytes, then what does that actually mean. Remember that for
> FASTCGI, SCGI, CGI there isn't really a distinction and so where the
> boundary is as to what is a CGI variable is fuzzy although you could
> reverse transformation and get back bytes if know what to do it for.
> 
> One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
> QUERY_STRING and maybe that will suffice. It may not though, because
> what about headers such as HTTP_REFERER? Also, what about additional
> SSL_? variables that an SSL module for the web server may add?

What you are proposing is 'black-listing' some variables known to cause
problems.

It will be difficult to come up with an exhaustive list of variables
with different encodings. Even if we were able to come up with such a
list, it creates 2 different cases and could end up complicating
application developers' lives. That's why the approach "everything coming
from the server/gateway is bytes" makes sense: it is simpler to explain,
it is simpler to understand, and it's, I think, more pythonic (There
should be one-- and preferably only one --obvious way to do it.)

Just consider the case of cookies. I don't know if you can use non-ASCII
characters in them, but it is possible that they will mess up "everything
is string except a, b, c" if we forget to include them in the list.
"Everything is bytes" is in this sense more future-proof than
"black-listing a, b, c". If a variable with a weird encoding appears a
few months after the new PEP is released, "everything is bytes" still
works, but the "black-list" approach stops working.


Cheers,

-- 
  Henry Prêcheur
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Ian Bicking :
> On Tue, Aug 11, 2009 at 6:25 PM, Graham Dumpleton
>  wrote:
>>
>> 2009/8/12 Henry Precheur :
>> > Using bytes for all `environ` values is easy to understand on the
>> > application side as long as you are aware of the encoding problem. The
>> > cost is inconvenience, but that's probably OK. It's also simpler to
>> > implement on the gateway/server side.
>>
>> Use of bytes everywhere can be inconvenient on the gateway/server
>> side, at least as far as end result for user.
>>
>> The specific problem is that WSGI environment is used to hold
>> information about the original request, as CGI variables, but also can
>> hold user specified custom variables.
>>
>> In the case of anything hosted via Apache, such as through mod_wsgi,
>> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
>> custom variables using the SetEnv directive. Thus one might say:
>>
>>  SetEnv trac.env_path /usr/local/trac/site-1
>
> Just to clarify, there specifically is no type restrictions on extension
> variables, which is any variable with a "." in it.  The type restrictions
> are solely for ALL_CAPS keys.  You can put ints or unicode or whatever in
> other variables.  (Probably this doesn't make things any easier for
> mod_wsgi, though; at least for this example)

If you want to change what the specification says from:

"""Finally, the environ dictionary may also contain server-defined
variables. These variables should be named using only lower-case
letters, numbers, dots, and underscores, and should be prefixed with a
name that is unique to the defining server or gateway."""

to:

"""Finally, the environ dictionary may also contain server-defined
variables. These variables MUST be named using only lower-case
letters, numbers, dots, and underscores, and should be prefixed with a
name that is unique to the defining server or gateway."""

then it is part of the way there, as at least one is drawing a line
between what is being construed as a CGI variable, and so would be bytes,
and adapter/application variables, which would be converted to string in
whatever encoding makes sense for the server configuration system,
which in the case of Apache would be UTF-8.

The above description would also have to be changed though, in
as much as at the moment it says:

"""should be prefixed with a name that is unique to the defining
server or gateway"""

This isn't really in practice correct as the server configuration is
just providing the mechanism for setting them and they may not
necessarily be server or gateway variables, but variables a user is
setting to customise the behaviour of the application. The way I read
that line, strictly speaking, even though set as:

  SetEnv trac.env_path /usr/local/trac/site-1

it should be passed through as:

  mod_wsgi.trac.env_path

which would be rather silly. Thus description needs to cater for fact
that application variables may be settable from server configuration
and passed through as is.

Anyway, if the rule is that anything in upper case is treated as CGI
and passed as bytes, and anything in lower case isn't and is passed as
string, appropriately decoded, then that would eliminate one confusion
point as far as expectations. It may not make it any easier for CGI
under Python 3.0 though, where values would be all strings anyway.
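
As a rough sketch of that naming rule, a checker could report which type a
given key is expected to carry. The function name `expected_type` is made
up for illustration, and treating mixed-case names as "unspecified" is an
assumption, since the thread does not say what should happen to them.

```python
import re

# Lower-case letters, digits, dots and underscores only, following the
# wording quoted from the specification above.
_EXTENSION_NAME = re.compile(r"[a-z0-9._]+\Z")


def expected_type(name):
    """Hypothetical helper: the value type the proposed rule implies."""
    if _EXTENSION_NAME.match(name):
        return str      # server/adapter/application variable
    if name.isupper():
        return bytes    # treated as a CGI variable
    return None         # mixed case: not covered by the rule


kinds = {name: expected_type(name)
         for name in ("PATH_INFO", "SSL_CLIENT_S_DN", "trac.env_path", "Mixed.Case")}
```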

Now, is anyone willing to address the problem pointed out by others
about where being able to return either bytes or strings (latin-1) for
response headers is a pain for WSGI middleware to deal with?

Graham


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Ian Bicking
On Tue, Aug 11, 2009 at 6:25 PM, Graham Dumpleton <
graham.dumple...@gmail.com> wrote:

> 2009/8/12 Henry Precheur :
> > Using bytes for all `environ` values is easy to understand on the
> > application side as long as you are aware of the encoding problem. The
> > cost is inconvenience, but that's probably OK. It's also simpler to
> > implement on the gateway/server side.
>
> Use of bytes everywhere can be inconvenient on the gateway/server
> side, at least as far as end result for user.
>
> The specific problem is that WSGI environment is used to hold
> information about the original request, as CGI variables, but also can
> hold user specified custom variables.
>
> In the case of anything hosted via Apache, such as through mod_wsgi,
> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
> custom variables using the SetEnv directive. Thus one might say:
>
>  SetEnv trac.env_path /usr/local/trac/site-1
>

Just to clarify, there specifically is no type restrictions on extension
variables, which is any variable with a "." in it.  The type restrictions
are solely for ALL_CAPS keys.  You can put ints or unicode or whatever in
other variables.  (Probably this doesn't make things any easier for
mod_wsgi, though; at least for this example)

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Graham Dumpleton
2009/8/12 Henry Precheur :
> Using bytes for all `environ` values is easy to understand on the
> application side as long as you are aware of the encoding problem. The
> cost is inconvenience, but that's probably OK. It's also simpler to
> implement on the gateway/server side.

Use of bytes everywhere can be inconvenient on the gateway/server
side, at least as far as end result for user.

The specific problem is that WSGI environment is used to hold
information about the original request, as CGI variables, but also can
hold user specified custom variables.

In the case of anything hosted via Apache, such as through mod_wsgi,
mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
custom variables using the SetEnv directive. Thus one might say:

  SetEnv trac.env_path /usr/local/trac/site-1

If the rule is that everything in WSGI environment coming from WSGI
adapter must be bytes then you have a potential for mismatch in
expectations of how values will be passed. That is, if set using
SetEnv then would be bytes, but if set using WSGI middleware wrapper
for configuration, more likely going to be string. It would seem
overly onerous to expect WSGI middleware to use bytes for
configuration variables as well and so force all consumers to always
be converting to string using appropriate encoding, where required
encoding potentially unknown.
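
To illustrate the burden this places on consumers, here is a minimal
sketch of what every reader of a configuration variable would need if the
value could arrive either as bytes (from SetEnv) or as a string (from
middleware). The helper name `config_text` and the UTF-8 default are
assumptions for the example.

```python
def config_text(environ, key, encoding="utf-8"):
    """Hypothetical helper: return a configuration value as str."""
    value = environ[key]
    if isinstance(value, bytes):
        # Set by the server (e.g. via SetEnv): has to be decoded, and the
        # right encoding is a guess unless the server documents it.
        return value.decode(encoding)
    # Set by WSGI middleware: most likely already a string.
    return value


from_server = config_text({"trac.env_path": b"/usr/local/trac/site-1"}, "trac.env_path")
from_middleware = config_text({"trac.env_path": "/usr/local/trac/site-1"}, "trac.env_path")
```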

The underlying problem here is in part, albeit maybe from convention,
that there is a single dictionary for both request information and
user configuration. It isn't though a simple matter of splitting them
either so that request information is always separate. This is because
for FASTCGI, SCGI, CGI, you can't split them as only one grouping in
those cases.

This is why I specifically asked previously, and which no one has
answered, if bytes is to be used, which variables in WSGI environment
should be passed as bytes. If there is a known specified list of
variables which it is known will always be bytes, may be more
manageable. If someone is going to suggest that only CGI variables
should be bytes, then what does that actually mean. Remember that for
FASTCGI, SCGI, CGI there isn't really a distinction and so where the
boundary is as to what is a CGI variable is fuzzy although you could
reverse transformation and get back bytes if know what to do it for.

One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
QUERY_STRING and maybe that will suffice. It may not though, because
what about headers such as HTTP_REFERER? Also, what about additional
SSL_? variables that an SSL module for the web server may add?

Graham

> By choosing bytes, WSGI passes the encoding problem to the application,
> which is good. Let's the application deal with that. It's more likely to
> know what it needs, and what problem it can ignore. I think that 99% of
> the time, applications will just decode bytes to string using UTF-8,
> ignoring invalid values.
>
> However it's likely that we'll see middlewares converting ALL
> environment values to UTF-8, because it's more convienient than using
> bytes. And some middlewares might depend on `environ` values being
> string instead of bytes, because it's convenient too.
>
>
> This issue was already raised by Graham. And I think it's important to
> make it clear. I believe that 'server/CGI' values in the environment
> shouldn't be modified--Of course it should still be possible to add new
> values. This way the stack will always remain in a 'sane' state.
>
> For example if a middleware wants to convert environ values to UTF-8, it
> shouldn't do that:
>
>>   for key, value in environ.items():
>>       environ[key] = str(value)
>
> But something like this--assuming there's only bytes in `environ`:
>
>>   environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
>>                                     for key, value in environ.items())
>
> I'm in favor of using bytes everywhere. But it's important to document
> why bytes are used and how to use them. I'm not sure this should be
> included in a PEP, maybe a "WSGI best practices"?
>
>
> Cheers,
>
> --
>  Henry Prêcheur
>


Re: [Web-SIG] WSGI 2

2009-08-11 Thread Henry Precheur
Using bytes for all `environ` values is easy to understand on the
application side as long as you are aware of the encoding problem. The
cost is inconvenience, but that's probably OK. It's also simpler to
implement on the gateway/server side.

By choosing bytes, WSGI passes the encoding problem to the application,
which is good. Let the application deal with that. It's more likely to
know what it needs, and what problem it can ignore. I think that 99% of
the time, applications will just decode bytes to string using UTF-8,
ignoring invalid values.
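
A minimal sketch of what that decoding step might look like in an
application, using only the standard `bytes.decode` error handlers. The
sample value is invented for the example.

```python
raw = b"/caf\xc3\xa9/\xff"  # valid UTF-8 followed by one stray byte

# Silently drop anything that is not valid UTF-8.
lenient = raw.decode("utf-8", errors="ignore")

# Keep a visible marker (U+FFFD) where the invalid byte was.
visible = raw.decode("utf-8", errors="replace")

# Or refuse the request outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Which of the three policies is appropriate is exactly the kind of decision
that only the application can make.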

However it's likely that we'll see middlewares converting ALL
environment values to UTF-8, because it's more convenient than using
bytes. And some middlewares might depend on `environ` values being
string instead of bytes, because it's convenient too.


This issue was already raised by Graham. And I think it's important to
make it clear. I believe that 'server/CGI' values in the environment
shouldn't be modified--Of course it should still be possible to add new
values. This way the stack will always remain in a 'sane' state.

For example if a middleware wants to convert environ values to UTF-8, it
shouldn't do that:

>   for key, value in environ.items():
>       environ[key] = str(value)

But something like this--assuming there's only bytes in `environ`:

>   environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
>                                     for key, value in environ.items())

I'm in favor of using bytes everywhere. But it's important to document
why bytes are used and how to use them. I'm not sure this should be
included in a PEP, maybe a "WSGI best practices"?
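
Such a "best practice" could be sketched as a small middleware. This is
only an illustration under the "everything from the server is bytes"
assumption; the class name `UnicodeEnvironMiddleware`, the demo
application and the choice of errors="replace" are all made up here. Note
also that in Python 3 a bare str() on bytes returns the repr (for example
"b'foo'"), which is one more reason not to overwrite values in place.

```python
class UnicodeEnvironMiddleware:
    """Hypothetical middleware: add a decoded copy, leave originals alone."""

    def __init__(self, app, encoding="utf-8"):
        self.app = app
        self.encoding = encoding

    def __call__(self, environ, start_response):
        # Build the decoded view first, then store it under a new key, so
        # the server-provided bytes values are never modified.
        environ["unicode.environ"] = {
            key: value.decode(self.encoding, errors="replace")
            for key, value in environ.items()
            if isinstance(value, bytes)
        }
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [environ["unicode.environ"]["PATH_INFO"].encode("utf-8")]


environ = {"PATH_INFO": b"/caf\xc3\xa9", "wsgi.version": (1, 0)}
body = UnicodeEnvironMiddleware(demo_app)(environ, lambda status, headers: None)
```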


Cheers,

-- 
  Henry Prêcheur


Re: [Web-SIG] WSGI 2

2009-08-06 Thread William Dode

>> HTTP moves bytes, therefore WSGI should move bytes.  For practical reasons,
>> it would be good to *also* support strings on the application side,
>> especially for application migration.  However, I see no reason to make
>> *servers* provide decoded strings instead of bytes.

+1 because anyway if (most of the time) an app decides to reject 
everything not utf-8 it'll be very easy. And if not it will be possible, 
especially for old applications where we cannot upgrade the server and the 
client at the same time.


-- 
William Dodé - http://flibuste.net
Informaticien Indépendant



Re: [Web-SIG] WSGI 2

2009-08-05 Thread Randy Syring


Tres Seaver wrote:


> ideally, the
> framework will do this in a way which keeps the application writer
> blissfully ignorant of the distinction.

As an application developer, I would like to agree with the above.  I am 
going to rely on a good framework to handle a lot of these issues.  It 
seems that a lot of the discussion, while over my head, assumes that 
application developers are going to be working directly with WSGI.  
Technically, that is possible, but I think you should remember that most 
application developers are going to rely on a framework to give them a 
usable API.  My opinion, as an application developer, would be to keep 
WSGI as clean as possible and allow the frameworks to handle creating a 
good API that gives options for handling byte/character encoding issues. 

It's a lot easier to change/update a framework than a spec.  Keep WSGI as 
simple as possible and let the frameworks manage the more complicated 
aspects of character encoding and clean APIs.


Just my $0.02.

--
Randy Syring
RCS Computers & Web Solutions
502-644-4776
http://www.rcs-comp.com

"Whether, then, you eat or drink or 
whatever you do, do all to the glory
of God." 1 Cor 10:31





Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 10:53 AM 8/5/2009 +1000, Graham Dumpleton wrote:

> Now, the main reason why I am throwing around alternate suggestions in
> the first place is that last time although people seem to be
> comfortable moving along with the idea of latin-1 everywhere, I knew
> of some who weren't happy with that, some not on the list, and who
> believed it should be bytes, but they weren't speaking up.


I suspect that this was all a confusion to begin with; the primary 
function of Latin-1 in WSGI has been a way to represent bytes when 
all you have to represent them with is unicode strings.  So, even 
when we've been talking Latin-1, what we really mean is bytes.  ;-)
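
The point that Latin-1 is just a lossless way of carrying bytes inside a
unicode string can be checked with a short sketch. The sample text is
invented for the example.

```python
# Every possible byte value survives a Latin-1 round trip unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")

# Recovering the real text: the client sent UTF-8, the server handed it
# over as a Latin-1 decoded string, the application undoes that and then
# decodes with the encoding it actually expects.
wire = "caf\u00e9".encode("utf-8")     # b'caf\xc3\xa9'
carried = wire.decode("latin-1")       # looks like mojibake, but is lossless
recovered = carried.encode("latin-1").decode("utf-8")
```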


In general, I think we want to require that servers must provide 
bytes, and accept both bytes and Latin-1 (maybe just ASCII?) 
strings.  (I don't see a problem with environ keys being strings, 
though, since all the WSGI or CGI-defined keys are pure ASCII 
anyway.  But I could just as easily go with "bytes everywhere"; I 
assume Py3 treats all-ascii byte strings and the equivalent unicode 
as being equal and hashing alike.)
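
For what it's worth, a quick check under Python 3 suggests that the
assumption about equality does not hold, so bytes keys and string keys
cannot be used interchangeably in `environ`. The dictionary below is a
made-up example.

```python
environ = {"PATH_INFO": b"/"}

# In Python 3, bytes and str never compare equal, even for pure ASCII.
keys_equal = (b"PATH_INFO" == "PATH_INFO")

# Consequently a bytes key does not find an entry stored under a str key.
found_with_bytes = b"PATH_INFO" in environ
found_with_str = "PATH_INFO" in environ
```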




Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Y Knight


On Aug 4, 2009, at 8:53 PM, Graham Dumpleton wrote:


> 2. How would use of bytes work for a CGI-WSGI bridge given that
> os.environ is not bytes? Where does one get what encoding was used for
> os.environ values so it can be converted back to bytes?


On Unix it's simple enough:
On py2.X on Unix: environ is bytes already.
On py3.0: you're screwed, because some env vars were discarded already.
On py3.1+: 'string'.encode(sys.getfilesystemencoding(),  
'surrogateescape') should do it.
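
The py3.1+ case can be sketched as a round trip. To keep the example
deterministic it hard-codes UTF-8 where a real bridge would use
sys.getfilesystemencoding(); the sample value is invented.

```python
encoding = "utf-8"  # assumption: stands in for sys.getfilesystemencoding()

raw = b"/caf\xe9"   # Latin-1 bytes from the client, invalid as UTF-8

# What Python does when it builds os.environ: undecodable bytes become
# lone surrogates (PEP 383) instead of raising an error.
as_str = raw.decode(encoding, "surrogateescape")

# What a CGI-WSGI bridge would do to get the original bytes back.
recovered = as_str.encode(encoding, "surrogateescape")
```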


On Windows, I guess the OS environment is unicode, so, I don't know
precisely what to do to reversibly obtain the bytes sent from the
end-user's browser. It looks to me from the source code as if Apache will
encode the bytes from the client (utf-8 or otherwise!) as the Unicode
values 0x00 to 0xFF in the Windows environment, that is, as if
decoding the client input in latin-1. But it does that for the
following keys only:

HTTP_*
SERVER_*
REQUEST_*
QUERY_STRING
PATH_INFO
PATH_TRANSLATED
(from http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/arch/win32/mod_win32.c)

Other values are decoded from utf-8 (or, if passed through from an  
enclosing environment, passed through untouched -- via encoding into  
utf-8 for internal use and then decoding back from utf-8 to put back  
in the Windows environment.)
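
If that reading of mod_win32.c is right, a CGI->WSGI bridge on Windows
could reverse the transformation roughly like this. This is a sketch under
that assumption only; the function name `windows_env_to_bytes` is made up,
and the key list is simply the one given above.

```python
# Keys that (per the description above) carry client bytes as the code
# points 0x00-0xFF, i.e. as if decoded from Latin-1.
LATIN1_PREFIXES = ("HTTP_", "SERVER_", "REQUEST_")
LATIN1_KEYS = {"QUERY_STRING", "PATH_INFO", "PATH_TRANSLATED"}


def windows_env_to_bytes(key, value):
    """Hypothetical helper: recover the original bytes of an env value."""
    if key.startswith(LATIN1_PREFIXES) or key in LATIN1_KEYS:
        return value.encode("latin-1")
    # Everything else is described as having been decoded from UTF-8.
    return value.encode("utf-8")


path = windows_env_to_bytes("PATH_INFO", "/caf\u00c3\u00a9")
conf = windows_env_to_bytes("trac.env_path", "/caf\u00e9")
```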


I'll note that while it's important to get this transformation correct  
for a CGI->WSGI bridge to work right in Windows, and thus is  
definitely a useful discussion to have here, it doesn't actually need  
to be part of the WSGI spec.


James


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Graham Dumpleton
2009/8/5 P.J. Eby :
> So what, precisely, are you proposing should happen when such bytes are
> present?

Treat me as a business manager who has read just enough IT magazines
to be dangerous. As I have said before in prior discussions, my area
is C coding and trying to implement a Python hosting solution for
Apache, I do not know all the intricacies of HTTP and web application
development as I don't write web applications. That is why I defer to
you guys to come up with a workable specification. If I don't see
anything sensible coming back in the way of a proposal, I will try and
suggest my own, but because of my lack of knowledge it isn't
necessarily going to be right. In this respect, just pushing it back
on me isn't particularly helpful from my perspective. If you think
something is outright wrong and not going to work, then come back
with an overall solution which is going to work. So far no one else
has come back with an overall solution that works and everyone is
happy with and I seem to be the only one truly interested in
progressing this. As such it is really frustrating.

Now, the main reason why I am throwing around alternate suggestions in
the first place is that last time although people seem to be
comfortable moving along with the idea of latin-1 everywhere, I knew
of some who weren't happy with that, some not on the list, and who
believed it should be bytes, but they weren't speaking up. In this
discussion people are being more vocal about bytes being the way to go
and I am quite happy with that, we just need to flesh out the various
problems from going that way. So, let us put aside UTF-8 as a workable
solution for Python and focus then on bytes instead. We also need to
address other comments by people about whether status and headers
values in response should come back as bytes or strings to allow
predictability for WSGI middleware.

The questions around use of bytes in my mind are:

1. Should the values of all CGI variables be bytes or just a subset of
them? If a subset, which ones? Note am presuming here the name of
header, ie., key, will be a string and only value will be bytes. Is
that even a correct assumption?

2. How would use of bytes work for a CGI-WSGI bridge given that
os.environ is not bytes? Where does one get what encoding was used for
os.environ values so it can be converted back to bytes?

3. What are the rules about WSGI middleware in respect of preservation
of values as bytes? I can see too easily that people will convert
SCRIPT_NAME and PATH_INFO to string to do stuff with and change them
and then not convert them back to bytes if environ is modified with
new values. The rules would have to be clearly specified.

We then have the issues others have raised about response.

4. Should there be a choice about a WSGI application/middleware
returning bytes or a string which is automatically converted to bytes
per latin-1? If no choice, which should be required to be returned, bytes
or strings?

So, let's focus on these issues instead then and any others that people
have in relation to bytes or how responses are returned and so explore
that option.
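
On question 3, one way middleware could avoid the problem is to never
leave bytes in the first place. The sketch below moves the first PATH_INFO
segment onto SCRIPT_NAME using bytes operations only; the function name
`shift_path_bytes` and the assumption that both values are bytes are
illustrative, not part of any proposal so far.

```python
def shift_path_bytes(environ):
    """Hypothetical helper: shift one path segment, staying in bytes."""
    path = environ.get("PATH_INFO", b"")
    head, sep, tail = path.lstrip(b"/").partition(b"/")
    if not head:
        return None  # nothing left to shift
    environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", b"") + b"/" + head
    environ["PATH_INFO"] = sep + tail
    return head


environ = {"SCRIPT_NAME": b"", "PATH_INFO": b"/caf\xe9/page"}
segment = shift_path_bytes(environ)
```

Because no decoding happens, the non-UTF-8 segment passes through without
needing to know its encoding.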

Graham


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Robert Brewer
James Bennett wrote:
> On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight wrote:
>> But that works just fine today. Your WSGI app sends streaming data back
>> using the iterator functionality, and the server automatically turns it into
>> chunks if it's talking to an HTTP 1.1 client. What's the problem?
> 
> No, it doesn't work just fine today. Either the server has to assume
> that every response from that application should be chunked (which is
> wrong), or the application needs a way to tell the server to chunk.
> Turns out HTTP has a way to indicate that, but WSGI outright forbids
> its use. So instead you have to invent out-of-band mechanisms for the
> application to tell the server what to do, and in the process reinvent
> part of HTTP.

It doesn't have to be out of band; CherryPy's wsgiserver will send a response 
chunked if the application provides no Content-Length response header.

    if status == 413:
        # Request Entity Too Large. Close conn to avoid garbage.
        self.close_connection = True
    elif "content-length" not in hkeys:
        # "All 1xx (informational), 204 (no content),
        # and 304 (not modified) responses MUST NOT
        # include a message-body." So no point chunking.
        if status < 200 or status in (204, 205, 304):
            pass
        else:
            if self.response_protocol == 'HTTP/1.1':
                # Use the chunked transfer-coding
                self.chunked_write = True
                self.outheaders.append(("Transfer-Encoding", "chunked"))
            else:
                # Closing the conn is the only way to determine len.
                self.close_connection = True
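
For readers unfamiliar with what "chunked" means on the wire, here is a
minimal standalone sketch of the framing a server applies to the
application's iterable. It is not CherryPy's code; the generator name
`chunked` is made up for the example.

```python
def chunked(body_iter):
    """Hypothetical helper: frame byte strings as HTTP/1.1 chunks."""
    for piece in body_iter:
        if not piece:
            # A zero-length chunk would end the body early, so skip it.
            continue
        size = format(len(piece), "x").encode("ascii")
        yield size + b"\r\n" + piece + b"\r\n"
    # The last-chunk marker, followed by an empty trailer section.
    yield b"0\r\n\r\n"


wire = b"".join(chunked([b"hello", b"", b" world"]))
```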


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2

2009-08-04 Thread André Malo
* Jim Fulton wrote: 


> On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote:
> >
> > HTTP moves bytes, therefore WSGI should move bytes.  For practical
> > reasons, it would be good to *also* support strings on the application
> > side, especially for application migration.  However, I see no reason
> > to make *servers* provide decoded strings instead of bytes.
>
> +1
>
> I haven't had enough time to follow this and earlier encoding
> discussions and so haven't commented up to now, but I've always been
> uncomfortable with WSGI using anything but bytes or assuming any
> encoding.  I agree that application frameworks should deal with
> conversion between bytes and unicode.

Another +1 from the peanut gallery.

nd


Re: [Web-SIG] WSGI 2

2009-08-04 Thread André Malo
* Graham Dumpleton wrote: 


> Now, the reason why Apache can't really handle anything besides UTF-8
> relates to how filenames are encoded in the file system.
>
> Taking Windows first as it is the more obvious case. What Apache does
> there is take whatever path it has mapping to a script file, be it
> constructed partially from what is in Apache configuration and
> partially from what was supplied in URL from client, and converts it
> to UCS2 for passing to Windows file system routines. In converting to
> UCS2, Apache assumes that the path will be UTF-8. This means that the
> Apache configuration file has to be UTF-8 and that the URL as supplied
> by the client is UTF-8 as well after any URL character encoding is
> decoded. End result, can only handle UTF-8.

This is the only platform where the apache does that, actually, because it 
doesn't work any other way on windows (everything is passed to the system 
as ucs-2). So I wouldn't call that "apache requires utf-8 everywhere". If I 
would care, I would even make it configurable on windows, but I don't ;)

[...]

nd


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Y Knight

On Aug 4, 2009, at 2:08 PM, James Bennett wrote:

> the server has to assume
> that every response from that application should be chunked (which is
> wrong)


I'd expect the server to chunk every response from the application  
which is returned as an iterable instead of a list. Why do you say  
that's wrong?
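
A minimal sketch of the policy described above, not taken from any
particular server. The function name choose_framing and its return
labels are hypothetical; the only assumption is that a server can see
the application's return value and the headers passed to start_response.

```python
def choose_framing(result, headers, http_version="HTTP/1.1"):
    """Pick how a server might delimit the response body.

    Hypothetical helper: returns one of 'as-declared', 'content-length',
    'chunked' or 'close'.
    """
    declared = {name.lower() for name, _ in headers}
    if "content-length" in declared:
        # The application already told us the length; send the body as is.
        return "as-declared"
    if isinstance(result, (list, tuple)):
        # The whole body is in memory, so the server can measure it.
        return "content-length"
    if http_version == "HTTP/1.1":
        # Unknown length from an arbitrary iterable: chunk it.
        return "chunked"
    # HTTP/1.0 clients do not understand chunking; close the connection instead.
    return "close"
```

Under this policy the application never asks for chunking explicitly;
returning a generator rather than a list is the signal.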


James


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Bennett
On Tue, Aug 4, 2009 at 11:54 AM, James Y Knight wrote:
> But that works just fine today. Your WSGI app sends streaming data back
> using the iterator functionality, and the server automatically turns it into
> chunks if it's talking to an HTTP 1.1 client. What's the problem?

No, it doesn't work just fine today. Either the server has to assume
that every response from that application should be chunked (which is
wrong), or the application needs a way to tell the server to chunk.
Turns out HTTP has a way to indicate that, but WSGI outright forbids
its use. So instead you have to invent out-of-band mechanisms for the
application to tell the server what to do, and in the process reinvent
part of HTTP.


-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Bill Janssen
P.J. Eby  wrote:

> At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote:
> >2009/8/4 P.J. Eby :
> > > I'm not clear on your logic here.  If I request foo/bar/baz (where baz
> > > actually has an accent over the 'a') in latin-1 encoding, and
> > > foo/bar is the script, then the (accented) baz is legitimate for
> > > pass-through to the application, no?
> >
> >Technically, but what I am pointing out is that Apache pretty well
> >says that foo/bar needs to be UTF-8.
> 
> Which doesn't change the fact that you haven't yet proposed what a 
> WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and 
> QUERY_STRING.  Apache can and does pass through such bytes, so the 
> spec needs to say what we do with them.

Particularly QUERY_STRING.  The original thinking around urlencoded was
that it was always Latin-1.  You were supposed to use
"multipart/form-data" for non-Latin-1 encodings.  Long thread on
www-talk circa 1994 about this.

I think bytes are the safest way to go here.  It would be nice if we
could automagically detect the correct encoding, but there's no
foolproof way of doing that.
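
To illustrate why detection cannot be foolproof, here is a small sketch
(the helper name candidate_decodings is hypothetical). The same bytes
can decode without error under more than one encoding, and nothing in
the bytes themselves says which result was intended.

```python
def candidate_decodings(raw, encodings=("utf-8", "latin-1")):
    """Return every decoding of raw that succeeds, keyed by encoding name."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This encoding is ruled out, but the survivors are all "valid".
            pass
    return results


# Two bytes that are valid UTF-8 and also valid Latin-1, with different meanings.
ambiguous = candidate_decodings(b"caf\xc3\xa9")

# A byte sequence that is valid Latin-1 only.
latin1_only = candidate_decodings(b"caf\xe9")
```

Latin-1 accepts every byte sequence, so it can never be ruled out by
validity alone.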

Bill


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Tres Seaver
Jim Fulton wrote:
> On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote:
>> At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:
>>> In summary, what are the practical uses cases that would make passing
>>> bytes over UTF-8 or even latin-1 worthwhile?
>> My concern at this point is a nagging feeling that we are abandoning
>> WSGI<->HTTP equivalence for convenience in the face of changes in Python's
>> defaults.  Had Python 3 been the standard version in existence when WSGI 1
>> was created, I would've argued for making *everything* bytes, in order to:
>>
>> 1. Force all encodings to be explicit, and
>> 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python
>> objects)
>>
>> And this is why the original spec said that Unicode strings should be
>> treated as bytes -- because byte strings were always the original target of
>> the spec.
>>
>> Please remember that WSGI is not primarily intended to provide application
>> developers with a convenient API; its first and most important job is to
>> ship the data around without mangling it in the process.
>>
>> HTTP moves bytes, therefore WSGI should move bytes.  For practical reasons,
>> it would be good to *also* support strings on the application side,
>> especially for application migration.  However, I see no reason to make
>> *servers* provide decoded strings instead of bytes.
> 
> +1
> 
> I haven't had enough time to follow this and earlier encoding
> discussions and so haven't commented up to now, but I've always been
> uncomfortable with WSGI using anything but bytes or assuming any
> encoding.  I agree that application frameworks should deal with
> conversion between bytes and unicode.

+1 from me as well.  The fact that Python3 now calls 'string' what used
to be 'unicode' doesn't change the fact that "transport-level"
operations have to be done in bytes.  It should be the framework /
application's job to handle conversion of byte inputs from the request
onto strings, and string response fields onto bytes:  ideally, the
framework will do this in a way which keeps the application writer
blissfully ignorant of the distinction.
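
As a rough sketch of that division of labour, assume a hypothetical
all-bytes interface in which environ keys, the status and the headers
are bytes. This is an assumption for illustration only and is not the
WSGI 1.0 interface. The wrapper name text_app is also hypothetical.

```python
def text_app(handler, charset="utf-8"):
    """Wrap handler(path: str) -> str as a bytes-in, bytes-out callable.

    The framework, not the server, chooses the charset and does both
    conversions, so the handler only ever sees str.
    """
    def app(environ, start_response):
        # Decode at the request boundary (bytes environ is an assumption).
        path = environ[b"PATH_INFO"].decode(charset)
        # Encode at the response boundary.
        body = handler(path).encode(charset)
        start_response(b"200 OK", [
            (b"Content-Type", b"text/plain; charset=" + charset.encode("ascii")),
            (b"Content-Length", str(len(body)).encode("ascii")),
        ])
        return [body]
    return app
```

The handler author writes plain string code; a different charset policy
means changing only the wrapper.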

Note that I think Python3 gets the os.environ bit wrong for exactly the
same reasons:  I think anybody wanting to use the
environment-as-provided-by-the-OS should deal in bytes (or whatever the
OS provides), with a convenience wrapper for those who don't care about
the difference.  I lost that argument, but that doesn't mean I was wrong. :)


Tres.
-- 
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   "Excellence by Design"http://palladion.com


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Ian Bicking
On Mon, Aug 3, 2009 at 11:28 PM, Graham
Dumpleton wrote:
>> Mainly I'm wondering, what should the server do in the event they receive a
>> byte string which is not valid UTF-8?  (Latin-1 doesn't have this problem,
>> since there's no such thing as an invalid Latin-1 string, at least not at
>> the encoding level.)
>
> Can you clarify. We aren't talking about request content here. The
> wsgi.input stream is still binary and up to WSGI application to decode
> how it decides it should be decoded.

You could receive something like
  GET /fran%E7ais
which if you do:
  urllib.unquote('/fran%E7ais').decode('utf8')
you will get an error.
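
The snippet above is Python 2. A rough Python 3 equivalent, as a sketch
using urllib.parse.unquote_to_bytes from the standard library, shows the
same failure and also that Latin-1 decodes the same bytes without error:

```python
from urllib.parse import unquote_to_bytes

# Undo the %-escaping only; the result is raw bytes with no charset attached.
raw = unquote_to_bytes("/fran%E7ais")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE7 starts a multi-byte UTF-8 sequence, but 'a' is not a continuation byte.
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this cannot fail.
as_latin1 = raw.decode("latin-1")
```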

So what should the server do?  Obviously anyone at any time can embed
a link to such a URL in a document, and the browser is not going to
try to figure out that encoding, it's just going to follow that URL.

From my testing (in
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py) the browser
will be consistent about UTF8 when it does the encoding itself; but it
doesn't necessarily do the encoding itself.  QUERY_STRING will *not*
necessarily be UTF8, even when the path is UTF8 (but this doesn't
matter for us, because QUERY_STRING doesn't get url-decoded, so it's
just ASCII with %-encoding).

> The only related thing I can think you are talking about is the form
> target URL, which is an issue for GET and POST requests, or other
> method types, from a form.
>
>>> Also shown though that SCRIPT_NAME part has to be UTF-8
>>> and we would really be entering fantasy land if you were somehow going
>>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>>> particular area means you are effectively bound to use UTF-8
>>> everywhere else.
>>
>> I'm not clear on your logic here.  If I request foo/bar/baz (where baz
>> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
>> script, then the (accented) baz is legitimate for pass-through to the
>> application, no?
>
> Technically, but what I am pointing out is that Apache pretty well
> says that foo/bar needs to be UTF-8. If you are going to have
> different parts of the one URL needing a different encoding to be
> understood, personally I would say you asking for trouble. So, am
> saying that UTF-8 needs to really apply more for sake of sanity and
> portability.

Apache's limitations can't be encoded into WSGI.  Yes, it won't work
with Apache (I guess, though with ProxyPass / or something, is this a
problem?) -- but the idea of mapping request paths to files has
nothing to do with WSGI.

>> I just tried testing this with Firefox and Apache, and found that you can in
>> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
>> case of Firefox, you have to %-escape them.  However, they are seen by
>> Python (via os.environ) as latin-1 encoded byte strings.
>
> By using % escapes you are in practice overriding the encoding that
> the browser may be applying to URL if given raw character? What
> happens if you were to paste the accented character direct into the
> browser URL bar? Browsers I have played with would normally
> automatically translate that as UTF-8 and send it as such, with %
> encoding as necessary.

Correct; the browser encodes non-ASCII characters as UTF8, but does
not try to inspect the encoding of already %-encoded characters.
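
That behaviour can be imitated with urllib.parse.quote as a sketch. This
models what the browser does; it is not browser code. Passing safe="/%"
is just a way to leave existing escapes untouched.

```python
from urllib.parse import quote

# A raw non-ASCII character: encoded as UTF-8, then %-escaped (quote's default).
typed_in_url_bar = quote("/fran\u00e7ais")

# The same path escaped by a page author who assumed Latin-1.
authored_as_latin1 = quote("/fran\u00e7ais", encoding="latin-1")

# An already %-encoded URL is passed along as is, whatever its encoding was.
followed_link = quote("/fran%E7ais", safe="/%")
```

The server therefore can see either form for the same intended
character, depending on who did the escaping.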

> So I guess the problem is more where URLs are already % encoded when
> coming back as href or form action because they may be in an encoding
> incompatible with UTF-8 if it were to be clicked on.
>
>>> Further example of why UTF-8 reaches into everything is mod_rewrite
>>> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
>>> configuration file has to be UTF-8. If URL isn't, then wouldn't be
>>> possible to perform matches against non latin-1 characters in a
>>> rewrite condition or rule. This is because your match string would be
>>> in different encoded form to that in URL and so wouldn't match.
>>
>> Note that this still doesn't have any impact on the bytes that actually
>> reach the application, which can be non-UTF8.  At minimum, the proposal is
>> underspecified as to how to handle this case, which is as trivial to
>> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
>> of a URL.
>
> The Apache server at least will decode those % escape sequence and I
> believe it is the result of that which is used in stuff like rewrite
> rule matches, not the raw URL. The only exception would be if rewrite
> rule explicit matched against REQUEST_URI variable which still
> contains % escape sequences. So if not in UTF-8, means effectively
> that you can't then match them with Apache rewrite rules then.

Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Y Knight


On Aug 4, 2009, at 12:38 PM, James Bennett wrote:


> TBH, WSGI doesn't expose enough of HTTP's functionality to convince me
> that this is a good argument. When I can use advanced HTTP features
> (chunked transfer and friends) from a WSGI app, maybe I'll feel
> differently.


But that works just fine today. Your WSGI app sends streaming data  
back using the iterator functionality, and the server automatically  
turns it into chunks if it's talking to an HTTP 1.1 client. What's the  
problem?
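
A minimal sketch of that arrangement: the application yields blocks and
sets no Content-Length, and the server side frames each block using the
HTTP/1.1 chunked transfer coding. The names streaming_app and
chunk_encode are hypothetical, and real servers do more (flushing,
trailers, HTTP/1.0 fallback).

```python
def streaming_app(environ, start_response):
    """WSGI application that streams its body from a generator."""
    start_response("200 OK", [("Content-Type", "text/plain")])

    def body():
        for i in range(3):
            yield ("line %d\n" % i).encode("ascii")

    return body()


def chunk_encode(blocks):
    """Frame an iterable of byte blocks as an HTTP/1.1 chunked body."""
    for block in blocks:
        if block:  # a zero-length chunk would end the body early
            yield b"%x\r\n" % len(block) + block + b"\r\n"
    yield b"0\r\n\r\n"  # terminating chunk, no trailers
```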


James


Re: [Web-SIG] WSGI 2

2009-08-04 Thread Jim Fulton
On Tue, Aug 4, 2009 at 12:05 PM, P.J. Eby wrote:
> At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:
>>
>> In summary, what are the practical uses cases that would make passing
>> bytes over UTF-8 or even latin-1 worthwhile?
>
> My concern at this point is a nagging feeling that we are abandoning
> WSGI<->HTTP equivalence for convenience in the face of changes in Python's
> defaults.  Had Python 3 been the standard version in existence when WSGI 1
> was created, I would've argued for making *everything* bytes, in order to:
>
> 1. Force all encodings to be explicit, and
> 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python
> objects)
>
> And this is why the original spec said that Unicode strings should be
> treated as bytes -- because byte strings were always the original target of
> the spec.
>
> Please remember that WSGI is not primarily intended to provide application
> developers with a convenient API; its first and most important job is to
> ship the data around without mangling it in the process.
>
> HTTP moves bytes, therefore WSGI should move bytes.  For practical reasons,
> it would be good to *also* support strings on the application side,
> especially for application migration.  However, I see no reason to make
> *servers* provide decoded strings instead of bytes.

+1

I haven't had enough time to follow this and earlier encoding
discussions and so haven't commented up to now, but I've always been
uncomfortable with WSGI using anything but bytes or assuming any
encoding.  I agree that application frameworks should deal with
conversion between bytes and unicode.

Jim

-- 
Jim Fulton


Re: [Web-SIG] WSGI 2

2009-08-04 Thread James Bennett
On Tue, Aug 4, 2009 at 11:05 AM, P.J. Eby wrote:
> 1. Force all encodings to be explicit, and

This can be handled without forcing application authors to work with
bytestrings (or forcing them to remember to coerce to bytestrings
before returning responses).

> 2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python
> objects)

TBH, WSGI doesn't expose enough of HTTP's functionality to convince me
that this is a good argument. When I can use advanced HTTP features
(chunked transfer and friends) from a WSGI app, maybe I'll feel
differently.

> Please remember that WSGI is not primarily intended to provide application
> developers with a convenient API; its first and most important job is to
> ship the data around without mangling it in the process.

Which it should try very hard to do without forcing *in*convenient
APIs onto developers.

> So I would ask, what is the practical use case for having the server decode
> bytes into strings, instead of leaving them as bytes?

Well, Django (for one example) already does some gymnastics to ensure
that character encoding issues are kept at the request/response
boundary, largely because it's an utter pain for an application
developer to have an API dump a bunch of bytestrings in your lap and
say "here, *you* figure it out". I suspect we're going to keep on
doing that, since it's a big win in terms of usability for application
developers (who end up having to deal with only a drastically-reduced
subset of character-encoding problems).


-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."


Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 10:44 PM 8/4/2009 +1000, Graham Dumpleton wrote:

> In summary, what are the practical uses cases that would make passing
> bytes over UTF-8 or even latin-1 worthwhile?


My concern at this point is a nagging feeling that we are abandoning 
WSGI<->HTTP equivalence for convenience in the face of changes in 
Python's defaults.  Had Python 3 been the standard version in 
existence when WSGI 1 was created, I would've argued for making 
*everything* bytes, in order to:


1. Force all encodings to be explicit, and
2. Ensure WSGI<->HTTP equivalence (i.e., WSGI==HTTP encoded in Python objects)

And this is why the original spec said that Unicode strings should be 
treated as bytes -- because byte strings were always the original 
target of the spec.


Please remember that WSGI is not primarily intended to provide 
application developers with a convenient API; its first and most 
important job is to ship the data around without mangling it in the process.


HTTP moves bytes, therefore WSGI should move bytes.  For practical 
reasons, it would be good to *also* support strings on the 
application side, especially for application migration.  However, I 
see no reason to make *servers* provide decoded strings instead of bytes.


So I would ask, what is the practical use case for having the server 
decode bytes into strings, instead of leaving them as bytes?




Re: [Web-SIG] WSGI 2

2009-08-04 Thread P.J. Eby

At 02:28 PM 8/4/2009 +1000, Graham Dumpleton wrote:

> 2009/8/4 P.J. Eby :
> > I'm not clear on your logic here.  If I request foo/bar/baz (where baz
> > actually has an accent over the 'a') in latin-1 encoding, and
> > foo/bar is the script, then the (accented) baz is legitimate for
> > pass-through to the application, no?
>
> Technically, but what I am pointing out is that Apache pretty well
> says that foo/bar needs to be UTF-8.


Which doesn't change the fact that you haven't yet proposed what a 
WSGI server should *do* with such non-UTF8 bytes in PATH_INFO and 
QUERY_STRING.  Apache can and does pass through such bytes, so the 
spec needs to say what we do with them.
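
For concreteness, here is a sketch of the behaviours a server could pick
from when it meets such bytes. These are options Python offers, not
proposals made in this thread, and the helper name decode_options is
hypothetical. The 'surrogateescape' handler assumes Python 3.1 or later.

```python
def decode_options(raw):
    """Show what each decoding policy does with the same raw bytes."""
    options = {}
    try:
        options["utf-8 strict"] = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Strict decoding means the request has to be rejected somehow.
        options["utf-8 strict"] = exc
    # Lossy: the offending byte becomes U+FFFD and cannot be recovered.
    options["utf-8 replace"] = raw.decode("utf-8", "replace")
    # Lossless: undecodable bytes are smuggled through as lone surrogates.
    options["utf-8 surrogateescape"] = raw.decode("utf-8", "surrogateescape")
    # Lossless: every byte maps to the code point with the same number.
    options["latin-1"] = raw.decode("latin-1")
    return options


options = decode_options(b"/fran\xe7ais")
```

Only the last two let the application get the original bytes back by
re-encoding with the same codec and error handler.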




> If you are going to have
> different parts of the one URL needing a different encoding to be
> understood, personally I would say you asking for trouble. So, am
> saying that UTF-8 needs to really apply more for sake of sanity and
> portability.


So what, precisely, are you proposing should happen when such bytes 
are present?




> So I guess the problem is more where URLs are already % encoded when
> coming back as href or form action because they may be in an encoding
> incompatible with UTF-8 if it were to be clicked on.


Yep, that's the case with "standard" browsers and servers; 
less-standard situations such as spiders and scripts generating or 
following URLs are also relevant, as are deliberate hack 
attempts.  So having the result of this behavior be undefined is a bad thing.




> The Apache server at least will decode those % escape sequence and I
> believe it is the result of that which is used in stuff like rewrite
> rule matches, not the raw URL. The only exception would be if rewrite
> rule explicit matched against REQUEST_URI variable which still
> contains % escape sequences. So if not in UTF-8, means effectively
> that you can't then match them with Apache rewrite rules then.


That's got nothing to do with what you propose for WSGI to do with 
the rest of it, though.


(However, your belief may be incorrect in any event, as this page:

   http://www.dracos.co.uk/code/apache-rewrite-problem/

claims that mod_rewrite can RewriteCond on THE_REQUEST in order to 
match still-encoded paths.)




Re: [Web-SIG] WSGI 2

2009-08-04 Thread Etienne Robillard
Graham Dumpleton wrote:
> Ian, know you have seen this before, but didn't realise you hadn't
> cc'd the list. I have added a new response to part 4 of what you
> originally sent that wasn't in first reply that went direct to you.
> 
> 2009/8/4 Ian Bicking :
>> On Mon, Aug 3, 2009 at 7:38 PM, Graham
>> Dumpleton wrote:
>>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>>> what I was going to implement.
>>>
>>> 1. When running under Python 3, applications SHOULD produce bytes
>>> output, status line and headers.
>> Sure.
>>
>>> This is effectively what we had before. The only difference is that
>>> clarify that the 'status line' values should also be bytes. This
>>> wasn't noted before. I had already updated the proposed WSGI 1.0
>>> amendments page to mention this.
>>>
>>> 2. When running under Python 3, servers and gateways MUST accept
>>> strings for output, status line and headers. Such strings must be
>>> converted to bytes output using 'latin-1'. If string cannot be
>>> converted then is treated as an error.
>>>
>>> This is again what we had before except that mention 'status line' value.
>> Sure.  ASCII for the status would be acceptable, as I believe that is
>> an HTTP constraint.
>>
>>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>>> binary (byte) input stream.
>>>
>>> No change here.
>> Yep.
>>
>>> 4. When running under Python 3, servers MUST provide a text stream for
>>> wsgi.errors. In converting this to a byte stream for writing to a
>>> file, the default encoding would be applied.
>>>
>>> No real change here except to clarify that default encoding would
>>> apply. Use of default encoding though could be problematic if
>>> combining different WSGI components. This is because each WSGI
>>> component may have been developed on system with different default
>>> encoding and so one may expect to log characters that can't be written
>>> on a different setup. Not sure how you could solve that except to say
>>> people have default encoding be UTF-8 for portability.
>> Sure.  We might specify that the server should never give an encoding
>> error; it should use 'replace' or something to make sure it won't
>> fail.  Maybe it should be specified what should happen when bytes are
>> received.  I generally believe that error handling code should try to
>> be as robust as possible, so it shouldn't fail regardless of what it
>> is given.
> 
> Not that it matters, but looks like that for Apache/mod_wsgi
> wsgi.errors should be an instance of io.TextIOWrapper wrapping
> internal mod_wsgi specific buffer object providing interface
> compatible with io.BufferedIOBase. If someone uses write() on wrapper
> with bytes it will fail:
> 
>   TypeError: write() argument 1 must be str, not bytes
> 
> If someone use print() to output data, then bytes would be converted
> okay. That is:
> 
>   print(b'1234', file=environ['wsgi.errors'])
> 
> yields:
> 
>   b'1234'.
> 
> If 'replace' is used for errors, you do end up with data loss. Use of
> 'xmlcharrefreplace' at least preserves values as numbers, although for
> Apache at least, if use 'ascii' encoding, you get a bit of a mess as
> the backslashes get escaped again.
> 
> \\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10
> 
> instead of original:
> 
> \u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10
> 
> That is because Apache logging functions escape anything which isn't
> printable ASCII and in turn escapes backslash denoting escaped
> character.
> 
> If use encoding of utf-8 instead, then byte values get passed and
> Apache logging functions then just escape the non printable bytes
> instead so all up looks nicer.
> 
> \xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
> \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90
> 
> So for Apache/mod_wsgi at least, best thing to do seems to use
> 'replace' and 'utf-8' due to way that Apache error logging functions
> work.
> 
> I guess the point from this is that possibly should specify that
> wsgi.errors should be an instance of io.TextIOWrapper. A specific
> implementation should not use 'strict', but use 'replace' or
> 'backslashreplace' as makes sense, dependent on what encoding it needs
> to use and how any underlying logging system it overlays works. The
> intent overall being to preserve as much of raw information as
> possible.
> 
>>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>>> server variables as strings. Where such values are sourced from a byte
>>> string, be that a Python byte string or C string, they should be
>>> converted as 'UTF-8'. If a specific web server infrastructure is able
>>> to supp

Re: [Web-SIG] WSGI 2

2009-08-04 Thread Graham Dumpleton
Ian, know you have seen this before, but didn't realise you hadn't
cc'd the list. I have added a new response to part 4 of what you
originally sent that wasn't in first reply that went direct to you.

2009/8/4 Ian Bicking :
> On Mon, Aug 3, 2009 at 7:38 PM, Graham
> Dumpleton wrote:
>> So, for WSGI 1.0 style of interface and Python 3.0, the following is
>> what I was going to implement.
>>
>> 1. When running under Python 3, applications SHOULD produce bytes
>> output, status line and headers.
>
> Sure.
>
>> This is effectively what we had before. The only difference is that
>> clarify that the 'status line' values should also be bytes. This
>> wasn't noted before. I had already updated the proposed WSGI 1.0
>> amendments page to mention this.
>>
>> 2. When running under Python 3, servers and gateways MUST accept
>> strings for output, status line and headers. Such strings must be
>> converted to bytes output using 'latin-1'. If string cannot be
>> converted then is treated as an error.
>>
>> This is again what we had before except that mention 'status line' value.
>
> Sure.  ASCII for the status would be acceptable, as I believe that is
> an HTTP constraint.
>
>> 3. When running under Python 3, servers MUST provide wsgi.input as a
>> binary (byte) input stream.
>>
>> No change here.
>
> Yep.
>
>> 4. When running under Python 3, servers MUST provide a text stream for
>> wsgi.errors. In converting this to a byte stream for writing to a
>> file, the default encoding would be applied.
>>
>> No real change here except to clarify that default encoding would
>> apply. Use of default encoding though could be problematic if
>> combining different WSGI components. This is because each WSGI
>> component may have been developed on system with different default
>> encoding and so one may expect to log characters that can't be written
>> on a different setup. Not sure how you could solve that except to say
>> people have default encoding be UTF-8 for portability.
>
> Sure.  We might specify that the server should never give an encoding
> error; it should use 'replace' or something to make sure it won't
> fail.  Maybe it should be specified what should happen when bytes are
> received.  I generally believe that error handling code should try to
> be as robust as possible, so it shouldn't fail regardless of what it
> is given.

Not that it matters, but it looks like for Apache/mod_wsgi,
wsgi.errors should be an instance of io.TextIOWrapper wrapping an
internal mod_wsgi-specific buffer object that provides an interface
compatible with io.BufferedIOBase. If someone calls write() on the
wrapper with bytes, it will fail:

  TypeError: write() argument 1 must be str, not bytes

If someone uses print() to output data, then bytes are converted
okay (print() writes their repr). That is:

  print(b'1234', file=environ['wsgi.errors'])

yields:

  b'1234'.
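
Both behaviours can be reproduced outside Apache with a plain
io.TextIOWrapper. In this sketch an io.BytesIO stands in for the
mod_wsgi buffer object, which is an assumption made only so the example
is self-contained.

```python
import io

# Stand-in for the server's internal byte buffer (assumption for this sketch).
buffer = io.BytesIO()
errors = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")

try:
    errors.write(b"1234")  # a text stream only accepts str
    write_failed = False
except TypeError:
    write_failed = True

# print() converts its argument with str(), which for bytes gives the repr.
print(b"1234", file=errors)
errors.flush()

logged = buffer.getvalue()
```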

If 'replace' is used for errors, you do end up with data loss. Use of
'backslashreplace' at least preserves values as numbers, although for
Apache at least, if you use the 'ascii' encoding, you get a bit of a mess
as the backslashes get escaped again:

\\u0e40\\u0e2d\\u0e01\\u0e23\\u0e31\\u0e15\\u0e19\\u0e4c\\u0e40\\u0e25\\u0e47\\u0e01\\u0e1e\\u0e23\\u0e1b\\u0e23\\u0e30\\u0e40\\u0e2a\\u0e23\\u0e34\\u0e10

instead of original:

\u0e40\u0e2d\u0e01\u0e23\u0e31\u0e15\u0e19\u0e4c%20\u0e40\u0e25\u0e47\u0e01\u0e1e\u0e23\u0e1b\u0e23\u0e30\u0e40\u0e2a\u0e23\u0e34\u0e10

That is because the Apache logging functions escape anything which isn't
printable ASCII, and in turn escape the backslash that denotes an escaped
character.

If you use an encoding of utf-8 instead, then the byte values get passed
through, and the Apache logging functions just escape the non-printable
bytes instead, so all up it looks nicer.

\xe0\xb9\x80\xe0\xb8\xad\xe0\xb8\x81\xe0\xb8\xa3\xe0\xb8\xb1\xe0\xb8\x95\xe0\xb8\x99\xe0\xb9\x8c
\xe0\xb9\x80\xe0\xb8\xa5\xe0\xb9\x87\xe0\xb8\x81\xe0\xb8\x9e\xe0\xb8\xa3\xe0\xb8\x9b\xe0\xb8\xa3\xe0\xb8\xb0\xe0\xb9\x80\xe0\xb8\xaa\xe0\xb8\xa3\xe0\xb8\xb4\xe0\xb8\x90

So for Apache/mod_wsgi at least, the best thing to do seems to be to use
'replace' and 'utf-8', due to the way the Apache error logging functions
work.

I guess the point from this is that we possibly should specify that
wsgi.errors should be an instance of io.TextIOWrapper. A specific
implementation should not use 'strict', but use 'replace' or
'backslashreplace' as makes sense, depending on what encoding it needs
to use and how any underlying logging system it overlays works. The
overall intent is to preserve as much of the raw information as
possible.
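
A sketch of what such a stream could look like, with io.BytesIO standing
in for whatever the server writes log bytes to. The factory name
make_errors_stream and its defaults are hypothetical, not mod_wsgi's
actual implementation.

```python
import io


def make_errors_stream(raw, encoding="utf-8", errors="backslashreplace"):
    """Build a text stream for wsgi.errors that never raises on encoding."""
    # newline="\n" disables platform newline translation on write.
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")


def log_bytes(text, **options):
    """Return the bytes that reach the underlying log for the given text."""
    raw = io.BytesIO()
    stream = make_errors_stream(raw, **options)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


thai = "\u0e40\u0e2d\u0e01"
as_utf8 = log_bytes(thai)                                    # raw UTF-8 bytes
as_ascii_escaped = log_bytes(thai, encoding="ascii")         # \uXXXX escapes
as_ascii_replaced = log_bytes(thai, encoding="ascii", errors="replace")  # lossy
```

With 'replace' the original characters are gone from the log, whereas
the escaped and UTF-8 forms can both be turned back into the original
text.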

>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, 

Re: [Web-SIG] WSGI 2

2009-08-03 Thread Graham Dumpleton
2009/8/4 P.J. Eby :
>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, but this is entirely
>> optional. Note that there is no requirement to deal with RFC 2047.
>>
>> This is where I am going to diverge from what has been discussed before.
>>
>> The reason I am going to pass as UTF-8 and not latin-1 is that it
>> looks like Apache effectively only supports use of UTF-8. Since this
>> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
>> even CGI likely cannot handle anything besides UTF-8 then I really
>> can't see the point of trying to cater for a theoretical possibility
>> that some HTTP client could use something besides UTF-8. In other
>> words, the predominant case will be UTF-8, so let us target that.
>>
>> So, rather than burden every WSGI application with the need to convert
>> from latin-1 back to bytes and then to UTF-8, let the server deal with
>> it, with server using sensible default, and where server
>> infrastructure can handle a different encoding, then it can provide
>> option to use that encoding and WSGI application doesn't need to
>> change.
>
> Maybe I'm missing something here, but what if Apache receives something
> encoded in Latin-1?  AFAIR, form POST encoding is determined by the encoding
> of the page containing the form; that's of course something that only
> happens in the input body, but what about URLs?
>
> Mainly I'm wondering, what should the server do in the event they receive a
> byte string which is not valid UTF-8?  (Latin-1 doesn't have this problem,
> since there's no such thing as an invalid Latin-1 string, at least not at
> the encoding level.)

Can you clarify? We aren't talking about request content here. The
wsgi.input stream is still binary, and it is up to the WSGI application
to decide how it should be decoded.

The only related thing I can think you are talking about is the form
target URL, which is an issue for GET and POST requests, or other
method types, from a form.

>> Also shown though that SCRIPT_NAME part has to be UTF-8
>> and we would really be entering fantasy land if you were somehow going
>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>> particular area means you are effectively bound to use UTF-8
>> everywhere else.
>
> I'm not clear on your logic here.  If I request foo/bar/baz (where baz
> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
> script, then the (accented) baz is legitimate for pass-through to the
> application, no?

Technically, but what I am pointing out is that Apache pretty well
says that foo/bar needs to be UTF-8. If you are going to have
different parts of the one URL needing a different encoding to be
understood, personally I would say you are asking for trouble. So, I am
saying that UTF-8 really needs to apply throughout, for the sake of
sanity and portability.

> I just tried testing this with Firefox and Apache, and found that you can in
> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
> case of Firefox, you have to %-escape them.  However, they are seen by
> Python (via os.environ) as latin-1 encoded byte strings.

By using % escapes, aren't you in practice overriding the encoding that
the browser would apply to the URL if given a raw character? What
happens if you paste the accented character directly into the
browser URL bar? Browsers I have played with would normally
automatically translate that as UTF-8 and send it as such, with %
encoding as necessary.

So I guess the problem is more with URLs that are already % encoded when
they come back in an href or form action, because those escapes may be in
an encoding incompatible with UTF-8 when the link is clicked.
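
As an illustration of the two cases, here is a minimal sketch using
Python 3's urllib.parse. The example path is made up, and real browsers
may differ in which characters they escape:

```python
from urllib.parse import quote, unquote_to_bytes

path = "/foo/bar/b\u00e1z"  # "baz" with an accented 'a' (U+00E1)

# A character typed into the URL bar, encoded as UTF-8 and then %-escaped.
as_utf8 = quote(path)                        # '/foo/bar/b%C3%A1z'
# A link that was already %-escaped using Latin-1 bytes.
as_latin1 = quote(path, encoding="latin-1")  # '/foo/bar/b%E1z'

# After undoing the %-escapes the server only has bytes.
raw_utf8 = unquote_to_bytes(as_utf8)
raw_latin1 = unquote_to_bytes(as_latin1)

print(raw_utf8.decode("utf-8") == path)  # the UTF-8 form round-trips
try:
    raw_latin1.decode("utf-8")
except UnicodeDecodeError as exc:
    print("Latin-1 escapes are not valid UTF-8:", exc.reason)
```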

>> Further example of why UTF-8 reaches into everything is mod_rewrite
>> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
>> configuration file has to be UTF-8. If URL isn't, then wouldn't be
>> possible to perform matches against non latin-1 characters in a
>> rewrite condition or rule. This is because your match string would be
>> in different encoded form to that in URL and so wouldn't match.
>
> Note that this still doesn't have any impact on the bytes that actually
> reach the application, which can be non-UTF8.  At minimum, the proposal is
> underspecified as to how to handle this case, which is as trivial to
> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
> of a URL.

The Apache server at 

Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 10:38 AM 8/4/2009 +1000, Graham Dumpleton wrote:

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.
>
> This is effectively what we had before. The only difference is that
> clarify that the 'status line' values should also be bytes. This
> wasn't noted before. I had already updated the proposed WSGI 1.0
> amendments page to mention this.

+1



> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If string cannot be
> converted then is treated as an error.
>
> This is again what we had before except that mention 'status line' value.
>
> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.
>
> No change here.
>
> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.
>
> No real change here except to clarify that default encoding would
> apply. Use of default encoding though could be problematic if
> combining different WSGI components. This is because each WSGI
> component may have been developed on system with different default
> encoding and so one may expect to log characters that can't be written
> on a different setup. Not sure how you could solve that except to say
> people have default encoding be UTF-8 for portability.


Also +1.



> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.
>
> This is where I am going to diverge from what has been discussed before.
>
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.
>
> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.


Maybe I'm missing something here, but what if Apache receives 
something encoded in Latin-1?  AFAIR, form POST encoding is 
determined by the encoding of the page containing the form; that's of 
course something that only happens in the input body, but what about URLs?


Mainly I'm wondering, what should the server do in the event they 
receive a byte string which is not valid UTF-8?  (Latin-1 doesn't 
have this problem, since there's no such thing as an invalid Latin-1 
string, at least not at the encoding level.)
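
To make the difference concrete, here is a minimal sketch assuming
Python 3. decode_utf8_or_none is a hypothetical helper, not something
any WSGI server provides:

```python
# Every possible byte value decodes under Latin-1, and the round trip
# back to bytes is lossless.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
assert text.encode("latin-1") == all_bytes


def decode_utf8_or_none(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_utf8_or_none(b"b\xc3\xa1z"))  # bytes that are valid UTF-8
print(decode_utf8_or_none(b"b\xe1z"))      # Latin-1 bytes, rejected by UTF-8
```

What a server should do in the second case (reject the request, fall
back to another charset, or something else) is exactly the open
question.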




> Also shown though that SCRIPT_NAME part has to be UTF-8
> and we would really be entering fantasy land if you were somehow going
> to cope with some different encoding for PATH_INFO and QUERY_STRING.
> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
> particular area means you are effectively bound to use UTF-8
> everywhere else.


I'm not clear on your logic here.  If I request foo/bar/baz (where 
baz actually has an accent over the 'a') in latin-1 encoding, and 
foo/bar is the script, then the (accented) baz is legitimate for 
pass-through to the application, no?


I just tried testing this with Firefox and Apache, and found that you 
can in fact pass such Latin-1 strings through to PATH_INFO, but at 
least in the case of Firefox, you have to %-escape them.  However, 
they are seen by Python (via os.environ) as latin-1 encoded byte strings.




> Further example of why UTF-8 reaches into everything is mod_rewrite
> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
> configuration file has to be UTF-8. If URL isn't, then wouldn't be
> possible to perform matches against non latin-1 characters in a
> rewrite condition or rule. This is because your match string would be
> in different encoded form to that in URL and so wouldn't match.


Note that this still doesn't have any impact on the bytes that 
actually reach the application, which can be non-UTF8.  At minimum, 
the proposal is underspecified as to how to handle this case, which 
is as trivial to generate as stic

Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 10:48 AM 8/4/2009 +1000, Graham Dumpleton wrote:

> 2009/8/4 P.J. Eby :
>> At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:
>>>
>>> Would this be a new PEP or a revision?  I think it should be a new
>>> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
>>> 333 describes that.
>>
>> +1 for a new PEP, since we'd be able to drop a lot of crufty examples and
>> explanations about the cruddy bits.  wsgiref should add 1->2 and 2->1
>> adapters.  (Although technically, running a WSGI 1 application in a WSGI 2
>> server requires either threads or greenlets.)
>>
>> IMO, the main benefit of implementing WSGI 2 is to applications, not
>> servers, with the possible exception of async servers (e.g. Twisted) that
>> would prefer an iterator-only communications mode.  Such servers could
>> refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2->1
>> adapter.
>>
>> Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a
>> standard 1->2 adapter to support WSGI 2.
>
> Personally I don't believe we should be trying to support async
> servers in the WSGI specification.


I'm not suggesting adding anything for async servers; I'm just saying 
that they will likely prefer to use WSGI 2 and use a 2->1 adapter to 
do WSGI 1 support, whereas synchronous servers will likely prefer the reverse.


The WSGI spec doesn't currently require streaming upload support, so 
if an async server wants to buffer the input (e.g. to a temp file) 
rather than trusting the application to handle reads, it's free to do 
so.  (And that's independent of whether it's WSGI 1 or 2 being used.)


___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI 2

2009-08-03 Thread Graham Dumpleton
2009/8/4 Mark Ramm :
>> In summary, just seems more sane to have stuff in WSGI environment be
>> dealt with as UTF-8.
>
> This sounds good to me.   Rack, Jack, and even java servlets seem to
> make this assumption without significant trouble, and if nearly all
> existing web servers do it internally, that seems like an even
> better argument.

What do they do for the response side though? Do they have the
bytes/string distinction that we are talking about, with bytes expected
but strings accepted only if representable as latin-1?
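
For reference, the response-side rule being asked about (bytes
expected, strings accepted only if they encode as latin-1) could be
sketched roughly as below. to_wire_bytes is a hypothetical name, and
raising ValueError is just one way to treat the failure as an error:

```python
def to_wire_bytes(value):
    """Coerce a status line, header value or body chunk to bytes.

    bytes pass through untouched; str is accepted only if every
    character fits in Latin-1; anything else is an error.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        try:
            return value.encode("latin-1")
        except UnicodeEncodeError:
            raise ValueError(
                "string is not representable as latin-1: %r" % value
            ) from None
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```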

Graham


Re: [Web-SIG] WSGI 2

2009-08-03 Thread Mark Ramm
> Personally I don't believe we should be trying to support async
> servers in the WSGI specification. Leave it simple and cater for the
> predominant case rather than make it complicated just to support what
> is going to be a minority deployment. It was async servers that got
> the whole discussion derailed last time. Leave input stream as is now
> as it is a known quantity and shown through actual use to work
> acceptably. Changing to an input iterator in my mind introduces too
> many unknowns around how input buffering is going to behave. In worst
> case you could really screw up performance because of a trickle of
> input coming into an application where no way for an application to
> control block size of what is read.

Yea, someone at work suggested that we should read from the input in a
file-like way, and include a little chained file implementation in
wsgiref, or just point to it in the spec, so people can read the
first 1000 bytes off of the input, and then pass along what they read,
plus the rest of the file in a way that's transparent to the
underlying application.   Makes good sense to me, and I'm pretty sure
I can find Rick's itertools.chain-inspired chained file
implementation.
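
In the meantime, here is a rough sketch of the idea. This is not Rick's
implementation; ChainedInput and peek_input are hypothetical names, and
only read() is covered:

```python
import io


class ChainedInput:
    """File-like wrapper that replays bytes already read, then continues
    with the rest of the underlying stream."""

    def __init__(self, prefix, rest):
        self._prefix = io.BytesIO(prefix)
        self._rest = rest

    def read(self, size=-1):
        if size is None or size < 0:
            return self._prefix.read() + self._rest.read()
        data = self._prefix.read(size)
        if len(data) < size:
            # The replayed part is exhausted; continue with the real stream.
            data += self._rest.read(size - len(data))
        return data


def peek_input(environ, nbytes=1000):
    """Read up to nbytes from wsgi.input, then put a replaying stream back."""
    stream = environ["wsgi.input"]
    head = stream.read(nbytes)
    environ["wsgi.input"] = ChainedInput(head, stream)
    return head
```

A complete version would also need readline(), readlines() and
iteration, since applications may use those on wsgi.input.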

--Mark


Re: [Web-SIG] WSGI 2

2009-08-03 Thread Mark Ramm
> In summary, just seems more sane to have stuff in WSGI environment be
> dealt with as UTF-8.

This sounds good to me.   Rack, Jack, and even java servlets seem to
make this assumption without significant trouble, and if nearly all
existing web servers do it internally, that seems like an even
better argument.

--Mark Ramm


Re: [Web-SIG] WSGI 2

2009-08-03 Thread Graham Dumpleton
2009/8/4 P.J. Eby :
> At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:
>>
>> Would this be a new PEP or a revision?  I think it should be a new
>> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
>> 333 describes that.
>
> +1 for a new PEP, since we'd be able to drop a lot of crufty examples and
> explanations about the cruddy bits.  wsgiref should add 1->2 and 2->1
> adapters.  (Although technically, running a WSGI 1 application in a WSGI 2
> server requires either threads or greenlets.)
>
> IMO, the main benefit of implementing WSGI 2 is to applications, not
> servers, with the possible exception of async servers (e.g. Twisted) that
> would prefer an iterator-only communications mode.  Such servers could
> refactor their WSGI 1 support into a (thread or greenlet-based) WSGI 2->1
> adapter.
>
> Synchronous servers, OTOH, might as well stay WSGI 1, and simply use a
> standard 1->2 adapter to support WSGI 2.

Personally I don't believe we should be trying to support async
servers in the WSGI specification. Leave it simple and cater for the
predominant case rather than make it complicated just to support what
is going to be a minority deployment. It was async servers that got
the whole discussion derailed last time. Leave input stream as is now
as it is a known quantity and shown through actual use to work
acceptably. Changing to an input iterator in my mind introduces too
many unknowns around how input buffering is going to behave. In the worst
case you could really screw up performance because of a trickle of
input coming into an application, with no way for the application to
control the block size of what is read.

Let us find some other way of supporting async servers, but not by
changing WSGI interface itself.

Graham


Re: [Web-SIG] WSGI 2

2009-08-03 Thread Graham Dumpleton
2009/8/4 Ian Bicking :
> So... what about WSGI 2?  Let's not completely drop the ball on this.
> I *think* we were largely in agreement; debate got distracted by some
> async stuff, but I don't think we particularly have to deal with that
> for WSGI 2.  I think we do more than enough if we figure out: WSGI in
> Python 3, i.e., with unicode; some basic errata kind of stuff, like
> readline signature; change the callable signature to remove
> start_response.
>
> Would this be a new PEP or a revision?  I think it should be a new
> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
> 333 describes that.  Is there anyone willing to make the revisions?

But is the intention to skip straight to WSGI 2.0 for Python 3.0, with
start_response() being eliminated, or are we going to provide an amended
WSGI 1.0 for Python 3.0? I can't see how we can avoid the latter and
so we should focus on that first rather than on more fundamental changes
in WSGI 2.0.

In respect of WSGI 1.0 for Python 3.0, I have pretty well come to the
conclusion that where we were heading before is wrong in one area.
I was about to make changes to mod_wsgi in line with what I
believe should be done and just release it without consultation given
that I couldn't see any discussion reaching any conclusion about it
soon. Since you have sent this email I will try one last time to get a
resolution on WSGI 1.0 for Python 3.0. If I can't get one, I guess the
choices are to release the change anyway and provide an implementation
incompatible with what others are guessing should be done, or just rip
all the code out and not support Python 3.0 at all. Either seems
entirely reasonable since there is no WSGI 1.0 specification for
Python 3.0 and the issue again looks to be getting avoided by skipping
to a discussion on WSGI 2.0 instead.

So, for WSGI 1.0 style of interface and Python 3.0, the following is
what I was going to implement.

1. When running under Python 3, applications SHOULD produce bytes
output, status line and headers.

This is effectively what we had before. The only difference is that I
clarify that the 'status line' value should also be bytes. This
wasn't noted before. I had already updated the proposed WSGI 1.0
amendments page to mention this.

2. When running under Python 3, servers and gateways MUST accept
strings for output, status line and headers. Such strings must be
converted to bytes output using 'latin-1'. If a string cannot be
converted then it is treated as an error.

This is again what we had before except that I mention the 'status line' value.

3. When running under Python 3, servers MUST provide wsgi.input as a
binary (byte) input stream.

No change here.

4. When running under Python 3, servers MUST provide a text stream for
wsgi.errors. In converting this to a byte stream for writing to a
file, the default encoding would be applied.

No real change here except to clarify that the default encoding would
apply. Use of the default encoding though could be problematic if
combining different WSGI components. This is because each WSGI
component may have been developed on a system with a different default
encoding and so one may expect to log characters that can't be written
on a different setup. Not sure how you could solve that except to say
people should make the default encoding UTF-8 for portability.

5. When running under Python 3, servers MUST provide CGI HTTP and
server variables as strings. Where such values are sourced from a byte
string, be that a Python byte string or C string, they should be
converted as 'UTF-8'. If a specific web server infrastructure is able
to support different encodings, then the WSGI adapter MAY provide a
way for a user of the WSGI adapter to customise on a global basis, or
on a per value basis what encoding is used, but this is entirely
optional. Note that there is no requirement to deal with RFC 2047.

This is where I am going to diverge from what has been discussed before.

The reason I am going to pass as UTF-8 and not latin-1 is that it
looks like Apache effectively only supports use of UTF-8. Since this
means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
even CGI likely cannot handle anything besides UTF-8 then I really
can't see the point of trying to cater for a theoretical possibility
that some HTTP client could use something besides UTF-8. In other
words, the predominant case will be UTF-8, so let us target that.

So, rather than burden every WSGI application with the need to convert
from latin-1 back to bytes and then to UTF-8, let the server deal with
it, with the server using a sensible default. Where the server
infrastructure can handle a different encoding, it can provide an
option to use that encoding and the WSGI application doesn't need to
change.
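
To show the burden I mean, this is roughly what every application would
have to do under the latin-1 approach. It is a sketch only;
recode_environ_value is a hypothetical helper and the environ dict is
hand-built for the example:

```python
def recode_environ_value(value, encoding="utf-8"):
    """Undo a server's Latin-1 decoding and re-decode with the real charset.

    Assumes the server produced `value` with bytes.decode('latin-1'), so
    encoding it back recovers the original bytes exactly.
    """
    return value.encode("latin-1").decode(encoding)


# The bytes a client sent for "/b\u00e1z", encoded as UTF-8 ...
raw = "/b\u00e1z".encode("utf-8")
# ... as a latin-1-decoding server would hand them to the application.
environ = {"PATH_INFO": raw.decode("latin-1")}

path = recode_environ_value(environ["PATH_INFO"])
print(repr(environ["PATH_INFO"]), "->", repr(path))
```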

Now, the reason why Apache can't really handle anything besides UTF-8
relates to how filenames are encoded in the file system.

Taking Windows first as it is the more obvious case. What Apache does
there is take whatever path it has mapping to a s

Re: [Web-SIG] WSGI 2

2009-08-03 Thread P.J. Eby

At 04:32 PM 8/3/2009 -0500, Ian Bicking wrote:

> Would this be a new PEP or a revision?  I think it should be a new
> PEP, as WSGI 1 remains valid and the same as it always was, and PEP
> 333 describes that.


+1 for a new PEP, since we'd be able to drop a lot of crufty examples 
and explanations about the cruddy bits.  wsgiref should add 1->2 and 
2->1 adapters.  (Although technically, running a WSGI 1 application 
in a WSGI 2 server requires either threads or greenlets.)


IMO, the main benefit of implementing WSGI 2 is to applications, not 
servers, with the possible exception of async servers (e.g. Twisted) 
that would prefer an iterator-only communications mode.  Such servers 
could refactor their WSGI 1 support into a (thread or greenlet-based) 
WSGI 2->1 adapter.


Synchronous servers, OTOH, might as well stay WSGI 1, and simply use 
a standard 1->2 adapter to support WSGI 2.
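
As a sketch of how small that adapter could be: WSGI 2 has no agreed
signature yet, so this assumes a hypothetical one where the application
takes only environ and returns (status, headers, body_iterable). The
names wsgi2_to_wsgi1 and hello are made up for the example:

```python
def wsgi2_to_wsgi1(app2):
    """Wrap a WSGI 2 style application so a WSGI 1 server can call it.

    Assumes the hypothetical signature app2(environ) returning
    (status, headers, body_iterable).
    """
    def app1(environ, start_response):
        status, headers, body = app2(environ)
        start_response(status, headers)
        return body
    return app1


def hello(environ):
    # A WSGI 2 style application under the assumption above.
    return "200 OK", [("Content-Type", "text/plain")], [b"hello\n"]


application = wsgi2_to_wsgi1(hello)
```

The other direction is the harder one, because a WSGI 1 application may
call the write() callable returned by start_response, which is why it
needs threads or greenlets.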





[Web-SIG] WSGI 2

2009-08-03 Thread Ian Bicking
So... what about WSGI 2?  Let's not completely drop the ball on this.
I *think* we were largely in agreement; debate got distracted by some
async stuff, but I don't think we particularly have to deal with that
for WSGI 2.  I think we do more than enough if we figure out: WSGI in
Python 3, i.e., with unicode; some basic errata kind of stuff, like
readline signature; change the callable signature to remove
start_response.

Would this be a new PEP or a revision?  I think it should be a new
PEP, as WSGI 1 remains valid and the same as it always was, and PEP
333 describes that.  Is there anyone willing to make the revisions?

-- 
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker


[Web-SIG] WSGI 2 and SERVER_PROTOCOL

2007-03-30 Thread Robert Brewer
RFC 2145 says:

  "An implementation of HTTP/x.b sending a message to a
   recipient whose version is known to be HTTP/x.a, a < b,
   MUST NOT depend on the recipient understanding a header
   not defined in the specification for HTTP/x.a.  For example,
   HTTP/1.0 clients cannot be expected to understand chunked
   encodings, and so an HTTP/1.1 server must never send
   "Transfer-Encoding: chunked" in response to an HTTP/1.0
   request."

In specific cases, implementations can choose to send some HTTP/1.1
headers to HTTP/1.0 clients, but in the general case, the solution is
usually to downgrade the entire HTTP response to 1.0 features only.

Under WSGI, "an implementation of HTTP/x.b" is an emergent property of
the entire stack; servers, middleware, and applications all share this
responsibility to downgrade the entire response to HTTP/1.0 features if
any of the other components is not HTTP/1.1 compliant.

Unfortunately, the WSGI 1.0 spec doesn't require WSGI servers to tell
WSGI applications what version of HTTP they support. If a WSGI origin
server "fails to satisfy one or more of the MUST or REQUIRED level
requirements for the protocols it implements" (as too many WSGI servers
do!), WSGI applications have no standardized way of knowing this, and
may output headers which contradict the version number output by the
WSGI server.

CherryPy hacks around this by having the origin server send a custom
entry in the WSGI environ called "ACTUAL_SERVER_PROTOCOL", which tells
the rest of the WSGI stack the version for which the origin server is at
least conditionally compliant:

# Compare request and server HTTP protocol versions, in case our
# server does not support the requested protocol. Limit our output
# to min(req, server). We want the following output:
#     request    server     actual written   supported response
#     protocol   protocol  response protocol    feature set
# a     1.0        1.0           1.0                1.0
# b     1.0        1.1           1.1                1.0
# c     1.1        1.0           1.0                1.0
# d     1.1        1.1           1.1                1.1
# Notice that, in (b), the response will be "HTTP/1.1" even though
# the client only understands 1.0. RFC 2616 10.5.6 says we should
# only return 505 if the _major_ version is different.
rp = int(req_protocol[5]), int(req_protocol[7])
sp = int(server.protocol[5]), int(server.protocol[7])
if sp[0] != rp[0]:
    self.simple_response("505 HTTP Version Not Supported")
    return

# Bah. "SERVER_PROTOCOL" is actually the REQUEST protocol.
environ["SERVER_PROTOCOL"] = req_protocol

# set a non-standard environ entry so the WSGI app can know what
# the *real* server protocol is (and what features to support).
# See http://www.faqs.org/rfcs/rfc2145.html.
environ["ACTUAL_SERVER_PROTOCOL"] = server.protocol
self.response_protocol = "HTTP/%s.%s" % min(rp, sp)

The "application-side" bits of CherryPy inspect this value (if present)
and perform the same min(rp, sp) calculation as the server in order to
determine which features to support.
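
A rough sketch of what such an application-side check could look like
follows. It is not CherryPy's actual code: the function names are
hypothetical, and falling back to HTTP/1.0 when the entry is missing is
an assumption of this sketch rather than documented behaviour.

```python
def parse_http_version(protocol):
    """Turn a string such as 'HTTP/1.1' into the tuple (1, 1)."""
    major, minor = protocol.split("/", 1)[1].split(".", 1)
    return int(major), int(minor)


def response_feature_version(environ):
    """Highest HTTP version whose features the response may use."""
    rp = parse_http_version(environ["SERVER_PROTOCOL"])
    # Assumption: without the non-standard entry, treat the server as 1.0.
    sp = parse_http_version(environ.get("ACTUAL_SERVER_PROTOCOL", "HTTP/1.0"))
    return min(rp, sp)


def may_send_chunked(environ):
    # Chunked transfer coding is an HTTP/1.1 feature.
    return response_feature_version(environ) >= (1, 1)
```

Parsing with split() rather than fixed string offsets also copes with
multi-digit version numbers.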

WSGI 2 should, at the least, add a standard environ entry similar to
ACTUAL_SERVER_PROTOCOL. This would provide the minimum enforcement of
full-stack compliance, since WSGI origin servers tend to be the
least-compliant portions of any WSGI stack. As far as I am aware, the
CherryPy 3 wsgiserver is the only one currently claiming to be even
"conditionally compliant" with HTTP/1.1.

WSGI 2 might, in addition, require WSGI origin servers to perform the
min(rp, sp) calculation once and pass the result in a new
"RESPONSE_PROTOCOL_SUPPORT" environ entry. Note this is not necessarily
the same version number as what will be output in the response
Status-Line:

  "An HTTP server SHOULD send a response version equal to
   the highest version for which the server is at least
   conditionally compliant, and whose major version is
   less than or equal to the one received in the request.
   An HTTP server MUST NOT send a version for which it is
   not at least conditionally compliant.  A server MAY
   send a 505 (HTTP Version Not Supported) response if [it]
   cannot send a response using the major version used
   in the client's request."

If a given WSGI application or middleware component is not at least
conditionally compliant with HTTP/1.1, the WSGI origin server should
downgrade the response version it emits in the Status-Line, but has no
standardized way to be informed of this state of affairs. Currently, the
burden tends to fall on those who compose WSGI stacks to manually
instruct the WSGI origin server to always output HTTP/1.0 if any WSGI
component is not conditionally compliant with HTTP/1.1. This issue may
need to be addressed in a separate spec covering the composition of WSGI
stacks.


Robert Brewer
System Architect
Amor Ministries
[EMAIL PROTECTED]
_