Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread And Clover

On 07/14/2010 06:43 AM, Ian Bicking wrote:


> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.


(And of those, PATH_INFO is the only one that really matters, in that 
no-one really uses non-ASCII script filenames, and non-ASCII characters 
in Cookie/Set-Cookie are still handled so differently/brokenly across 
browsers that you can't rely on them at all.)



> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions


For compatibility with existing apps, how about keeping the existing 
SCRIPT_NAME and PATH_INFO as-is (with all their problems), and 
specifying that the new 'raw' versions (whatever they are called) are 
added only if they really are raw, not reconstructed.


Then existing scripts that don't care about non-ASCII and slashes can 
carry on as before, and apps that do care about them will be able to be 
*sure* the input is correct. Or they can fall back to PATH_INFO when the 
raw versions are not present, and avoid producing these kinds of URLs in 
their responses.


(Or an app might have enough special knowledge to try other fallback 
mechanisms when the raw versions are unavailable, such as REQUEST_URI or 
Windows ctypes envvar hacking. But if the server/gateway has good raw 
paths, the app shouldn't bother using these.)
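
A minimal sketch of that fallback, for illustration only. The key name
wsgi.raw_path_info is an assumption (the raw keys have no agreed name yet),
and reading the unquoted bytes as UTF-8 is a choice the app makes, not
something WSGI promises:

```python
from urllib.parse import unquote_to_bytes

RAW_KEY = "wsgi.raw_path_info"  # hypothetical name for the raw, still-quoted path


def path_segments(environ):
    """Return (segments, trusted) for the request path.

    trusted is True only when the server supplied a genuinely raw path,
    so %2F and non-ASCII bytes are known to be intact.
    """
    raw = environ.get(RAW_KEY)
    if raw is not None:
        # Split on literal slashes first, then unquote each segment, so an
        # encoded %2F stays inside its segment.
        segments = [
            unquote_to_bytes(seg).decode("utf-8", "replace")
            for seg in raw.split("/")
        ]
        return segments, True
    # Fallback: the classic, already-decoded value with all its problems.
    return environ.get("PATH_INFO", "").split("/"), False
```

An app that gets trusted=False back can simply avoid generating URLs with
encoded slashes or non-ASCII characters.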


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Friday, July 16, 2010, And Clover  wrote:
> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>
> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.
>
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one 
> really uses non-ASCII script filenames, and non-ASCII characters in 
> Cookie/Set-Cookie are still handled so differently/brokenly across browsers 
> that you can't rely on them at all.)
>
>
> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions
>
>
> For compatibility with existing apps, how about keeping the existing 
> SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying 
> that the new 'raw' versions (whatever they are called) are added only if they 
> really are raw, not reconstructed.
>
> Then existing scripts that don't care about non-ASCII and slashes can carry 
> on as before, and for apps that do care about them, they'll be able to be 
> *sure* the input is correct. Or they can fall back to PATH_INFO when not 
> present, and avoid producing these kinds of URLs in response.
>
> (Or an app might have enough special knowledge to try other fallback 
> mechanisms when the raw versions are unavailable, such as REQUEST_URI or 
> Windows ctypes envvar hacking. But if the server/gateway has good raw paths 
> it shouldn't bother using these.)

Which is exactly what I have suggested in the past. If you do that,
one has to ask the question, given it is more convention than
anything, why it isn't just a x-wsgiorg extension specification like
routing args is rather than a core part of the WSGI specification.
Servers could still implement the extension as they are able to and
don't have to worry about changing core specification then and what we
have now stands.

Graham



Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread And Clover

On 07/16/2010 12:07 PM, Graham Dumpleton wrote:


> If you do that, one has to ask the question, given it is more convention than
> anything, why it isn't just a x-wsgiorg extension specification


Yes, fine by me either way.

I just want to be able to say "this application can use Unicode paths 
when run on a server/gateway that supports ", 
rather than the current mess of "you can have Unicode paths if you use 
one of the dozen different server-and-platform combinations we've 
specifically coded workarounds for".


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 4:33 AM, And Clover  wrote:

> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>  There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
>> and HTTP_COOKIE.
>>
>
> (And of those, PATH_INFO is the only one that really matters, in that
> no-one really uses non-ASCII script filenames, and non-ASCII characters in
> Cookie/Set-Cookie are still handled so differently/brokenly across browsers
> that you can't rely on them at all.)
>
>
>  * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
>> exclusively with encoded versions
>>
>
> For compatibility with existing apps, how about keeping the existing
> SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying
> that the new 'raw' versions (whatever they are called) are added only if
> they really are raw, not reconstructed.
>

Having two ways of expressing the same information will lead to bugs related
to which data is canonical.  If an application is using
SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
weird bugs and code will disagree about which one is correct.  Since %2f can
exist in the raw versions, there isn't even a way to chunk the two variables
in the same way.
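
To make the chunking point concrete, here is a small illustration (the
sample path is made up): once %2F has been decoded, splitting on slashes no
longer yields the same segments as the raw form, so the two representations
cannot be shifted in step.

```python
from urllib.parse import unquote

raw_path = "/files/a%2Fb/edit"    # made-up raw request path
decoded_path = unquote(raw_path)  # roughly what PATH_INFO would carry

raw_segments = raw_path.split("/")
decoded_segments = decoded_path.split("/")

# The raw form has a single segment "a%2Fb"; the decoded form has "a" and "b".
print(raw_segments)
print(decoded_segments)
```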

> Then existing scripts that don't care about non-ASCII and slashes can carry
> on as before, and for apps that do care about them, they'll be able to be
> *sure* the input is correct. Or they can fall back to PATH_INFO when not
> present, and avoid producing these kinds of URLs in response.
>

I don't think it works to imagine you can just not care about non-ASCII.
Requests come in.  WSGI should represent those requests.  If a request comes
in with non-ASCII bytes then WSGI needs to do *something* with it.  I don't
want to have to configure servers with application policy; servers should
just work.

And this doesn't help with Python 3: either we have byte values of
SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think
bytes will be more awkward to port to than text, and inconsistent with other
WSGI values.  If we have text then we have to choose an encoding.  Latin1
will work, but it will be the exact wrong encoding most of the time as UTF-8
is the typical encoding (unlike other headers, where Latin1 will mostly be an okay
encoding, or as good a guess as we have).  If we firmly remove these keys
then we can avoid this choice entirely... and we conveniently also get a
better representation of the request.
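
For anyone who wants to see what the Latin1 choice means in practice, a tiny
sketch (the sample path is made up): decoding as Latin1 never fails and loses
nothing, but the text is wrong until the app re-encodes it and decodes it
again with the encoding it actually expects.

```python
wire_bytes = "/caf\u00e9".encode("utf-8")  # bytes as sent by a UTF-8 client

# Latin1 maps every byte to exactly one code point, so this cannot fail...
as_latin1_text = wire_bytes.decode("latin-1")

# ...but the result is mojibake until the app undoes the decoding.
recovered = as_latin1_text.encode("latin-1").decode("utf-8")
```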

Note that libraries can smooth over this change; WebOb for instance will
certainly still support req.script_name/req.path_info by decoding the raw
values.  Admittedly lots of code use these values directly... but at least
if they get a KeyError the port/fix will be obvious (as opposed to out of
sync values, which will only emerge as a problem occasionally -- I'd rather
not invite more occasional bugs).

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Chris McDonough
On Fri, 2010-07-16 at 11:07 -0500, Ian Bicking wrote:

> And this doesn't help with Python 3: either we have byte values of
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
> think bytes will be more awkward to port to than text, and
> inconsistent with other WSGI values.  If we have text then we have to
> choose an encoding.  Latin1 will work, but it will be the exact wrong
> encoding most of the time as UTF-8 is the typical encoding (unlike other
> headers, where Latin1 will mostly be an okay encoding, or as good a
> guess as we have).  If we firmly remove these keys then we can avoid
> this choice entirely... and we conveniently also get a better
> representation of the request.

My $.02: I'd rather lobby the core folks for a string ABC (which we can
hook with a stringlike bytes type) and consider all 3.X releases made so
far "dead to WSGI" than to have to tunnel arbitrary bytes through some
misleading Unicode encoding.

- C




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 12:28 PM, Chris McDonough  wrote:

> On Fri, 2010-07-16 at 11:07 -0500, Ian Bicking wrote:
>
> > And this doesn't help with Python 3: either we have byte values of
> > SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
> > think bytes will be more awkward to port to than text, and
> > inconsistent with other WSGI values.  If we have text then we have to
> > choose an encoding.  Latin1 will work, but it will be the exact wrong
> > encoding most of the time as UTF-8 is the typical encoding (unlike other
> > headers, where Latin1 will mostly be an okay encoding, or as good a
> > guess as we have).  If we firmly remove these keys then we can avoid
> > this choice entirely... and we conveniently also get a better
> > representation of the request.
>
> My $.02: I'd rather lobby the core folks for a string ABC (which we can
> hook with a stringlike bytes type) and consider all 3.X releases made so
> far "dead to WSGI" than to have to tunnel arbitrary bytes through some
> misleading Unicode encoding.
>

While I think it would be generally useful, it's also a long way off at
best, with serious performance dangers that could torpedo the whole thing.
But... I'm also unsure how it would help here, except perhaps we could
incrementally annotate bytes with an encoding?  Well, I don't really know.
Treating the raw request path as text is easy enough, as it should always be
ASCII anyway.  We don't have to worry what is "right" or "wrong" in this
case.

We could make everything bytes and be done with it, but it would make it
much harder to port Python 2 WSGI code to Python 3.

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread P.J. Eby

At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
> And this doesn't help with Python 3: either we have byte values of 
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I 
> think bytes will be more awkward to port to than text, and 
> inconsistent with other WSGI values.


OTOH, it has the tremendous advantage of pushing the encoding 
question onto the app (or framework) developer...  who's really the 
only one who can make the right decision for their particular 
application.  And personally, I'd rather have clear boundaries 
between text and bytes, such that porting (even if tedious or 
awkward) is *consistent*, and clear as to when you're finished, not, 
"oh, did I check to make sure I converted SCRIPT_NAME and 
PATH_INFO...  not just in my app code, but in all the library code I 
call *from* my app?"


IOW, the bytes/string discussion on Python-dev has kind of led me to 
realize that we might just as well make the *entire* stack bytes 
(incoming and outgoing headers *and* streams), and rewrite that bit 
in PEP 333 about using str on "Python 3000" to say we go with bytes 
on Python 3+ for everything that's a str in today's WSGI.


Or, to put it another way, if I knew then what I know *now*, I think 
I'd have written the PEP the other way around, such that the use of 
'str' in WSGI would be a substitute for the future 'bytes' type, 
rather than viewing some byte strings as a forward-compatible 
substitute for Py3K unicode strings.


Of course, this would be a WSGI 2 change, but IMO we're better off 
making a clean break with backward compatibility here anyway, rather 
than having conditionals.  Also, going with bytes everywhere means we 
don't have to rename SCRIPT_NAME and PATH_INFO, which in turn avoids 
deeper rewrites being required in today's apps.
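
As a rough sketch of what "bytes everywhere" might look like from the app
side: this assumes environ keys stay native strings while values, status and
headers are bytes, which is only one possible reading. The "try UTF-8, then
Latin1" policy is likewise just an example of a decision the app would own.

```python
def application(environ, start_response):
    raw_path = environ["PATH_INFO"]  # assumed to arrive as bytes

    # The app, not the server, decides how to interpret the path.
    try:
        path = raw_path.decode("utf-8")
    except UnicodeDecodeError:
        path = raw_path.decode("latin-1")

    body = ("You asked for " + path).encode("utf-8")
    start_response(
        b"200 OK",
        [(b"Content-Type", b"text/plain; charset=utf-8"),
         (b"Content-Length", str(len(body)).encode("ascii"))],
    )
    return [body]
```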


(Hm.  Although actually, I suppose we *could* just borrow the time 
machine and pretend that WSGI called for "byte-strings everywhere" 
all along...)




[Web-SIG] Unicode fundamentals

2010-07-16 Thread travis+ml-python-web
BTW, if you're a noob like me and can't follow the Unicode stuff,
I once read this:

http://www.joelonsoftware.com/articles/Unicode.html

I need to read it again before commenting, but I seem to recall it
being edifying, if not particularly memorable. ;-)
-- 
A Weapon of Mass Construction
My emails do not have attachments; it's a digital signature that your mail
program doesn't understand. | http://www.subspacefield.org/~travis/ 
If you are a spammer, please email j...@subspacefield.org to get blacklisted.




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Stephan Richter
On Friday, July 16, 2010, Ian Bicking wrote:
> We could make everything bytes and be done with it, but it would make it
> much harder to port Python 2 WSGI code to Python 3.

I think this might be best having seen all of the discussion. One could easily 
write a compatibility middleware that makes porting Python 2 applications easy 
or even completely transparent (from a WSGI spec point of view).
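
A minimal sketch of such a middleware, to show it is not much code. The
class name TextCompat is made up, Latin1 is assumed as the tunnelling
encoding, and only the environ values, status and headers are handled here
(bodies are left alone since they are bytes either way):

```python
class TextCompat:
    """Wrap a text-based app so it can run on an all-bytes gateway."""

    def __init__(self, app, encoding="latin-1"):
        self.app = app
        self.encoding = encoding

    def __call__(self, environ, start_response):
        enc = self.encoding
        # Decode every bytes value; leave streams and other objects untouched.
        text_environ = {
            key: value.decode(enc) if isinstance(value, bytes) else value
            for key, value in environ.items()
        }

        def text_start_response(status, headers, exc_info=None):
            # Encode what the wrapped app produced back into bytes.
            byte_headers = [(n.encode(enc), v.encode(enc)) for n, v in headers]
            return start_response(status.encode(enc), byte_headers, exc_info)

        return self.app(text_environ, text_start_response)
```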

Regards,
Stephan
-- 
Entrepreneur and Software Geek
Google me. "Zope Stephan Richter"


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Gustavo Narea
Hello,

Ian said:
> Having two ways of expressing the same information will lead to bugs
> related to which data is canonical.  If an application is using
> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
> weird bugs and code will disagree about which one is correct.  Since %2f
> can exist in the raw versions, there isn't even a way to chunk the two
> variables in the same way.

I can't agree more.

I would propose the following, and excuse me in advance if this has already 
been proposed and discarded -- I've tried to follow this topic on the mailing 
list over the past few months, until it became an endless discussion.

I think only the raw values should be available. Even if a middleware changes 
them, it must put them with raw values. And because you cannot change those 
values without knowing what encoding the request uses, the character encoding 
*must* be present.

I know that sounds easy but it's not, because browsers don't specify the 
charset in the Content-Type and instead they generate a new request using the 
charset from the previous response. So the charset is unknown to the 
server/gateway and the middleware stack.

So, what we could do is introduce a mandatory variable called, say, 
wsgi.charset, and would be used as follows:
 - It MUST be set by the server or gateway on every request.
 - Every middleware or application that reads or writes these values MUST use 
the charset specified in wsgi.charset.
 - If a server, gateway, middleware or application wants to change the charset 
and it is possible*, it MUST convert the *entire* request into that charset 
and update wsgi.charset accordingly.
 - When the charset is not specified in the HTTP request, UTF-8 MUST be 
assumed by the server/gateway, unless another default charset has been 
specified by the user.

I think/hope that will solve all the problems.
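
A rough sketch of how a server or gateway might fill in the variable under
these rules. The function names are made up, and the Content-Type parsing is
deliberately naive:

```python
def request_charset(environ, default="utf-8"):
    """Pick the charset from Content-Type, else fall back to the default."""
    content_type = environ.get("CONTENT_TYPE", "")
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()
    return default


def add_charset(environ, default="utf-8"):
    # wsgi.charset is the key proposed in this message, not an existing one.
    environ["wsgi.charset"] = request_charset(environ, default)
    return environ
```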

What happens when a WSGI application is actually made up of two WSGI applications 
and they send the responses in different charsets? If it's not possible to 
configure them so that they both use the same charsets, then one of them would 
have to be wrapped by a middleware which:
 - On egress, converts the responses using the charset used by the other 
application.
 - On ingress, if the charset is not specified in the request, it will assume 
it's the one used by the other application, and thus it will convert the 
request using the charset supported by the wrapped application.

It would look like this:
===
import trac.web.main

def application(environ, start_response):
    # CharsetNormalizer and myapp are illustrative; neither is defined here.
    if environ.get("PATH_INFO", "").startswith("/trac/"):
        # Say Trac only supports Latin-1 and we want responses to use UTF-8:
        app = trac.web.main.dispatch_request
        app = CharsetNormalizer(app, response="latin-1", request="utf8")
    else:
        # myapp uses UTF-8
        app = myapp
    return app(environ, start_response)
===

Then there's the string vs bytes issue. Bytes would be the natural choice to 
represent these raw values, but they would probably cause more trouble than they 
solve. So, I think they should be strings that contain the ASCII raw 
encoded values (i.e., str on both versions of Python).

What do you think about this? Again, sorry if this has been discarded before! 
:)

* For example, you can always convert Latin-1 to UTF-8, but not every UTF-8 
string can be converted to Latin-1.
-- 
Gustavo Narea .
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby  wrote:

> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think
>> bytes will be more awkward to port to than text, and inconsistent with other
>> WSGI values.
>>
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto
> the app (or framework) developer...  who's really the only one who can make
> the right decision for their particular application.  And personally, I'd
> rather have clear boundaries between text and bytes, such that porting (even
> if tedious or awkward) is *consistent*, and clear as to when you're
> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and
> PATH_INFO...  not just in my app code, but in all the library code I call
> *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to
> realize that we might just as well make the *entire* stack bytes (incoming
> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
> using str on "Python 3000" to say we go with bytes on Python 3+ for
> everything that's a str in today's WSGI.
>

This was my first intuition too, until I started thinking in more detail
about the particular values involved.  Some obviously are textish, like
environ['SERVER_NAME'].  Not a very useful value, but definitely text.

Basically all the internal strings are textish, so we're left with:

wsgi.url_scheme
SCRIPT_NAME/PATH_INFO
QUERY_STRING
HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
response status
response headers (name and value)

And there's a few things like REMOTE_USER that are kind of in the middle.
Everyone is in agreement that bodies should be bytes.

One initial problem is that the Python 3 stdlib handles bytes poorly, so for
instance there's no good way to reconstruct the URL using the stdlib.  That
explains certain tensions, but I think we should ignore that, and in fact
that's what Python-Dev seemed to say pretty clearly.

Now, the other keys:

wsgi.url_scheme: clearly ASCII

SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
legacy encoding.
raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL
encoding happens at the byte layer, so a server could reasonably URL encode
any non-ASCII characters without imposing any encoding.

QUERY_STRING: should be ASCII, same as raw request path

headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
the specification.  The spec also implies you have to use the RFC2047 inline
encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
supporting it would probably be a bad idea for security reasons.  The
Atompub spec (reasonably modern) specifically says Title headers should be
encoded with RFC2047 (if they are not ISO-8859-1):
http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
decoding this kind of encoding at the application layer seems reasonable to
me.

cookie header: this specific header can easily have multiple encodings, as
the browser encodes data then treats it as opaque bytes, so a cookie can be
set via UTF-8 one place, Latin1 another, and those coexist in one header.
That is, there is no real encoding and this should be treated as bytes.
(Latin1 is an approximation of bytes... a spotty way to treat bytes, but
entirely workable.)

response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
practice it is almost always ASCII, and since it is not user-visible it's
not something that really needs localization.

response headers: the spec implies Latin1, in practice the Set-Cookie header
is bytes (since interoperation with wonky legacy systems is not uncommon).
I'm not sure of any other exceptions?
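
Two of the points above can be sketched in a few lines; both helper names
are made up. The first shows URL encoding at the byte layer, which needs no
knowledge of the original encoding. The second shows RFC2047 decoding done
at the application layer with the stdlib email package, assuming Latin1 for
any part that carries no charset:

```python
from email.header import decode_header


def quote_non_ascii(raw: bytes) -> str:
    """Percent-encode only bytes >= 0x80; ASCII bytes pass through as-is."""
    return "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in raw)


def decode_rfc2047(value: str, default: str = "latin-1") -> str:
    """Decode RFC2047 encoded-words in a header value the app chose to trust."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or default)
        parts.append(chunk)
    return "".join(parts)
```

Note that quote_non_ascii gives a pure-ASCII path whether the client sent
UTF-8 or some legacy encoding; the guess about which one it was is left to
whoever unquotes it later.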


So... to me it seems pretty reasonable for HTTP specifically that text can
work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
should be in that mode.  And it would also be weird if
environ['SERVER_NAME'] was bytes.

In the past when we've gotten down to specifics, the only holdup has been
SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Gustavo Narea
Gustavo said:
>  - On ingress, if the charset is not specified in the request, it will
> assume  it's the one used by the other application, and thus it will
> convert the request using the charset supported by the wrapped
> application.

That should actually be:

"On ingress, if the charset in wsgi.charset differs from the charset supported 
by the wrapped application, the request will be converted into the charset 
supported by the wrapped application."
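
For the path at least, that conversion might look roughly like this. It is
only a sketch: the function name is made up, and it assumes the raw path is
a percent-encoded ASCII string as proposed earlier.

```python
from urllib.parse import quote, unquote_to_bytes


def convert_raw_path(raw_path, src, dst):
    """Re-encode a percent-encoded path from charset src to charset dst."""
    segments = []
    for seg in raw_path.split("/"):
        # Work per segment so an encoded %2F is never mistaken for a separator.
        text = unquote_to_bytes(seg).decode(src)
        # Raises UnicodeEncodeError if dst cannot represent the text.
        segments.append(quote(text.encode(dst), safe=""))
    return "/".join(segments)
```

The UnicodeEncodeError case is the "if it is possible" caveat: any Latin-1
path converts to UTF-8, but not every UTF-8 path converts to Latin-1.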
-- 
Gustavo Narea .
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Tres Seaver

P.J. Eby wrote:

> (Hm.  Although actually, I suppose we *could* just borrow the time 
> machine and pretend that WSGI called for "byte-strings everywhere" 
> all along...)

I like the idea of pushing responsibility for decoding stuff into the
framework / app writer's hands.  OTOH, doesn't that hose authors of
existing middleware, due to the borkedness of working with bytes in Python3?


Tres.
-- 
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   "Excellence by Design"http://palladion.com


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Tres Seaver

Ian Bicking wrote:

>> IOW, the bytes/string discussion on Python-dev has kind of led me to
>> realize that we might just as well make the *entire* stack bytes (incoming
>> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
>> using str on "Python 3000" to say we go with bytes on Python 3+ for
>> everything that's a str in today's WSGI.
>>
> 
> This was my first intuition too, until I started thinking in more detail
> about the particular values involved.  Some obviously are textish, like
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
> 
> Basically all the internal strings are textish, so we're left with:

What do you mean by "internal"?  Anything in the headers or the CGI
environment is intrinsically "bytes-ish" to me.  Do you mean that you
want application programmers to have them transparently decoded?  If so,
we can make that the responsibility of the non-middleware framework /
application.

> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
> 
> And there's a few things like REMOTE_USER that are kind of in the middle.
> Everyone is in agreement that bodies should be bytes.
> 
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
> instance there's no good way to reconstruct the URL using the stdlib.  That
> explains certain tensions, but I think we should ignore that, and in fact
> that's what Python-Dev seemed to say pretty clearly.

python-dev seems to me to be coming to the realization that they should
have tried harder to make real-world apps work before they froze their
choices.

> Now, the other keys:
> 
> wsgi.url_scheme: clearly ASCII
> 
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL
> encoding happens at the byte layer, so a server could reasonably URL encode
> any non-ASCII characters without imposing any encoding.
> 
> QUERY_STRING: should be ASCII, same as raw request path
> 
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
> the specification.  The spec also implies you have to use the RFC2047 inline
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
> supporting it would probably be a bad idea for security reasons.  The
> Atompub spec (reasonably modern) specifically says Title headers should be
> encoded with RFC2047 (if they are not ISO-8859-1):
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
> decoding this kind of encoding at the application layer seems reasonable to
> me.
> 
> cookie header: this specific header can easily have multiple encodings, as
> the browser encodes data then treats it as opaque bytes, so a cookie can be
> set via UTF-8 one place, Latin1 another, and those coexist in one header.
> That is, there is no real encoding and this should be treated as bytes.
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
> entirely workable.)
> 
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
> practice it is almost always ASCII, and since it is not user-visible it's
> not something that really needs localization.
> 
> response headers: the spec implies Latin1, in practice the Set-Cookie header
> is bytes (since interoperation with wonky legacy systems is not uncommon).
> I'm not sure of any other exceptions?
> 
> 
> So... to me it seems pretty reasonable for HTTP specifically that text can
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
> should be in that mode.  And it would also be weird if
> environ['SERVER_NAME'] was bytes.


> In the past when we've gotten down to specifics, the only holdup has been
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

I think I favor PJE's suggestion:  let WSGI deal only in bytes.



Tres.
-- 
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   "Excellence by Design"http://palladion.com


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 4:47 PM, Tres Seaver  wrote:

>  > Basically all the internal strings are textish, so we're left with:
>
> What do you mean by "internal"?  Anything in the headers or the CGI
> environment is intrinsically "bytes-ish" to me.  Do you mean that you
> want application programmers to have them transparently decoded?  If so,
> we can make that the responsibility of the non-middleware framework /
> application.
>

By internal I mean all the CGI variables that aren't representing HTTP, like
SERVER_NAME.

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Chris McDonough
On Fri, 2010-07-16 at 17:47 -0400, Tres Seaver wrote:

> > In the past when we've gotten down to specifics, the only holdup has been
> > SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
> 
> I think I favor PJE's suggestion:  let WSGI deal only in bytes.

I'd prefer that WSGI 2 was defined in terms of a "bytes with benefits"
type (Python 2's ``str`` with an optional encoding attribute as a hint
for cast to unicode str) instead of Python 3-style bytes.

But if I had to make the Hobson's choice between Python 3 style bytes
and Python 3 style str, I'd choose bytes.  If I then needed to write
middleware or applications, I'd use WebOb or an equivalent library to
enable a policy which converted those bytes to strings on my behalf.
Making it easy to write "raw" middleware or applications without using
such a library doesn't seem as compelling a goal as being able to easily
write one which allowed me direct control at the raw level.
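
To give a feel for the "bytes with benefits" idea, here is a very rough
sketch written as a bytes subclass. The class name, the to_text() method and
the Latin1 default are all made up for illustration; nothing like this
exists in the stdlib.

```python
class HintedBytes(bytes):
    """bytes plus an optional encoding hint for conversion to text."""

    def __new__(cls, value, encoding=None):
        self = super().__new__(cls, value)
        self.encoding = encoding  # None means "nobody knows"
        return self

    def to_text(self, default="latin-1"):
        # Use the hint when the server supplied one, else the caller's default.
        return self.decode(self.encoding or default)


path = HintedBytes(b"/caf\xc3\xa9", encoding="utf-8")
```

The value still behaves as plain bytes for code that wants raw control, and
only code that asks for text ever sees the hint applied.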

- C




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 5:08 PM, Chris McDonough  wrote:

> On Fri, 2010-07-16 at 17:47 -0400, Tres Seaver wrote:
>
> > > In the past when we've gotten down to specifics, the only holdup has been
> > > SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
> >
> > I think I favor PJE's suggestion:  let WSGI deal only in bytes.
>
> I'd prefer that WSGI 2 was defined in terms of a "bytes with benefits"
> type (Python 2's ``str`` with an optional encoding attribute as a hint
> for cast to unicode str) instead of Python 3-style bytes.
>
> But if I had to make the Hobson's choice between Python 3 style bytes
> and Python 3 style str, I'd choose bytes.  If I then needed to write
> middleware or applications, I'd use WebOb or an equivalent library to
> enable a policy which converted those bytes to strings on my behalf.
> Making it easy to write "raw" middleware or applications without using
> such a library doesn't seem as compelling a goal as being able to easily
> write one which allowed me direct control at the raw level.
>

What are the concrete problems you envision with text request headers, text
(URL-quoted) path, and text response status and headers?

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 5:06 PM, Ian Bicking  wrote:

> On Fri, Jul 16, 2010 at 4:47 PM, Tres Seaver wrote:
>
>>  > Basically all the internal strings are textish, so we're left with:
>>
>> What do you mean by "internal"?  Anything in the headers or the CGI
>> environment is intrinsically "bytes-ish" to me.  Do you mean that you
>> want application programmers to have them transparently decoded?  If so,
>> we can make that the responsibility of the non-middleware framework /
>> application.
>>
>
> By internal I mean all the CGI variables that aren't representing HTTP,
> like SERVER_NAME.
>

Actually I was thinking SERVER_SOFTWARE, though SERVER_NAME is somewhat
similar as it doesn't come from HTTP, it comes from server configuration.

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Chris McDonough
On Fri, 2010-07-16 at 17:11 -0500, Ian Bicking wrote:
> On Fri, Jul 16, 2010 at 5:08 PM, Chris McDonough 
> wrote:
> On Fri, 2010-07-16 at 17:47 -0400, Tres Seaver wrote:
> 
> > > In the past when we've gotten down to specifics, the only
> holdup has been
> > > SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate
> those.
> >
> > I think I favor PJE's suggestion:  let WSGI deal only in
> bytes.
> 
> 
> I'd prefer that WSGI 2 was defined in terms of a "bytes with
> benefits"
> type (Python 2's ``str`` with an optional encoding attribute
> as a hint
> for cast to unicode str) instead of Python 3-style bytes.
> 
> But if I had to make the Hobson's choice between Python 3
> style bytes
> and Python 3 style str, I'd choose bytes.  If I then needed to
> write
> middleware or applications, I'd use WebOb or an equivalent
> library to
> enable a policy which converted those bytes to strings on my
> behalf.
> Making it easy to write "raw" middleware or applications
> without using
> such a library doesn't seem as compelling a goal as being able
> to easily
> write one which allowed me direct control at the raw level.
> 
> What are the concrete problems you envision with text request headers,
> text (URL-quoted) path, and text response status and headers?

Documentation is the main reason.  For example, the documentation for
making sense of path_info segments in a WSGI that used unicodey-strings
would, as I understand it, read something like this:

"""
The PATH_INFO environment variable is a string.  To decode it,

- First, split it on slashes::

    segments = PATH_INFO.split('/')

- Then turn each segment into bytes::

    bytes_segments = [ bytes(x, encoding='latin-1') for x in segments ]

- Then, de-encode each segment's urlencoded portions::

    urldecoded_segments = [ urllib.unquote(x) for x in bytes_segments ]

- Then re-encode each urldecoded segment into the encoding expected
  by your application::

    app_segments = [ str(x, encoding='utf-8') for x in
                     urldecoded_segments ]

.. note:: We decode from latin-1 above because WSGI tunnels the bytes
   representing the PATH_INFO by way of a string type which contains bytes
   as characters.
"""

That looks pretty apologetic to me, and to be honest, I'm not even sure
it will work reliably in the face of existing/legacy applications which
have emitted URLs that are not url-encoded properly if those old URLs
need to be supported.   http://bugs.python.org/issue8136 contains a
variation on this theme.
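
For reference, the documented steps can be sketched as runnable Python 3.
This is only an illustration under assumptions: it treats PATH_INFO as
still carrying %XX escapes (as the steps above do), it uses
urllib.parse.unquote_to_bytes where the text says urllib.unquote, and
decode_tunnelled_path is a made-up helper name.

```python
from urllib.parse import unquote_to_bytes


def decode_tunnelled_path(path_info, app_encoding="utf-8"):
    """Split a latin-1 'tunnelled' PATH_INFO and decode each segment."""
    segments = path_info.split("/")
    # Recover the original bytes: each character stands for exactly one byte.
    byte_segments = [s.encode("latin-1") for s in segments]
    # Undo %XX escapes; unquote_to_bytes accepts bytes and returns bytes.
    unquoted = [unquote_to_bytes(b) for b in byte_segments]
    # Decode with the encoding the application actually expects.
    return [u.decode(app_encoding) for u in unquoted]


# UTF-8 bytes for "/café/a%2Fb", tunnelled through latin-1 as a server might.
raw = "/caf\u00e9/a%2Fb".encode("utf-8").decode("latin-1")
print(decode_tunnelled_path(raw))
```

The escaped slash in the last segment survives as part of the segment
because splitting happens before unquoting.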

I'd much rather be able to say:

"""
The PATH_INFO environment variable is a ``bytes-with-benefits`` type.
To decode it:

- First, split it on slashes::

    segments = PATH_INFO.split('/')

- Then, de-encode each segment's urlencoded portions::

    urldecoded_segments = [ urllib.unquote(x) for x in segments ]

- Then re-encode each urldecoded segment into the encoding expected
  by your application::

    app_segments = [ str(x, encoding='utf-8') for x in
                     urldecoded_segments ]
"""

Let me know if I'm missing something.
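
With plain Python 3 bytes standing in for the proposed type (an
assumption, since no bytes-with-benefits type exists), the shorter recipe
might look roughly like this. decode_bytes_path is a hypothetical name,
and the separator has to be b'/' rather than '/'.

```python
from urllib.parse import unquote_to_bytes


def decode_bytes_path(path_info, app_encoding="utf-8"):
    """Decode a bytes PATH_INFO segment by segment."""
    # Plain bytes can only be split on a bytes separator.
    segments = path_info.split(b"/")
    # Unquote each segment to bytes, then decode for the application.
    return [unquote_to_bytes(s).decode(app_encoding) for s in segments]


print(decode_bytes_path(b"/caf%C3%A9/a%2Fb"))
```

The latin-1 round trip disappears, which is the documentation saving being
argued for here.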

- C





Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread P.J. Eby

At 02:28 PM 7/16/2010 -0500, Ian Bicking wrote:
> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <p...@telecommunity.com> wrote:
>
> > At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
> > > And this doesn't help with Python 3: either we have byte values of
> > > SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
> > > think bytes will be more awkward to port to than text, and
> > > inconsistent with other WSGI values.
> >
> > OTOH, it has the tremendous advantage of pushing the encoding
> > question onto the app (or framework) developer... who's really the
> > only one who can make the right decision for their particular
> > application.  And personally, I'd rather have clear boundaries
> > between text and bytes, such that porting (even if tedious or
> > awkward) is *consistent*, and clear as to when you're finished, not,
> > "oh, did I check to make sure I converted SCRIPT_NAME and
> > PATH_INFO... not just in my app code, but in all the library code
> > I call *from* my app?"
> >
> > IOW, the bytes/string discussion on Python-dev has kind of led me to
> > realize that we might just as well make the *entire* stack bytes
> > (incoming and outgoing headers *and* streams), and rewrite that bit
> > in PEP 333 about using str on "Python 3000" to say we go with bytes
> > on Python 3+ for everything that's a str in today's WSGI.
>
> This was my first intuition too, until I started thinking in more
> detail about the particular values involved.  Some obviously are
> textish, like environ['SERVER_NAME'].  Not a very useful value, but
> definitely text.
>
> Basically all the internal strings are textish, so we're left with:
>
> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)

What I'm getting at, though, is it's precisely this sort of "hm, 
which ones are bytes again?" stuff that makes you have to stop and 
*think*, i.e., it doesn't Fit My Brain any more.  ;-)


There should be one, and preferably *only* one, obvious way to do it.

And given that HTTP is inherently a bunch of bytes, bytes is the one 
obvious way.


I previously was under the impression that bytes wouldn't 
interoperate with strings in 3.x, but they *do*, in much the same way 
as they did in 2.x.  That means you'll be (mostly) bug-compatible in 
3.x, only you'll likely encounter encoding issues *sooner*, rather 
than later.  (i.e., the minute you combine non-ASCII inputs with your 
regular string constants).
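
The interoperability point is easy to probe on whatever interpreter is at
hand. The snippet below is only a quick check, and try_concat is a
hypothetical helper. On the Python 3 releases I know of, concatenating
bytes with str raises TypeError whatever the content, so the failure
arrives even sooner than the non-ASCII case described above.

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception type raised."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__


# bytes with bytes concatenates as expected.
print(try_concat(b"/app", b"/index"))
# bytes with str is rejected outright, even for pure-ASCII content.
print(try_concat(b"/app", "/index"))
```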


Yes, you will also be forced to convert your return values to bytes, 
but if you've used string constants *anywhere*, then you know you'll 
be outputting text, which you should already have been encoding for 
output.  (So you'll just be forced to deal with errors on that side 
sooner as well.)


All in all, I'd say this also fits with what people on Python-Dev 
keep hammering on as the One Obvious Way to deal with bytes and 
strings in a program: i.e., bytes for I/O, text for text processing.


WSGI is HTTP, and HTTP is I/O, ergo, WSGI is I/O, and we should 
therefore "byte" the bullet here.  ;-)




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread P.J. Eby

At 05:42 PM 7/16/2010 -0400, Tres Seaver wrote:
> P.J. Eby wrote:
>
> > (Hm.  Although actually, I suppose we *could* just borrow the time
> > machine and pretend that WSGI called for "byte-strings everywhere"
> > all along...)
>
> I like the idea of pushing responsibility for decoding stuff into the
> framework / app writer's hands.  OTOH, doesn't that hose authors of
> existing middleware, due to the borkedness of working with bytes in Python3?


It only creates a "new" problem if they are currently not using *any* 
unicode in 2.x, and are passing through bytes from the input to the 
output without any encoding or decoding.  AFAICT, if any part of 
their app is currently unicode, they would have the same problems in 2.x.


(Minus, of course, any problems introduced by missing bytes methods 
in 3.x, or the fact that single-subscripted bytes are ints rather 
than bytestrings.)
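
As a small illustration of the subscripting difference mentioned above
(the variable names are arbitrary): indexing a Python 3 bytes object
yields an int, while slicing keeps the bytes type.

```python
path = b"/app"

first = path[0]          # an int: the byte value of "/"
first_slice = path[:1]   # slicing still gives bytes

print(first, first_slice)
# Code ported from Python 2 that compared path[0] == "/" needs something
# like this instead:
print(path.startswith(b"/"))
```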


Anyway, the problems introduced will be problems that can be solved 
by waving a fairly standard set of dead chickens at the problem, i.e. 
picking where you're going to encode/decode, and deciding what 
encoding(s) are meaningful to your app.  And frameworks that already 
have a unicode API are ahead of the game here.


So, AFAICT, the only people who'd be punished by a change to bytes 
are the people who have non-ASCII inputs or outputs, but haven't been 
using unicode (because 2to3 will convert them to using strings 
instead of bytes).


From what I can tell, though, this is also the group it's most 
politically correct to hate on in Python-Dev, so we should be 
relatively safe in shifting the burden to them.  ;-)




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread P.J. Eby

At 07:20 PM 7/16/2010 -0400, Chris McDonough wrote:
> I'd much rather be able to say:
>
> """
> The PATH_INFO environment variable is a ``bytes-with-benefits`` type.
> To decode it:
>
> - First, split it on slashes::
>
>     segments = PATH_INFO.split('/')
>
> - Then, de-encode each segment's urlencoded portions::
>
>     urldecoded_segments = [ urllib.unquote(x) for x in segments ]
>
> - Then re-encode each urldecoded segment into the encoding expected
>   by your application::
>
>     app_segments = [ str(x, encoding='utf-8') for x in
>                      urldecoded_segments ]
> """


+1.  I do wish we actually *had* a bytes-with-benefits type (as I 
proposed on Python-Dev), but I don't think we can really get one 
until the language moratorium is over.  Plain old bytes are the next 
best thing. 




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Armin Ronacher

Hi,

On 7/17/10 1:20 AM, Chris McDonough wrote:
> Let me know if I'm missing something.
The only thing you miss is that the bytes type of Python 3 is badly
supported in the stdlib (not an issue if we reimplement everything in
our libraries, not an issue for me) and that the bytes type has no
string formatting, which makes us do the encode/decode dance in our own
implementations of the missing stdlib functions.


So I am pretty sure we can't totally bypass the encoding/decoding.  We 
might however require less encodes/decodes if we leave bytes on the WSGI 
layer.
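
The "encode/decode dance" might look like the sketch below, assuming an
interpreter where bytes has no formatting support (the situation
described above). status_line is a hypothetical helper, and latin-1 is
just one plausible choice of tunnelling encoding.

```python
def status_line(code, reason, encoding="latin-1"):
    """Build a status line as bytes by formatting text, then encoding."""
    # Decode the bytes input so that str formatting can be used...
    text = "{0} {1}".format(code, reason.decode(encoding))
    # ...then encode the result again for the wire.
    return text.encode(encoding)


print(status_line(404, b"Not Found"))
```

One decode and one encode per formatted value is the overhead in
question. Keeping bytes at the WSGI layer does not remove it, but it
keeps the dance inside the library doing the formatting.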



Regards,
Armin


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Tres Seaver

P.J. Eby wrote:
> At 07:20 PM 7/16/2010 -0400, Chris McDonough wrote:
>> I'd much rather be able to say:
>>
>> """
>> The PATH_INFO environment variable is a ``bytes-with-benefits`` type.
>> To decode it:
>>
>> - First, split it on slashes::
>>
>> segments = PATH_INFO.split('/')
>>
>> - Then, de-encode each segment's urlencoded portions:
>>
>> urldecoded_segments = [ urllib.unquote(x) for x in segments ]
>>
>> - Then re-encode each urldecoded segment into the encoding expected
>>   by your application
>>
>> app_segments = [ str(x, encoding='utf-8') for x in
>>  urldecoded_segments ]
>> """
> 
> +1.  I do wish we actually *had* a bytes-with-benefits type (as I 
> proposed on Python-Dev), but I don't think we can really get one 
> until the language moratorium is over.  Plain old bytes are the next 
> best thing. 

We might be able to write one which would work in reduced-instruction-set
mode, and have the server wrap the environ values in it.  Some
operations might not be "natural", and we might have to implement some
wrappers around stdlib stuff, but maybe it would be worthwhile to try a
spike on it.
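
A very rough idea of what such a spike could start from is sketched
below. Everything here is hypothetical: the name HintedBytes, the
encoding attribute, and the text() method are illustrative choices, not
an agreed design.

```python
class HintedBytes(bytes):
    """bytes plus an optional encoding hint (hypothetical API)."""

    def __new__(cls, value, encoding=None):
        self = super().__new__(cls, value)
        self.encoding = encoding
        return self

    def text(self, errors="strict"):
        """Decode using the hint; refuse to guess when there is none."""
        if self.encoding is None:
            raise ValueError("no encoding hint available")
        return self.decode(self.encoding, errors)

    def split(self, sep=None, maxsplit=-1):
        # Accept ASCII str separators and carry the hint over to the parts.
        if isinstance(sep, str):
            sep = sep.encode("ascii")
        return [HintedBytes(part, self.encoding)
                for part in super().split(sep, maxsplit)]


path = HintedBytes(b"/caf\xc3\xa9/x", encoding="utf-8")
parts = path.split("/")
print([p.text() for p in parts])
```

Only split() is wrapped here. A real attempt would have to decide how
many other bytes methods and stdlib functions need the same treatment,
which is the "wrappers around stdlib stuff" cost mentioned above.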


Tres.
- --
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   "Excellence by Design"http://palladion.com


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Chris McDonough
On Sat, 2010-07-17 at 01:33 +0200, Armin Ronacher wrote:
> Hi,
> 
> On 7/17/10 1:20 AM, Chris McDonough wrote:
>  > Let me know if I'm missing something.
> The only thing you miss is that the bytes type of Python 3 is badly 
> supported in the stdlib (not an issue if we reimplement everything in 
> our libraries, not an issue for me) and that the bytes type has no 
> string formattings which makes us do the encode/decode dance in our own 
> implementation so of the missing stdlib functions.

This is why the docs mention "bytes with benefits" instead (like the
Python 2 "str" type). The existence of such a type would be the result
of us lobbying for its inclusion into some future Python 3, or at least
the result of lobbying for a String ABC that would allow us to define
our own.

But.. yeah.  Stdlib support for bytes.  Dunno.   What I really don't
want to do is implement a WSGI spec in terms of Unicodey strings just
because the webby stuff in the stdlib cannot deal with bytes.  Those
stdlib implementations should be changed to deal with bytes-ish things
instead.  I actually think fixing the stdlib will end up being a driver
for the "bytes with benefits" type.  Supporting such a type in the
implementation of stdlib functions is clearly the right way to fix it in
lots of cases, because they will be able to deal with BwB and
Unicodey-strings in exactly the same way.

In the meantime, I think using bytes is the only sane thing to do in
some interim specification, because moving from a spec which is
bytes-oriented to a spec that is text-oriented now will leave us in the
embarrassing position of needing to create yet another bytes-oriented
spec later (as, well, I/O is bytes), when Python 3 matures and realizes
it needs such a hybrid type.

- C




Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 6:20 PM, Chris McDonough  wrote:

> > What are the concrete problems you envision with text request headers,
> > text (URL-quoted) path, and text response status and headers?
>
> Documentation is the main reason.  For example, the documentation for
> making sense of path_info segments in a WSGI that used unicodey-strings
> would, as I understand it, read something like this:
>

Nah, not nearly that hard:

    path_info = urllib.parse.unquote_to_bytes(
        environ['wsgi.raw_path_info']).decode('UTF-8')

I don't see the problem?  If you want to distinguish %2f from /, then you'll
do it slightly differently, like:

    path_parts = [
        urllib.parse.unquote_to_bytes(p).decode('UTF-8')
        for p in environ['wsgi.raw_path_info'].split('/')]

This second recipe is impossible to do currently with WSGI.
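
Both recipes can be tried against a hand-built environ. Note that
'wsgi.raw_path_info' is the key proposed in this thread, not something
existing servers provide, and the sample path is made up.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical environ carrying the proposed URL-quoted text path.
environ = {"wsgi.raw_path_info": "/docs/a%2Fb/caf%C3%A9"}

# Recipe 1: unquote the whole path, then decode.
whole = unquote_to_bytes(environ["wsgi.raw_path_info"]).decode("UTF-8")

# Recipe 2: split first, so an escaped slash stays inside its segment.
parts = [
    unquote_to_bytes(p).decode("UTF-8")
    for p in environ["wsgi.raw_path_info"].split("/")]

print(whole)
print(parts)
```

In the first result the %2F has become an ordinary slash and can no
longer be told apart from a separator. In the second it stays inside the
"a/b" segment.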

So... before jumping to conclusions, what's the hard part with using text?

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 8:46 PM, Ian Bicking  wrote:

> So... before jumping to conclusions, what's the hard part with using text?
>

Oh, the one thing that will be silly is cookies, but they are totally nuts
already.  They can be parsed equally well as bytes or latin1, and best only
transcoded after parsing.  Doing cookie_value.decode(app_encoding) or
cookie_value.encode('ISO-8859-1').decode(app_encoding) isn't terribly
different.  And cookies aren't fair because they are just stupid; like the
standard library I don't think we should design anything around their
idiosyncrasies.
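
The two spellings can be compared side by side. This is a sketch with a
made-up cookie value, and app_encoding is whatever the application has
decided its cookies use (UTF-8 is assumed here).

```python
def cookie_text_from_bytes(cookie_value, app_encoding="utf-8"):
    """Bytes-based WSGI: decode the raw value directly."""
    return cookie_value.decode(app_encoding)


def cookie_text_from_latin1(cookie_value, app_encoding="utf-8"):
    """Text-based WSGI: undo the latin-1 tunnelling first, then decode."""
    return cookie_value.encode("ISO-8859-1").decode(app_encoding)


raw = "name=J\u00fcrgen".encode("utf-8")   # bytes as sent by the browser
tunnelled = raw.decode("ISO-8859-1")       # what a text-based environ holds

print(cookie_text_from_bytes(raw))
print(cookie_text_from_latin1(tunnelled))
```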

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Chris McDonough
On Fri, 2010-07-16 at 20:46 -0500, Ian Bicking wrote:
> On Fri, Jul 16, 2010 at 6:20 PM, Chris McDonough 
> wrote:
> > What are the concrete problems you envision with text
> request headers,
> > text (URL-quoted) path, and text response status and
> headers?
> 
> 
> Documentation is the main reason.  For example, the
> documentation for
> making sense of path_info segments in a WSGI that used
> unicodey-strings
> would, as I understand it, read something like this:
> 
> Nah, not nearly that hard:
> 
> path_info =
> urllib.parse.unquote_to_bytes(environ['wsgi.raw_path_info']).decode('UTF-8')
> 
> I don't see the problem?  If you want to distinguish %2f from /, then
> you'll do it slightly differently, like:
> 
> path_parts = [
> urllib.parse.unquote_to_bytes(p).decode('UTF-8')
> for p in environ['wsgi.raw_path_info'].split('/')]
>  
> This second recipe is impossible to do currently with WSGI.
> 
> So... before jumping to conclusions, what's the hard part with using
> text?

It's extremely hard to swallow Python 3's current disregard for the
primacy of bytes at I/O boundaries.  I'm trying, but I can't help but
feel that the existence of an API like "unquote_to_bytes" is more
symptom treatment than solution.  Of course something that unquotes a
URL segment unquotes it into bytes; it's the only sane default because
URL segments found in URLs on the internet are bytes.

So I guess the "hard part" is more meta.  When you have legitimate
backwards compatibility constraints, suboptimal choices made during
protocol design are excusable.  But it just seems really very weird to
design one (WSGI 2) from scratch with such choices when the only reason
to do so is a systematic low-level denial of reality.  Why would we use
(and, worse, by doing so, implicitly promote) such a system in the first
place?

On the other hand, indignance about the issue shouldn't rule the day
either.  To me, the most pragmatic thing to do that doesn't deny reality
would be to use bytes.  It's also the easiest thing to remember (the
values in the environment are all bytes) and I think we'll be able to
drive the Py3K stdlib forward in a much saner direction if we choose
bytes than if we choose text to represent things that are naturally more
bytes-like.

- C



Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Ian Bicking  wrote:
> On Fri, Jul 16, 2010 at 6:20 PM, Chris McDonough  wrote:
>
>
>
>> What are the concrete problems you envision with text request headers,
>> text (URL-quoted) path, and text response status and headers?
>
> Documentation is the main reason.  For example, the documentation for
> making sense of path_info segments in a WSGI that used unicodey-strings
> would, as I understand it, read something like this:
>
> Nah, not nearly that hard:
>
> path_info = 
> urllib.parse.unquote_to_bytes(environ['wsgi.raw_path_info']).decode('UTF-8')
>
> I don't see the problem?  If you want to distinguish %2f from /, then you'll 
> do it slightly differently, like:
>
> path_parts = [
>     urllib.parse.unquote_to_bytes(p).decode('UTF-8')
>     for p in environ['wsgi.raw_path_info'].split('/')]
>
> This second recipe is impossible to do currently with WSGI.
> So... before jumping to conclusions, what's the hard part with using

Sorry, it is not that simple. The thing that everyone is ignoring is
that SCRIPT_NAME and PATH_INFO are normally also normalized by the web
server. That is, ".." segments are removed. By passing the raw URL
through to the application, you are now forcing every application to
deal with that as well, with the possibility of directory
traversal attacks when people get it wrong and the URL somehow maps
to file system resources. It is a huge can of worms which at
the moment the web server deals with.

I have other issues with the raw stuff, but haven't got to read the
last dozen messages in this discussion as yet, so will leave those
points to another time.

Graham


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 11:28 PM, Graham Dumpleton <
graham.dumple...@gmail.com> wrote:

> > Nah, not nearly that hard:
> >
> > path_info =
> urllib.parse.unquote_to_bytes(environ['wsgi.raw_path_info']).decode('UTF-8')
> >
> > I don't see the problem?  If you want to distinguish %2f from /, then
> you'll do it slightly differently, like:
> >
> > path_parts = [
> > urllib.parse.unquote_to_bytes(p).decode('UTF-8')
> > for p in environ['wsgi.raw_path_info'].split('/')]
> >
> > This second recipe is impossible to do currently with WSGI.
> > So... before jumping to conclusions, what's the hard part with using
>
> Sorry, it is not that simple. The thing that everyone is ignoring is
> that SCRIPT_NAME and PATH_INFO are also normalized by the web server
> normally. That is, .. instances are removed. By passing the raw URL
> through to the application, you are now forcing every application to
> have to deal with that as well with the possibility of directory
> traversal attacks when people get it wrong and the URL is mapping
> somehow to file system resources. It is a huge can of worms which at
> the moment the web server deals with.
>

Well... at least to me "raw" only means "not URL decoded", so it doesn't
necessarily mean you can't clean up the request path.  I guess an attacker
could encode "." to make things harder.

Nevertheless, WSGI servers don't currently guarantee this cleaning.  I added
it to paste.httpserver, but I don't know one way or the other about any
other servers.  A quick test shows wsgiref does not clean paths.  So apps
shouldn't rely on a clean path.
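
For an application that wants to do its own cleaning on a URL-quoted
path, something along these lines could work. It is a sketch only:
clean_segments is a hypothetical helper, not what paste.httpserver does,
and it unquotes each segment before looking for "." and ".." so that
encoded dots are caught too.

```python
from urllib.parse import unquote_to_bytes


def clean_segments(raw_path, encoding="utf-8"):
    """Unquote each segment, then resolve '.' and '..' below the root."""
    cleaned = []
    for segment in raw_path.split("/"):
        segment = unquote_to_bytes(segment).decode(encoding)
        if segment in ("", "."):
            continue              # empty and current-directory segments
        if segment == "..":
            if cleaned:
                cleaned.pop()     # step up, but never above the root
            continue
        cleaned.append(segment)
    return cleaned


print(clean_segments("/static/%2e%2e/%2E%2E/etc/passwd"))
```

A decoded segment can still contain "/" (from %2F), so code that maps
segments onto the file system would need a further check before joining
them into a path.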


-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Ian Bicking
On Fri, Jul 16, 2010 at 9:43 PM, Chris McDonough  wrote:

> > Nah, not nearly that hard:
> >
> > path_info =
> >
> urllib.parse.unquote_to_bytes(environ['wsgi.raw_path_info']).decode('UTF-8')
> >
> > I don't see the problem?  If you want to distinguish %2f from /, then
> > you'll do it slightly differently, like:
> >
> > path_parts = [
> > urllib.parse.unquote_to_bytes(p).decode('UTF-8')
> > for p in environ['wsgi.raw_path_info'].split('/')]
> >
> > This second recipe is impossible to do currently with WSGI.
> >
> > So... before jumping to conclusions, what's the hard part with using
> > text?
>
> It's extremely hard to swallow Python 3's current disregard for the
> primacy of bytes at I/O boundaries.  I'm trying, but I can't help but
> feel that the existence of an API like "unquote_to_bytes" is more
> symptom treatment than solution.  Of course something that unquotes a
> URL segment unquotes it into bytes; it's the only sane default because
> URL segments found in URLs on the internet are bytes.
>

Yes, URL-quoted strings should decode to bytes, though arguably it is
also reasonable to use the UTF-8 default that
urllib.parse.quote/unquote uses.  So it's really just a question of
naming, i.e. whether the variant that gets the extra suffix is
unquote_to_string or unquote_to_bytes.  Which honestly... whatever.

> So I guess the "hard part" is more meta.  When you have legitimate
> backwards compatibility constraints, suboptimal choices made during
> protocol design are excusable.  But it just seems really very weird to
> design one (WSGI 2) from scratch with such choices when the only reason
> to do so is a systematic low-level denial of reality.  Why would we use
> (and, worse, by doing so, implicitly promote) such a system in the first
> place?
>
> On the other hand, indignance about the issue shouldn't rule the day
> either.  To me, the most pragmatic thing to do that doesn't deny reality
> would be to use bytes.  It's also the easiest thing to remember (the
> values in the environment are all bytes) and I think we'll be able to
> drive the Py3K stdlib forward in a much saner direction if we choose
> bytes than if we choose text to represent things that are naturally more
> bytes-like.
>

I do feel like indignance has played a part here.  And in my brief forays
into Python 3 I have been frustrated by the over-textification of APIs.
But... if a compromise works let's not let those experiences color our
choices.

So, here's my criteria for resolving this particular Python 3 issue:

* We should not lose information from the request.  Decoding with UTF-8
(without surrogateescape) would be an example of losing information.
URL-decoding loses us information currently, which is why I wouldn't be
sad to see it go (though if it was only for that reason I wouldn't
bother -- the unicode issue just makes it serendipitous).

* We shouldn't produce wildly inaccurate strings.  E.g., decoding something
with Latin1 when it's an implausible encoding.

* Encoding/decoding errors should only possibly happen at the application
level, or maybe middleware if you are playing around with stuff.  Servers
specifically should never have them (because they can't gracefully handle
them).

* We should avoid server configuration with respect to application policy
(we've avoided it so far, yay!)

* We should support eclectic application layouts, e.g., an application that
sometimes serves Latin-1, sometimes UTF-8 (like if the application proxies
requests or serves up legacy content/apps).

* We should make things as easy to port as possible.  Errors in porting
should be loud.

* As much as possible WSGI should be readable and usable.  Maybe most people
will use a library, but we also have a lot of libraries that handle WSGI,
and it's nice that's been able to happen, so we don't want to make things
any harder than they have to be.  E.g., clearly we should use text environ
keys (luckily we don't have to worry about non-ASCII header names, I guess?)
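
The first criterion (no information loss) can be illustrated with a
request path that is not valid UTF-8. The byte string is made up, and the
three variable names are only for illustration.

```python
raw = b"/caf\xe9"   # Latin-1 encoded bytes, not valid UTF-8

# Decoding as UTF-8 with replacement characters cannot be reversed.
lossy = raw.decode("utf-8", "replace")

# Latin-1 maps every byte to one character, so it always round-trips,
# even when the resulting text is not what the sender meant.
latin1 = raw.decode("latin-1")

# surrogateescape smuggles undecodable bytes through as lone surrogates.
escaped = raw.decode("utf-8", "surrogateescape")

print(lossy.encode("utf-8") == raw)
print(latin1.encode("latin-1") == raw)
print(escaped.encode("utf-8", "surrogateescape") == raw)
```

Only the first round trip fails. The latin-1 string happens to be
accurate for this particular input, but for UTF-8 input it would be the
kind of inaccurate string the second criterion warns about.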

-- 
Ian Bicking  |  http://blog.ianbicking.org


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Friday, July 16, 2010, And Clover  wrote:
> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>
> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.
>
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one 
> really uses non-ASCII script filenames,

FWIW, I had to go to a lot of trouble to allow non-ASCII in the final
SCRIPT_NAME in mod_wsgi. Specifically, using the AddHandler directive in
Apache means a file system path can make up part of SCRIPT_NAME. I had
someone who was specifically using Russian in a WSGI script file name,
and because with AddHandler that becomes part of SCRIPT_NAME, you had
to cater for it. Anyway, this was more of a Windows issue in having to
use special file system functions to deal with the fact that on Windows
filesystem paths aren't UTF-8 but something else.

What this does highlight though is that although one can talk about
passing the raw script name through to the application, that isn't
necessarily right, as it isn't the application that dictates what
encoding may be used but the web server, which performs the mapping of
that part of the original URL path to a potential filesystem resource.
Alternatively, where the mount point is set by file-based configuration,
it is the encoding of the web server configuration file that dictates it.

We touched on all of this before in prior discussions, thus original
raw value is only relevant in PATH_INFO and not SCRIPT_NAME as in the
case of the latter it is the web server that dictates the charset
based on configuration file encoding or file system encoding.

Graham


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Ian Bicking  wrote:
> On Fri, Jul 16, 2010 at 4:33 AM, And Clover  wrote:
>
>
> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>
> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.
>
>
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one 
> really uses non-ASCII script filenames, and non-ASCII characters in 
> Cookie/Set-Cookie are still handled so differently/brokenly across browsers 
> that you can't rely on them at all.)
>
>
>
>
> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions
>
>
>
> For compatibility with existing apps, how about keeping the existing 
> SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying 
> that the new 'raw' versions (whatever they are called) are added only if they 
> really are raw, not reconstructed.
>
> Having two ways of expressing the same information will lead to bugs related 
> to which data is canonical.  If an application is using SCRIPT_NAME/PATH_INFO 
> and then updates those values in any way, and 
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be weird 
> bugs and code will disagree about which one is correct.  Since %2f can exist 
> in the raw versions, there isn't even a way to chunk the two variables in the 
> same way.
>
>
> Then existing scripts that don't care about non-ASCII and slashes can carry 
> on as before, and for apps that do care about them, they'll be able to be 
> *sure* the input is correct. Or they can fall back to PATH_INFO when not 
> present, and avoid producing these kind of URLs in response.
>
> I don't think it works to imagine you can just not care about non-ASCII.  
> Requests come in.  WSGI should represent those requests.  If a request comes 
> in with non-ASCII bytes then WSGI needs to do *something* with it.  I don't 
> want to have to configure servers with application policy; servers should 
> just work.
>
> And this doesn't help with Python 3: either we have byte values of 
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think bytes 
> will be more awkward to port to than text, and inconsistent with other WSGI 
> values.  If we have text then we have to choose an encoding.  Latin1 will 
> work, but it will be the exact wrong encoding most of the time as UTF-8 is 
> the typical encoding (unlike other headers, where Latin1 will mostly be an okay 
> encoding, or as good a guess as we have).  If we firmly remove these keys 
> then we can avoid this choice entirely... and we conveniently also get a 
> better representation of the request.
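
To make the %2f point in the quoted text concrete, here is a small Python 3 sketch. The request path is invented for illustration, and only the standard library's urllib.parse.unquote is used. Once the path has been decoded, an encoded slash can no longer be told apart from a real separator, so the raw and decoded forms split into different segments.

```python
from urllib.parse import unquote

# Hypothetical raw request path: the second segment contains an encoded slash.
raw_path = "/wiki/AC%2FDC/edit"

# Split the raw path first, then decode each segment: "AC/DC" stays whole.
raw_segments = [unquote(seg) for seg in raw_path.split("/")[1:]]

# Decode first (roughly what a CGI-style PATH_INFO gives you), then split:
# the encoded slash now looks like a path separator.
decoded_segments = unquote(raw_path).split("/")[1:]

print(raw_segments)      # ['wiki', 'AC/DC', 'edit']
print(decoded_segments)  # ['wiki', 'AC', 'DC', 'edit']
```

Code that walks the raw value and code that walks the decoded value would therefore disagree about how many segments there are, which is one way the two representations could drift apart.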

One reason I don't want to see the existing keys removed is for
debugging purposes. In Apache, various modules such as mod_rewrite
operate on the translated path. I am concerned that if only the raw
one is available in the WSGI application, confusion may arise when
something doesn't go right with rewrites, because the only information
an application can dump for debugging will differ from what other
Apache modules operate on. If you aren't going to make use of the CGI
versions, then I would still like to see them present, but perhaps
renamed. That way you don't lose information when trying to debug
things. I could perhaps just put this in an Apache/mod_wsgi specific
key as well, given that the issue is particular to it. Thus we might
have apache.path_info or cgi.path_info.

Graham

> Note that libraries can smooth over this change; WebOb for instance will 
> certainly still support req.script_name/req.path_info by decoding the raw 
> values.  Admittedly lots of code use these values directly... but at least if 
> they get a KeyError the port/fix will be obvious (as opposed to out of sync 
> values, which will only emerge as a problem occasionally -- I'd rather not 
> invite more occasional bugs).
>
> --
> Ian Bicking  |  http://blog.ianbicking.org
>


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Ian Bicking  wrote:
> On Fri, Jul 16, 2010 at 12:28 PM, Chris McDonough  wrote:
>
>
> On Fri, 2010-07-16 at 11:07 -0500, Ian Bicking wrote:
>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I
>> think bytes will be more awkward to port to than text, and
>> inconsistent with other WSGI values.  If we have text then we have to
>> choose an encoding.  Latin1 will work, but it will be the exact wrong
>> encoding most of the time as UTF-8 is the typical encoding (unlike other
>> headers, where Latin1 will mostly be an okay encoding, or as good a
>> guess as we have).  If we firmly remove these keys then we can avoid
>> this choice entirely... and we conveniently also get a better
>> representation of the request.
>
> My $.02: I'd rather lobby the core folks for a string ABC (which we can
> hook with a stringlike bytes type) and consider all 3.X releases made so
> far "dead to WSGI" than to have to tunnel arbitrary bytes through some
> misleading Unicode encoding.
>
> While I think it would be generally useful, it's also a long way off at best, 
> with serious performance dangers that could torpedo the whole thing.  But... 
> I'm also unsure how it would help here, except perhaps we could incrementally 
> annotate bytes with an encoding?  Well, I don't really know.  Treating the 
> raw request path as text is easy enough, as it should always be ASCII 
> anyway.  We don't have to worry what is "right" or "wrong" in this case.
>
> We could make everything bytes and be done with it, but it would make it much 
> harder to port Python 2 WSGI code to Python

FWIW, I see the whole ebytes discussion as relevant only if you were to
make absolutely everything bytes. We don't really need it otherwise.

Graham


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Gustavo Narea  wrote:
> Hello,
>
> Ian said:
>> Having two ways of expressing the same information will lead to bugs
>> related to which data is canonical.  If an application is using
>> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
>> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
>> weird bugs and code will disagree about which one is correct.  Since %2f
>> can exist in the raw versions, there isn't even a way to chunk the two
>> variables in the same way.
>
> I can't agree more.
>
> I would propose the following, and excuse me in advance if this has already
> been proposed and discarded -- I've tried to follow this topic on the mailing
> list over the past few months, until it becomes an endless discussion.
>
> I think only the raw values should be available. Even if a middleware changes
> them, it must put them with raw values. And because you cannot change those
> values without knowing what encoding the request uses, the character encoding
> *must* be present.
>
> I know that sounds easy but it's not, because browsers don't specify the
> charset in the Content-Type and instead they generate a new request using the
> charset from the previous response. So the charset is unknown to the
> server/gateway and the middleware stack.
>
> So, what we could do is introduce a mandatory variable called, say,
> wsgi.charset, and would be used as follows:

Something like this was proposed before, but it only applied to the
keys that mattered, specifically PATH_INFO and maybe QUERY_STRING
(the latter of which this discussion has been ignoring, and I can't
remember how we previously worked out it should be treated). It didn't
cover SCRIPT_NAME because, as I indicated before, the encoding of that
is really dictated by the server and not the application, for the
initial value at least.

The idea was that the server would pass them as Latin 1 and set the
encoding key. If a consumer of it didn't like the encoding it was in,
it would convert it back to bytes and then to what it wants and update
the encoding key to what it used. Thus you had a hint available to
allow reliable transcoding. This proposal didn't get acceptance
either.
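
A minimal Python 3 sketch of that round trip, for illustration only. The key name wsgi.path_encoding and the helper recode_path_info are made up here; the thread does not record what the earlier proposal actually called the key.

```python
def recode_path_info(environ, wanted="utf-8"):
    """Re-decode PATH_INFO in place using the hint key (hypothetical names)."""
    current = environ.get("wsgi.path_encoding", "latin-1")
    if current == wanted:
        return environ["PATH_INFO"]
    # Latin-1 maps every byte to exactly one code point, so encoding with it
    # recovers the bytes the server originally received.
    raw = environ["PATH_INFO"].encode(current)
    try:
        text = raw.decode(wanted)
    except UnicodeDecodeError:
        # Not valid in the wanted encoding: leave value and hint untouched.
        return environ["PATH_INFO"]
    environ["PATH_INFO"] = text
    environ["wsgi.path_encoding"] = wanted
    return text


# What a server would hand over for the UTF-8 request bytes of "/caf\u00e9".
environ = {
    "PATH_INFO": "/caf\u00e9".encode("utf-8").decode("latin-1"),
    "wsgi.path_encoding": "latin-1",
}
recode_path_info(environ)
print(environ["PATH_INFO"] == "/caf\u00e9")  # True
print(environ["wsgi.path_encoding"])         # utf-8
```

Because the hint is updated together with the value, the next consumer in the stack knows which encoding to reverse if it wants something else.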

Graham

>  - It MUST be set by the server or gateway on every request.
>  - Every middleware or application that reads or writes these values MUST use
> the charset specified in wsgi.charset.
>  - If a server, gateway, middleware or application wants to change the charset
> and it is possible*, it MUST convert the *entire* request into that charset
> and update wsgi.charset accordingly.
>  - When the charset is not specified in the HTTP request, UTF-8 MUST be
> assumed by the server/gateway, unless another default charset has been
> specified by the user.
>
> I think/hope that will solve all the problems.
>
> What happens when a WSGI application is actually made up of two WSGI applications
> and they send the responses in different charsets? If it's not possible to
> configure them so that they both use the same charsets, then one of them would
> have to be wrapped by a middleware which:
>  - On egress, converts the responses using the charset used by the other
> application.
>  - On ingress, if the charset is not specified in the request, it will assume
> it's the one used by the other application, and thus it will convert the
> request using the charset supported by the wrapped application.
>
> It would look like this:
> ===
> def application(environ, start_response):
>     # environ is a dict, so test the request path rather than environ itself.
>     if environ.get("PATH_INFO", "").startswith("/trac/"):
>         # Say Trac only supports Latin-1 and we want responses to use UTF-8:
>         app = trac.web.main.dispatch_request
>         app = CharsetNormalizer(app, response="latin-1", request="utf8")
>     else:
>         # myapp uses UTF-8
>         app = myapp
>     return app(environ, start_response)
> ===
>
> Then there's the string vs bytes issue. Bytes would be the natural choice to
> represent these raw values, but they would probably cause more trouble than they
> solve. So, I think they should be strings that contain the ASCII raw
> encoded values (i.e., str on both versions of Python).
>
> What do you think about this? Again, sorry if this has been discarded before!
> :)
>
> * For example, you can always convert Latin-1 to UTF-8, but not every UTF-8
> string can be converted to Latin-1.
> --
> Gustavo Narea .
> | Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |
>


Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Ian Bicking  wrote:
> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby  wrote:
>
>
> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>
> And this doesn't help with Python 3: either we have byte values of 
> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think 
> bytes will be more awkward to port to than text, and inconsistent with other 
> WSGI values.
>
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto 
> the app (or framework) developer...  who's really the only one who can make 
> the right decision for their particular application.  And personally, I'd 
> rather have clear boundaries between text and bytes, such that porting (even 
> if tedious or awkward) is *consistent*, and clear as to when you're finished, 
> not, "oh, did I check to make sure I converted SCRIPT_NAME and PATH_INFO...  
> not just in my app code, but in all the library code I call *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to realize 
> that we might just as well make the *entire* stack bytes (incoming and 
> outgoing headers *and* streams), and rewrite that bit in PEP 333 about using 
> str on "Python 3000" to say we go with bytes on Python 3+ for everything 
> that's a str in today's WSGI.
>
> This was my first intuition too, until I started thinking in more detail 
> about the particular values involved.  Some obviously are textish, like 
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>
> Basically all the internal strings are textish, so we're left with:
>
> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
>
> And there's a few things like REMOTE_USER that are kind of in the middle.  
> Everyone is in agreement that bodies should be bytes.
>
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for 
> instance there's no good way to reconstruct the URL using the stdlib.  That 
> explains certain tensions, but I think we should ignore that, and in fact 
> that's what Python-Dev seemed to say pretty clearly.
>
> Now, the other keys:
>
> wsgi.url_scheme: clearly ASCII
>
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old 
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL 
> encoding happens at the byte layer, so a server could reasonably URL encode 
> any non-ASCII characters without imposing any  encoding.
>
> QUERY_STRING: should be ASCII, same as raw request path
>
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by 
> the specification.  The spec also implies you have to use the RFC2047 inline 
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and 
> supporting it would probably be a bad idea for security reasons.  The Atompub 
> spec (reasonably modern) specifically says Title headers should be encoded 
> with RFC2047 (if they are not ISO-8859-1): 
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- decoding 
> this kind of encoding at the application layer seems reasonable to me.
>
> cookie header: this specific header can easily have multiple encodings, as 
> the browser encodes data then treats it as opaque bytes, so a cookie can be 
> set via UTF-8 one place, Latin1 another, and those coexist in one header.  
> That is, there is no real encoding and this should be treated as bytes.  
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but 
> entirely workable.)
>
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In 
> practice it is almost always ASCII, and since it is not user-visible it's not 
> something that really needs localization.
>
> response headers: the spec implies Latin1, in practice the Set-Cookie header 
> is bytes (since interoperation with wonky legacy systems is not uncommon).  
> I'm not sure of any other exceptions?
>
>
> So... to me it seems pretty reasonable for HTTP specifically that text can 
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and 
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] should 
> be in that mode.  And it would also be weird if environ['SERVER_NAME'] was 
> bytes.
>
> In the past when we've gotten down to specifics, the only holdup has been 
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
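
For the RFC 2047 inline encoding mentioned in the quoted text, decoding at the application layer can be sketched with the Python 3 standard library's email.header.decode_header. The helper name decode_rfc2047 and the Latin-1 fallback for unlabelled chunks are choices made for this sketch, not anything the thread or the WSGI spec prescribes.

```python
from email.header import decode_header


def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value (illustrative helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Unlabelled byte chunks are treated as Latin-1 in this sketch.
            chunk = chunk.decode(charset or "latin-1")
        parts.append(chunk)
    return "".join(parts)


print(decode_rfc2047("=?iso-8859-1?q?some=20text?="))  # some text
print(decode_rfc2047("plain title"))                    # plain title
```

An application would call this only on headers it expects to carry such values (the Atompub Title header, say), which keeps the server and gateway out of the decision.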

There were a few other odd ones, though they are server specific,
for example PATH_TRANSLATED (??). These are ones where, again, the
server or operating system dictates the encoding, because parts of
them derive from things like filesystem paths and server
configuration files. I laboriously went through all these in an email
last year or earlier.

This is the same reason why SCRIPT_NAME is really dictated by the
server, and why the raw value perhaps should be passed through to the
application.

Graham

Re: [Web-SIG] WSGI for Python 3

2010-07-16 Thread Graham Dumpleton
On Saturday, July 17, 2010, Graham Dumpleton  wrote:
> On Saturday, July 17, 2010, Ian Bicking  wrote:
>> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby  wrote:
>>
>>
>> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>>
>> And this doesn't help with Python 3: either we have byte values of 
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think 
>> bytes will be more awkward to port to than text, and inconsistent with other 
>> WSGI values.
>>
>>
>> OTOH, it has the tremendous advantage of pushing the encoding question onto 
>> the app (or framework) developer...  who's really the only one who can make 
>> the right decision for their particular application.  And personally, I'd 
>> rather have clear boundaries between text and bytes, such that porting (even 
>> if tedious or awkward) is *consistent*, and clear as to when you're 
>> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and 
>> PATH_INFO...  not just in my app code, but in all the library code I call 
>> *from* my app?"
>>
>> IOW, the bytes/string discussion on Python-dev has kind of led me to realize 
>> that we might just as well make the *entire* stack bytes (incoming and 
>> outgoing headers *and* streams), and rewrite that bit in PEP 333 about using 
>> str on "Python 3000" to say we go with bytes on Python 3+ for everything 
>> that's a str in today's WSGI.
>>
>> This was my first intuition too, until I started thinking in more detail 
>> about the particular values involved.  Some obviously are textish, like 
>> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>>
>> Basically all the internal strings are textish, so we're left with:
>>
>> wsgi.url_scheme
>> SCRIPT_NAME/PATH_INFO
>> QUERY_STRING
>> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
>> response status
>> response headers (name and value)
>>
>> And there's a few things like REMOTE_USER that are kind of in the middle.  
>> Everyone is in agreement that bodies should be bytes.
>>
>> One initial problem is that the Python 3 stdlib handles bytes poorly, so for 
>> instance there's no good way to reconstruct the URL using the stdlib.  That 
>> explains certain tensions, but I think we should ignore that, and in fact 
>> that's what Python-Dev seemed to say pretty clearly.
>>
>> Now, the other keys:
>>
>> wsgi.url_scheme: clearly ASCII
>>
>> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old 
>> legacy encoding.
>> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL 
>> encoding happens at the byte layer, so a server could reasonably URL encode 
>> any non-ASCII characters without imposing any  encoding.
>>
>> QUERY_STRING: should be ASCII, same as raw request path
>>
>> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by 
>> the specification.  The spec also implies you have to use the RFC2047 inline 
>> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and 
>> supporting it would probably be a bad idea for security reasons.  The 
>> Atompub spec (reasonably modern) specifically says Title headers should be 
>> encoded with RFC2047 (if they are not ISO-8859-1): 
>> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- 
>> decoding this kind of encoding at the application layer seems reasonable to 
>> me.
>>
>> cookie header: this specific header can easily have multiple encodings, as 
>> the browser encodes data then treats it as opaque bytes, so a cookie can be 
>> set via UTF-8 one place, Latin1 another, and those coexist in one header.  
>> That is, there is no real encoding and this should be treated as bytes.  
>> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but 
>> entirely workable.)
>>
>> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In 
>> practice it is almost always ASCII, and since it is not user-visible it's 
>> not something that really needs localization.
>>
>> response headers: the spec implies Latin1, in practice the Set-Cookie header 
>> is bytes (since interoperation with wonky legacy systems is not uncommon).  
>> I'm not sure of any other exceptions?
>>
>>
>> So... to me it seems pretty reasonable for HTTP specifically that text can 
>> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and 
>> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] 
>> should be in that mode.  And it would also be weird if 
>> environ['SERVER_NAME'] was bytes.
>>
>> In the past when we've gotten down to specifics, the only holdup has been 
>> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
>
> There were a few other weird ones which are though server specific.
> For example PATH_TRANSLATED (??). These are ones where again the
> server or operating system dictates the encoding due to them having
> bits in them deriving from things like filesystem paths and server
> configuration files. I laboriously went through all these in