Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread And Clover

Manlio Perillo wrote:


> Words of *TEXT MAY contain characters from character sets other than
> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047


Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to 
RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself 
specifically denies that an encoded-word can go in a quoted-string.
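For readers unfamiliar with the syntax, an RFC 2047 encoded-word has the form =?charset?encoding?text?=. A minimal sketch using Python's standard email.header module, which implements that mail-oriented scheme. The sample value is made up for illustration, and nothing here implies that HTTP servers or WSGI gateways decode headers this way:

```python
from email.header import decode_header

# A made-up encoded-word: charset utf-8, "Q" (quoted-printable-like) encoding.
sample = '=?utf-8?q?caf=C3=A9?='

# decode_header returns a list of (bytes, charset) pairs for encoded-words.
parts = decode_header(sample)

# Join the decoded chunks into one text string.
decoded = ''.join(chunk.decode(charset) for chunk, charset in parts)
print(decoded)
```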


RFC2047 encoded-words are not on-topic in an HTTP header(*); this has 
been confirmed by newer development work on HTTPbis by Reschke et al. 
(http://tools.ietf.org/wg/httpbis/).


The "correct" way of escaping header parameters in an RFC*822-family 
protocol would be RFC2231's complex encoding scheme, but HTTP is 
explicitly not an 822-family protocol despite sharing many of the same 
constructs. See 
http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a 
strategy for how 2231 should interact with HTTP, but note that for now 
RFC2231-in-HTTP simply does not exist in any deployed tools.
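As a rough illustration of what the RFC 2231 style looks like when applied to a single parameter value, here is a sketch of the charset'language'percent-encoded form. The helper names are hypothetical, the language tag is left empty, and this is a simplification rather than a complete implementation of the draft:

```python
from urllib.parse import quote, unquote

def encode_ext_value(text, charset='UTF-8'):
    """Hypothetical helper: build charset''percent-encoded-bytes."""
    # safe='' percent-encodes everything except unreserved characters.
    return "{}''{}".format(charset, quote(text, safe='', encoding=charset))

def decode_ext_value(ext_value):
    """Hypothetical helper: reverse encode_ext_value."""
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset, errors='strict')

encoded = encode_ext_value('\u20ac rates')   # Euro sign followed by " rates"
print(encoded)
print(decode_ext_value(encoded))
```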


So for now there is basically nothing useful WSGI can do other than 
provide direct, byte-oriented (even if wrapped in 8859-1 unicode 
strings) access to headers.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/

___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread Henry Precheur
On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
> There is something that I don't understand.
> 
> Some HTTP headers, like Accept-Language, contain data described as
> `token`, where:
> 
> token  = 1*<any CHAR except CTLs or separators>
> 
> So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
> In Python 3.x it SHOULD be a byte string.

I think this is more of an issue that frameworks should deal with. By
decoding every header value to latin-1:

* It keeps WSGI simple. Simple is good.

* WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
  says. WSGI is about HTTP, but that doesn't necessarily include all
  other standards extending HTTP.

* It's possible to convert latin-1 strings to bytes without losing data.
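The last point can be checked directly. A small sketch, assuming Python 3 and a made-up header value that the sender actually encoded as UTF-8:

```python
# Every possible byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')

# A framework that knows better can therefore re-decode a header itself.
raw_header = 'caf\u00e9'.encode('utf-8')    # bytes on the wire (assumed UTF-8)
wsgi_value = raw_header.decode('latin-1')   # what a latin-1 gateway hands over
recovered = wsgi_value.encode('latin-1').decode('utf-8')
print(recovered)
```

The intermediate wsgi_value is mojibake, but no information has been lost.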

-- 
  Henry Prêcheur


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread Manlio Perillo
And Clover wrote:
> Manlio Perillo wrote:
> 
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLS should be decoded using
> 
>>   url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes
> to be trivially extracted. This is consistent with the other headers and
> would be my preferred approach.
> 

There is something that I don't understand.

Some HTTP headers, like Accept-Language, contain data described as
`token`, where:

token  = 1*<any CHAR except CTLs or separators>

So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.

Text content is described as `TEXT`, where:

The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].

TEXT   = <any OCTET except CTLs, but including LWS>


The only type of data where TEXT can be used is `quoted-string`.

A `quoted-string` only appears in well-specified portions of a header.
So, IMHO, it is *not* correct for a WSGI middleware to return all HTTP
headers as Unicode strings.

This is up to the application/framework, which must parse each header,
split it into components and handle each as appropriate (as a byte
string, a Unicode string or an instance of some other data type).


> [...]


Regards   Manlio


Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread Henry Precheur
On Thu, Dec 03, 2009 at 08:33:19PM +0100, Manlio Perillo wrote:
> Right now I'm doing a: username.decode('us-ascii', 'replace')

Or like most frameworks you could let the application author deal with
the problem, just pass the raw strings to the application.

-- 
  Henry Prêcheur


Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread Manlio Perillo
Henry Precheur wrote:
> [...]
>> How is authorization username handled in common WSGI frameworks?
> 
> As far as I know, they don't handle this. They just return the string
> without dealing with the encoding issues.
> 
> I think there is no correct way of handling this, because 99% of
> username/password contain only ascii characters. A possible 'workaround'
> would be to limit yourself to the ascii charset. If you get a non-ascii
> character raise an Exception.
> 

Right now I'm doing a: username.decode('us-ascii', 'replace')
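For reference, this is roughly what that call does to a non-ASCII username. A sketch in Python 3 terms (where the raw value is a bytes object), using the cp1252 bytes reported elsewhere in this thread as sample input:

```python
# "àè€" as a cp1252 browser would send it (sample input, per the thread).
raw_username = b'\xe0\xe8\x80'

# 'replace' turns each undecodable byte into U+FFFD, so the original
# characters cannot be recovered afterwards.
lossy = raw_username.decode('us-ascii', 'replace')

# A pure-ASCII username is unaffected.
plain = b'manlio'.decode('us-ascii', 'replace')
print(repr(lossy), plain)
```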



Regards  Manlio


Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread Henry Precheur
On Thu, Dec 03, 2009 at 05:09:31PM +0100, Manlio Perillo wrote:
> This is really a mess.

RFC 2617 doesn't specify any encoding for its headers, so it should be
latin-1 everywhere. But on the web nobody respects standards.

> How is authorization username handled in common WSGI frameworks?

As far as I know, they don't handle this. They just return the string
without dealing with the encoding issues.

I think there is no correct way of handling this, because 99% of
username/password contain only ascii characters. A possible 'workaround'
would be to limit yourself to the ascii charset. If you get a non-ascii
character raise an Exception.
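A minimal sketch of that workaround for HTTP Basic credentials. The function name is hypothetical and error handling is deliberately thin; it assumes the Authorization header value is available as a string:

```python
import base64

def parse_basic_ascii(header_value):
    """Hypothetical helper: return (username, password), ASCII only."""
    scheme, _, payload = header_value.partition(' ')
    if scheme.lower() != 'basic':
        raise ValueError('not a Basic Authorization header')
    raw = base64.b64decode(payload)
    user, sep, password = raw.partition(b':')
    if not sep:
        raise ValueError('missing ":" separator in credentials')
    # bytes.decode('ascii') raises UnicodeDecodeError (a ValueError
    # subclass) as soon as a non-ASCII byte shows up.
    return user.decode('ascii'), password.decode('ascii')

good = 'Basic ' + base64.b64encode(b'alice:secret').decode('ascii')
print(parse_basic_ascii(good))
```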

-- 
  Henry Prêcheur


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread Henry Precheur
On Thu, Dec 03, 2009 at 07:35:14PM +0100, And Clover wrote:
> >I don't know what the HTTP/Cookie spec says about this.
> 
> The traditional interpretation of RFC2616 is that headers are ISO-8859-1.
> 
> You will notice that no browser correctly follows this.

RFC 2109 and RFC 2965 say that a cookie's value can be anything:

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Theoretically you could put something like 'foo\n\0bar' in a cookie.

Also a cookie can include comments which have to be encoded using ...
UTF-8:

> Comment=value
>   OPTIONAL.  Because cookies can be used to derive or store
>   private information about a user, the value of the Comment
>   attribute allows an origin server to document how it intends to
>   use the cookie.  The user can inspect the information to decide
>   whether to initiate or continue a session with this cookie.
>   Characters in value MUST be in UTF-8 encoding.

-- 
  Henry Prêcheur


Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread And Clover

Manlio Perillo wrote:


> I have written a simple WSGI application that asks authentication
> credentials


Ho ho! This is another area that is Completely Broken Everywhere. It's 
actually a similar situation to the cookies:


- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
  mangling any characters that don't fit in the codepage through the
  traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
  gets through but everything else is mangled)
- Safari uses ISO-8859-1, and refuses to send any cookie containing
  characters outside the 8859-1 repertoire.
- Konqueror uses ISO-8859-1, and replaces any non-8859-1 character
  with a question mark.

The HTTP standard has nothing to say about the encoding in use *inside* 
the base64-encoded Authorization byte-string token. It's anyone's guess, 
and every browser has guessed differently. (Safari here is at least 
slightly better than its behaviour with the cookies.)
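To make the divergence concrete, here is a sketch of the bytes some of these strategies would produce for the test username "àè€". The codepage choice for IE is an assumption (cp1252, as on a Western European Windows install), and the low-byte behaviour attributed to Mozilla is left out of this comparison:

```python
username = '\u00e0\u00e8\u20ac'   # "àè€"

# Opera and Chrome: UTF-8.
utf8 = username.encode('utf-8')

# IE: system codepage; cp1252 is assumed here.
cp1252 = username.encode('cp1252')

# Konqueror-style: ISO-8859-1 with '?' for anything outside it.
latin1_q = username.encode('latin-1', 'replace')

for label, data in [('utf-8', utf8), ('cp1252', cp1252), ('latin-1/?', latin1_q)]:
    print(label, repr(data))
```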


> (and I suspect that [IE] always use this encoding, instead of
> iso-8859-1).

It will certainly never send ISO-8859-1, but what it does send is locale 
dependent. Type an e-acute in your username on a Western machine and 
it'll send one byte sequence; type the same thing on an Eastern European 
Windows install and you'll get something quite different.



> Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac'



> I don't know where \xac come from


It's the low byte of UCS-2 codepoint U+20AC (EURO SIGN). Firefox simply 
discards the top 8 bits of each codepoint.
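That truncation is easy to reproduce. A sketch with a hypothetical helper name, assuming the input only contains characters from the Basic Multilingual Plane (so each character is one 16-bit code unit):

```python
def low_bytes(text):
    """Hypothetical helper: keep only the low 8 bits of each code point."""
    return bytes(ord(ch) & 0xFF for ch in text)

euro = low_bytes('\u20ac')            # U+20AC EURO SIGN
accented = low_bytes('\u00e0\u00e8')  # "àè", both within ISO-8859-1
print(repr(euro), repr(accented))
```

Characters inside ISO-8859-1 come out unchanged, which matches the observation that only characters such as the Euro sign get mangled.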



> Unfortunately I can not test with IE 7 and 8.


The behaviour has not changed.

> This is really a mess.

Isn't it.

> How is authorization username handled in common WSGI frameworks?

No-one supports non-ASCII characters in Authentication. Most web authors 
simply move to cookies instead.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/



Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread James Y Knight
On Dec 3, 2009, at 1:35 PM, And Clover wrote:
> Manlio Perillo wrote:
> 
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLS should be decoded using
> 
>>  url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be 
> trivially extracted. This is consistent with the other headers and would be 
> my preferred approach.

Right, for WSGI 1.1 on Python 3.x, 8859-1 strings is the plan. Other, more 
ideologically pure options can be discussed for an incompatible revision of 
WSGI (e.g. the hypothetical 2.0).

BTW: I hope to have a first draft of the changes by Monday. (But don't beat up 
on me if it's delayed; I am working on it.)

James


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread Manlio Perillo
And Clover wrote:
> [...]
>> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
>> going to assume that data is encoded in latin-1.
> 
> Yeah. This is no big deal because non-ASCII characters in cookies are
> already broken everywhere(*). Given this and other limitations on what
> characters can go in cookies, they are habitually encoded using ad-hoc
> mechanisms handled by the application (typically a round of URL-encoding).
> 
> *: in particular:
> 
> - Opera and Chrome send non-ASCII cookie characters in UTF-8.
> - IE encodes using the system codepage (which can never be UTF-8),
>   mangling any characters that don't fit in the codepage through the
>   traditional Windows 'similar replacement character' scheme.
> - Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
>   gets through but everything else is mangled)
> - Safari refuses to send any cookie containing non-ASCII characters.
> 

Thanks for this summary.
I think it should go in a wiki or in a separate document (like
rationale) to the WSGI spec.

However, this should never happen with cookies, since cookie data is
opaque to the browser, which MUST send it "as is".

What you describe happens with other headers containing TEXT.
And now I understand the strange behaviour of Firefox with non-latin-1
strings in the username, in HTTP Basic Authentication.

> [...]

Regards   Manlio


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread And Clover

Manlio Perillo wrote:


> However what about URI (that is, for PATH_INFO and the like)?
> For URI (if I remember correctly) the suggested encoding is UTF-8, so
> URLS should be decoded using
>
>   url.decode('utf-8', 'surrogateescape')
>
> Is this correct?


The currently-discussed proposal is ISO-8859-1, allowing the real bytes 
to be trivially extracted. This is consistent with the other headers and 
would be my preferred approach.


Python 3.1's wsgiref.simple_server, on the other hand, blindly uses 
urllib.parse.unquote, which defaults to UTF-8 without surrogateescape, 
mangling any non-UTF-8 input.
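The mangling is easy to demonstrate. A sketch as run on a current Python 3, using a made-up path whose percent-escape is a latin-1 byte rather than valid UTF-8:

```python
from urllib.parse import unquote, unquote_to_bytes

path = '/caf%E9'   # 0xE9 is "é" in ISO-8859-1, but not valid UTF-8 alone

# Default: UTF-8 with errors='replace', so the byte becomes U+FFFD.
mangled = unquote(path)

# Alternatives that keep the original byte recoverable.
as_bytes = unquote_to_bytes(path)
as_latin1 = unquote(path, encoding='latin-1')

print(repr(mangled), repr(as_bytes), repr(as_latin1))
```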


I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding 
is blessed. But *something* needs to be blessed. An encoding, an 
alternative undecoded path_info, both, something else... just *something*.



> Let's consider the `wsgiref.util.application_uri` function
> There is a potential problem, here, with the quote function.


Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 
3.0, but still broken. Until we can come to a Pronouncement on what WSGI 
*is* in Python 3, it is meaningless anyway.



> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
> going to assume that data is encoded in latin-1.


Yeah. This is no big deal because non-ASCII characters in cookies are 
already broken everywhere(*). Given this and other limitations on what 
characters can go in cookies, they are habitually encoded using ad-hoc 
mechanisms handled by the application (typically a round of URL-encoding).


*: in particular:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
  mangling any characters that don't fit in the codepage through the
  traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
  gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.


> I don't know what the HTTP/Cookie spec says about this.


The traditional interpretation of RFC2616 is that headers are ISO-8859-1.

You will notice that no browser correctly follows this.

...sigh.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/




Re: [Web-SIG] HTTP headers encoding

2009-12-03 Thread Manlio Perillo
Manlio Perillo wrote:
> Hi.
> 
> I'm doing some tests to try to understand how HTTP headers are encoded
> by browsers.
> 
> I have written a simple WSGI application that asks authentication
> credentials and then print them on the terminal and return the data as
> response, as raw bytes
> http://paste.pocoo.org/show/154633/
> 

I'm now testing using HTTP Digest Authentication.
The application is here:
http://paste.pocoo.org/show/154667/

It uses my wsgix framework
http://hg.mperillo.ath.cx/wsgix/
since I don't want to rewrite the entire Digest Authentication handling.


As the user name I use the string "àè€".
The results are:

- Firefox does not send any request, and instead shows me the returned
  response body "Authentication required".

  This is quite strange.

- Internet Explorer 6 encodes the username using cp1252, as always.

- Opera (10.01) encodes the username using utf-8.

I cannot test with Konqueror, since the wsgiref server has problems
with it.


All these implementations are against the HTTP spec.
username is a quoted-string, so it SHOULD be encoded using the default
latin-1, or using another charset, in which case it should be formatted
as specified by MIME (unfortunately there are no examples in the HTTP
spec).


This is really a mess.
How is authorization username handled in common WSGI frameworks?




Thanks  Manlio


[Web-SIG] HTTP headers encoding

2009-12-03 Thread Manlio Perillo
Hi.

I'm doing some tests to try to understand how HTTP headers are encoded
by browsers.

I have written a simple WSGI application that asks for authentication
credentials, prints them on the terminal, and returns the data in the
response as raw bytes:
http://paste.pocoo.org/show/154633/
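Since the paste may no longer be reachable, here is a rough sketch of what such a test application could look like. This is a reconstruction under assumptions, not the original code: the name application, the realm string and the choice to echo the repr of the username bytes are all made up for illustration.

```python
import base64

def application(environ, start_response):
    """Hypothetical test app: echo the raw Basic auth username bytes."""
    auth = environ.get('HTTP_AUTHORIZATION', '')
    scheme, _, payload = auth.partition(' ')
    if scheme.lower() != 'basic' or not payload:
        # No credentials yet: ask the browser to prompt for them.
        start_response('401 Unauthorized', [
            ('WWW-Authenticate', 'Basic realm="test"'),
            ('Content-Type', 'text/plain'),
        ])
        return [b'Authentication required']

    # The base64 payload is ASCII; the bytes inside are whatever the
    # browser chose to send.
    raw = base64.b64decode(payload)
    username, _, _password = raw.partition(b':')

    print(repr(username))                  # show on the terminal
    body = repr(username).encode('ascii')  # and in the response
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

It could be served with wsgiref.simple_server.make_server('', 8000, application).serve_forever() and then visited with each browser under test.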

Then I used some browsers to try to send a username with non-ASCII
characters.


When I try with simple characters in the iso-8859-1 charset, things
work well; the data is encoded using this charset.

However, when I try to use a character outside that charset, like the
Euro sign, there are problems.

Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a
'\xac'

I don't know where \xac comes from, but it is the last byte in the utf-8
encoded Euro: '\xe2\x82\xac'


Internet Explorer 6.0 sends me a
'\x80'
and this is the Euro character encoded using cp1252 (and I suspect
that it always uses this encoding, instead of iso-8859-1).

Unfortunately I can not test with IE 7 and 8.



With a browser working on a terminal, like lynx, things get worse.
If I enter as user name the string "àè", lynx sends me
'\xc3\xa0\xc3\xa8'

This happens in a GNOME terminal, with an it_IT.utf8 locale.

wget and curl do the same.


Can someone else reproduce this?



Thanks   Manlio


Re: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec

2009-12-03 Thread Manlio Perillo
James Y Knight wrote:
> I move to bless mod_wsgi's definition of WSGI 1.1 [1]
> [...]
> 
> [1] http://code.google.com/p/modwsgi/wiki/SupportForPython3X

Hi.

Just a few questions.

It is true that HTTP headers can be encoded assuming latin-1; and they
can be encoded using PEP 383.

However what about URI (that is, for PATH_INFO and the like)?
For URI (if I remember correctly) the suggested encoding is UTF-8, so
URLS should be decoded using

  url.decode('utf-8', 'surrogateescape')

Is this correct?
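For comparison, a sketch of what that decoding does in Python 3, where the raw path is a bytes object. The sample paths are made up; the first is valid UTF-8 and the second is a latin-1 byte sequence that is not:

```python
valid = b'/caf\xc3\xa9'    # "é" encoded as UTF-8
invalid = b'/caf\xe9'      # "é" encoded as ISO-8859-1

# Valid UTF-8 decodes to the expected text.
text_ok = valid.decode('utf-8', 'surrogateescape')

# Undecodable bytes become lone surrogates (PEP 383) instead of errors.
text_esc = invalid.decode('utf-8', 'surrogateescape')

# Encoding with the same error handler restores the original bytes.
restored = text_esc.encode('utf-8', 'surrogateescape')
print(repr(text_ok), repr(text_esc), repr(restored))
```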


Now another question.
Let's consider the `wsgiref.util.application_uri` function

def application_uri(environ):
    url = environ['wsgi.url_scheme']+'://'
    from urllib.parse import quote

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME') or '/')
    return url


There is a potential problem, here, with the quote function.
This function does the following:

def quote(string, safe='/', encoding=None, errors=None):
    if isinstance(string, str):
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        string = string.encode(encoding, errors)

This means that if we use surrogateescape, the information about the
original bytes is lost here.
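A sketch of the interaction, as observed on a current Python 3 (behaviour in 3.1 may differ in detail). With the default strict error handler the call fails on an escaped byte rather than silently dropping it; passing the same error handler to quote preserves the byte. The sample SCRIPT_NAME is made up:

```python
from urllib.parse import quote

# A SCRIPT_NAME containing a non-UTF-8 byte, decoded with surrogateescape.
script_name = b'/caf\xe9'.decode('utf-8', 'surrogateescape')

# Default errors='strict': the lone surrogate cannot be encoded.
try:
    quote(script_name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Same handler on the way out: the original byte is percent-encoded.
fixed = quote(script_name, errors='surrogateescape')
print(strict_failed, fixed)
```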

This can be easily fixed by changing the application_uri function, but
this also means that a WSGI application will not work with Python 3.1.x.


Finally, a question about cookies.
Cookie data SHOULD be transparent to the server/gateway; however WSGI is
going to assume that data is encoded in latin-1.

I don't know what the HTTP/Cookie spec says about this.
However, from a WSGI application point of view, the cookie data can, as
an example, contain some text encoded in UTF-8; this means that the
application must first encode the data:

  cookie_bytes = cookie.encode('latin-1', 'surrogateescape')

and then decode it using UTF-8:

  my_cookie_data = cookie_bytes.decode('utf-8')


This is a bit unreasonable, but I don't know if this is a common
practice (I do this, just to make an example).
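Put together as a runnable sketch, with a hypothetical helper name and a made-up cookie value. Because a latin-1-decoded string only contains code points below 256, a plain 'latin-1' encode is enough to get the bytes back:

```python
def redecode_cookie(latin1_value, encoding='utf-8'):
    """Hypothetical helper: undo the gateway's latin-1 decoding."""
    return latin1_value.encode('latin-1').decode(encoding)

# Simulate a UTF-8 cookie arriving through a latin-1 gateway.
wire_bytes = 'name=na\u00efve'.encode('utf-8')
environ_value = wire_bytes.decode('latin-1')

my_cookie_data = redecode_cookie(environ_value)
print(my_cookie_data)
```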



Manlio Perillo