Re: [Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

2014-01-04 Thread martin


Quoting Chris Angelico:

> I'm pretty sure that server is in violation of the spec, so all bets
> are off as to what any other server will do. If you know you're
> dealing with this one server, you can probably hack around this, but I
> don't think it belongs in core code. Unless, of course, I'm completely
> wrong about the spec, or if there's a de facto spec that lots of
> servers follow, in which case maybe it would be worth doing.


It would be possible to support this better by using "ascii" with
"surrogateescape" when receiving the redirect, and using the same
for all URLs coming into http.client. This would implement a
best-effort strategy at preserving the bogus URL, and still maintain
the notion that URLs are text (with the other path being to also
allow bytes as URLs, and always parsing Location as bytes).
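
As a rough illustration of the "ascii" plus "surrogateescape" idea (a
sketch only, not what http.client does today; the variable names are
made up for this example), the bogus bytes survive a decode/encode
round trip unchanged:

```python
# Bytes as they arrive on the wire (the Location value reported in this thread).
raw_location = (
    b"http://www.starbucks.com/store/158/at/karntnerstrasse/"
    b"k\xc3\xa4rntnerstrasse-49-vienna-9-1010"
)

# Each non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF)
# instead of raising UnicodeDecodeError.
as_text = raw_location.decode("ascii", "surrogateescape")

# Encoding with the same error handler restores the original bytes exactly,
# so the URL can be sent back to the server as it was received.
round_tripped = as_text.encode("ascii", "surrogateescape")
assert round_tripped == raw_location
```

The resulting str is still "text" as far as the API is concerned, but it
is not printable or encodable as UTF-8 without the same error handler.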

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

2014-01-04 Thread Xavier Morel

On 2014-01-04, at 17:24, Chris Angelico wrote:

> On Sun, Jan 5, 2014 at 2:36 AM, Hugo G. Fierro wrote:
>> I am trying to download an HTML document. I get an HTTP 301 (Moved
>> Permanently) with a UTF-8 encoded Location header and http.client decodes it
>> as iso-8859-1. When there's a non-ASCII character in the redirect URL then I
>> can't download the document.
>> 
>> In client.py def parse_headers() I see the call to decode('iso-8859-1'). My
>> personal hack is to use whatever charset is defined in the Content-Type
>> HTTP header (utf8) or fall back to iso-8859-1.
>>
>> At this point I am not sure where/how a fix should occur, so I thought I'd
>> run it by you in case I should file a bug. Note that I don't use http.client
>> directly, but through the python-requests library.
> 
> I'm not 100% sure, but I believe non-ASCII characters are outright
> forbidden in a Location: header. It's possible that an RFC2047 tag
> might be used, but my reading of RFC2616 is that that's only for text
> fields, not for Location. These non-ASCII characters ought to be
> percent-encoded, and anything doing otherwise is buggy.

That is also my reading: the Location field’s value is defined as an
absoluteURI (RFC2616, section 14.30):

> Location = "Location" ":" absoluteURI

section 3.2.1 indicates that "absoluteURI" (and other related
concepts) are used as defined by RFC 2396 "Uniform Resource
Identifiers (URI): Generic Syntax", that is:

> absoluteURI = scheme ":" ( hier_part | opaque_part )

both "hier_part" and "opaque_part" consist of some punctuation
characters, "escaped" and "unreserved". "escaped" is %-encoded
characters which leaves "unreserved" defined as "alphanum | mark".
"mark" is more punctuation and "alphanum" is ASCII's alphanumeric
ranges.

Furthermore, although RFC 3986 moves some stuff around and renames some
production rules, it seems to have kept this limitation.
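
To make the grammar argument concrete, here is a small sketch (the
helper names are hypothetical, and the "safe" set is my own choice of
URI delimiters, not something taken from the RFCs verbatim) that checks
whether a Location value is pure ASCII and percent-encodes it when it
is not:

```python
from urllib.parse import quote


def is_ascii_only(value: str) -> bool:
    """Return True if every character of value is in the ASCII range."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def percent_encode_non_ascii(url: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8, leaving URI
    delimiters and already-escaped %XX sequences untouched."""
    return quote(url, safe="/:?#[]@!$&'()*+,;=%")


bogus = "http://www.starbucks.com/store/158/at/karntnerstrasse/k\u00e4rntnerstrasse-49-vienna-9-1010"
fixed = percent_encode_non_ascii(bogus)
```

Note that quote() also escapes a few ASCII characters (spaces, for
example), which is what one would want for a URL anyway.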


Re: [Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

2014-01-04 Thread Chris Angelico

On Sun, Jan 5, 2014 at 2:36 AM, Hugo G. Fierro wrote:
> I am trying to download an HTML document. I get an HTTP 301 (Moved
> Permanently) with a UTF-8 encoded Location header and http.client decodes it
> as iso-8859-1. When there's a non-ASCII character in the redirect URL then I
> can't download the document.
>
> In client.py def parse_headers() I see the call to decode('iso-8859-1'). My
> personal hack is to use whatever charset is defined in the Content-Type
> HTTP header (utf8) or fall back to iso-8859-1.
>
> At this point I am not sure where/how a fix should occur, so I thought I'd
> run it by you in case I should file a bug. Note that I don't use http.client
> directly, but through the python-requests library.

I'm not 100% sure, but I believe non-ASCII characters are outright
forbidden in a Location: header. It's possible that an RFC2047 tag
might be used, but my reading of RFC2616 is that that's only for text
fields, not for Location. These non-ASCII characters ought to be
percent-encoded, and anything doing otherwise is buggy.

Confirming what you're seeing with a plain socket:

>>> import socket
>>> s=socket.socket()
>>> s.connect(("www.starbucks.com",80))
>>> s.send(b'GET /store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010 HTTP/1.1\r\nHost: www.starbucks.com\r\nAccept-Encoding: identity\r\n\r\n')
136
>>> s.recv(1024)
b'HTTP/1.1 301 Moved Permanently\r\nContent-Type: text/html;
charset=UTF-8\r\nLocation:
http://www.starbucks.com/store/158/at/karntnerstrasse/k\xc3\xa4rntnerstrasse-49-vienna-9-1010\r\n
 '

I'm pretty sure that server is in violation of the spec, so all bets
are off as to what any other server will do. If you know you're
dealing with this one server, you can probably hack around this, but I
don't think it belongs in core code. Unless, of course, I'm completely
wrong about the spec, or if there's a de facto spec that lots of
servers follow, in which case maybe it would be worth doing.
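
For anyone who does need to hack around this one server, a sketch of
the workaround (this assumes the header really was UTF-8 on the wire and
was decoded as iso-8859-1, as reported above; the variable names are
just for illustration):

```python
# The path fragment as sent by the server (UTF-8 bytes).
wire_bytes = b"k\xc3\xa4rntnerstrasse"

# Decoding as iso-8859-1 never fails, but turns the two UTF-8 bytes
# into two separate Latin-1 characters (mojibake).
as_latin1 = wire_bytes.decode("iso-8859-1")

# iso-8859-1 maps bytes 0-255 one-to-one onto code points, so encoding
# again recovers the original bytes, which can then be decoded as UTF-8.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
```

The second decode raises UnicodeDecodeError if the server actually sent
something other than UTF-8, so real code would need to handle that.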

ChrisA


[Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers

2014-01-04 Thread Hugo G. Fierro

Hi Python devs,

I am trying to download an HTML document. I get an HTTP 301 (Moved
Permanently) with a UTF-8 encoded Location header and http.client decodes
it as iso-8859-1. When there's a non-ASCII character in the redirect URL
then I can't download the document.

In client.py def parse_headers() I see the call to decode('iso-8859-1'). My
personal hack is to use whatever charset is defined in the Content-Type
HTTP header (utf8) or fall back to iso-8859-1.
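
Roughly, the hack looks like the sketch below. The function name and
structure are only illustrative and are not the actual change to
client.py. Keep in mind that the Content-Type charset describes the
response body, so applying it to header values is a guess, not
something the spec promises:

```python
from email.message import Message


def decode_header_value(raw: bytes, content_type: str) -> str:
    """Decode a raw header value using the charset named in the
    Content-Type value, falling back to iso-8859-1."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or "iso-8859-1"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: iso-8859-1 never fails.
        return raw.decode("iso-8859-1")


location = decode_header_value(
    b"/store/158/at/karntnerstrasse/k\xc3\xa4rntnerstrasse-49-vienna-9-1010",
    "text/html; charset=UTF-8",
)
```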

At this point I am not sure where/how a fix should occur, so I thought I'd
run it by you in case I should file a bug. Note that I don't use
http.client directly, but through the python-requests library.

I include some code to reproduce the problem below.

Cheers,

Hugo

-

#!/usr/bin/env python3

# Trying to replicate what wget does with a 301 redirect:
# wget --server-response www.starbucks.com/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010

import http.client
import urllib.parse

s2='/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010'
s3='http://www.starbucks.com/store/158/at/karntnerstrasse/k%C3%A4rntnerstrasse-49-vienna-9-1010'

conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s2)
r = conn.getresponse()
print('Location', r.headers.get('Location'))
print('Expected', urllib.parse.unquote(s3))
assert r.status == 301
assert r.headers.get('Location') == urllib.parse.unquote(s3), \
    'decoded as iso-8859-1 instead of utf8'

conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s3)
r = conn.getresponse()
assert r.status == 200