[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2019-02-20 Thread Rémi Lapeyre

Change by Rémi Lapeyre:


--
nosy: +remi.lapeyre

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2019-02-20 Thread Diego Rojas


Change by Diego Rojas:


--
components: +Extension Modules -Library (Lib), Unicode
type: enhancement -> behavior
versions: +Python 3.4, Python 3.5, Python 3.6, Python 3.8




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2017-01-29 Thread R. David Murray

R. David Murray added the comment:

I believe the last time this subject was discussed the conclusion was that we 
really needed a full IRI module that conformed to the relevant RFCs, and that 
putting something on pypi would be one way to get there.  

Someone should research the existing packages.  It might be that we need 
something simpler than what exists, but whatever we do should be informed by 
what exists, I think.

--
nosy: +r.david.murray




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2017-01-28 Thread Martin Panter

Martin Panter added the comment:

I’m not really an expert on non-ASCII URLs / IRIs. Maybe it is obvious to other 
people that this is a good general implementation, but for me to thoroughly 
review it I would need time to research the relevant RFCs, other 
implementations, suitability for the URL schemes listed at 
, security implications, 
etc.

One problem with using urlunsplit() is it would strip empty URL 
components, e.g. quote_iri("http://example/file#") -> "http://example/file". 
See Issue 22852. This is highlighted by the file:///[. . .] → file:/[. . .] 
test case.
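
As a rough illustration of the stripping described above, here is a minimal sketch using only urllib.parse from the standard library. The helper name round_trip is made up for this example and is not part of the patch under review.

```python
from urllib.parse import urlsplit, urlunsplit


def round_trip(url):
    """Split a URL and join it again (hypothetical helper for illustration)."""
    return urlunsplit(urlsplit(url))


# An empty fragment or query is parsed as '' and then omitted when the
# parts are joined again, so the trailing '#' or '?' is lost.
print(round_trip("http://example/file#"))
print(round_trip("http://example/file?"))
```

Any function built on a split-then-join round trip would presumably inherit this behaviour unless it tracks the delimiters separately.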

FYI Martin Panter and vadmium are both just me, no need to get too excited. :) 
I just updated my settings for Rietveld (code review), so hopefully that is 
more obvious now.

--




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2017-01-27 Thread Andreas Åkerlund

Andreas Åkerlund added the comment:

Changed the patch after pointers from vadmium.
And quote_uri is changed to quote_iri as martin.panter thought it was more 
appropriate.

--
Added file: http://bugs.python.org/file46440/issue3991_2017-01-27.diff




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2017-01-18 Thread Martin Panter

Martin Panter added the comment:

Issue 9679: Focusses on encoding just the DNS name
Issue 20559: Maybe a duplicate, or opportunity for better documentation or 
error message as a bug fix?

Andreas’s patch just proposes a new function called quote_uri(). It would need 
documentation. We already have a quote() and quote_plus() function. Since it 
sounds like this is for IRIs (https://tools.ietf.org/html/rfc3987), would it be 
more appropriate to call it quote_iri()?

See revision cb09fdef19f5, especially the quote(safe=...) parameter, for how I 
avoided the double encoding problem.
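
For readers unfamiliar with the double encoding problem, a small sketch follows. It only demonstrates the documented safe parameter of urllib.parse.quote() and does not claim to reproduce what the referenced revision does; the sample path is invented.

```python
from urllib.parse import quote

# A path that is already partly percent-encoded (sample value).
already_quoted = "/caf%C3%A9 menu/ñ"

# By default '%' is not safe, so existing escapes get escaped again.
double = quote(already_quoted)

# Treating '%' as safe leaves existing escapes alone and quotes the rest.
single = quote(already_quoted, safe="/%")

print(double)
print(single)
```

The trade-off is that a literal '%' that is not part of an escape sequence is also left untouched, so this is only appropriate when the input is expected to be partly quoted already.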

--
dependencies: +unicode DNS names in urllib, urlopen
nosy: +martin.panter
stage: test needed -> patch review
versions: +Python 3.7 -Python 3.2, Python 3.3, Python 3.4




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2014-05-28 Thread Graham Oliver

Graham Oliver added the comment:

hello
I came across this bug when using 'ā' in a url
To get around the problem I used the 'URL encoded' version '%C4%81' instead of 
'ā'
See this page
http://www.charbase.com/0101-unicode-latin-small-letter-a-with-macron
I tried using the 'puny code' for 'ā' 'xn--yda' but that didn't work
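
A short sketch of the two encodings mentioned above, assuming the character appears in the path in one case and in a host name in the other. Punycode (via IDNA) applies to DNS labels, while the path takes UTF-8 percent-escapes, which would explain why the Punycode form did not work in the path. The host name "ā.example" is made up.

```python
from urllib.parse import quote

# Path component: percent-encode the UTF-8 bytes of the character.
path_escape = quote("ā")

# Host name: the built-in 'idna' codec produces the 'xn--' form per label.
host_label = "ā.example".encode("idna")

print(path_escape)
print(host_label)
```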

--
nosy: +Graham.Oliver




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2013-02-25 Thread STINNER Victor

Changes by STINNER Victor:


--
nosy: +haypo




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2013-02-23 Thread Andreas Åkerlund

Andreas Åkerlund added the comment:

This is a patch against 3.2 adding urllib.parse.quote_uri

It splits the URI into five parts (protocol, authentication, hostname, port and 
path), then runs urllib.parse.quote on the path and encodes the hostname to 
Punycode if it is not ASCII.

It's not perfect, but should be usable in most cases.
I created some test cases as well.
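
The approach described above could be sketched roughly as follows. This is not the attached patch: the function name quote_full_url, the choice of safe characters, and the handling of the query and fragment are all assumptions made for illustration, and IPv6 literals are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_full_url(url):
    """Quote a whole URL piece by piece (hypothetical sketch, not the patch)."""
    parts = urlsplit(url)

    # Host: convert to the IDNA 'xn--' form only when it is not plain ASCII.
    host = parts.hostname or ""
    if not host.isascii():
        host = host.encode("idna").decode("ascii")

    # Rebuild the network location from userinfo, host and port.
    netloc = host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = quote(parts.username, safe="%")
        if parts.password is not None:
            userinfo += ":" + quote(parts.password, safe="%")
        netloc = f"{userinfo}@{netloc}"

    # '%' is kept safe so input that is already escaped is not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+/?")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(quote_full_url("http://localhost/ñ.html"))
print(quote_full_url("http://bücher.example/straße?q=ä"))
```

Because the sketch rejoins the parts with urlunsplit(), it drops empty components such as a trailing '#'.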

--
nosy: +thezulk
Added file: http://bugs.python.org/file29194/issue3991.diff




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2012-10-01 Thread Ezio Melotti

Changes by Ezio Melotti:


--
versions: +Python 3.2, Python 3.3, Python 3.4 -Python 3.0




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2009-04-22 Thread Daniel Diniz

Changes by Daniel Diniz:


--
priority:  -> normal




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2009-02-12 Thread Daniel Diniz

Changes by Daniel Diniz:


--
nosy: +orsenthil




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2009-02-11 Thread Daniel Diniz

Changes by Daniel Diniz:


--
components: +Library (Lib)
keywords: +easy
stage:  -> test needed




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2009-02-08 Thread Ezio Melotti

Changes by Ezio Melotti:


--
nosy: +ezio.melotti




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2009-02-08 Thread Daniel Diniz

Daniel Diniz added the comment:

I think Toshio's use case is important enough to deserve a fix (patch
attached) or a special-cased error message. IMO, newbies trying to fix
failures from urlopen may have a hard time figuring out the maze:

urlopen -> _opener -> open -> _open -> _call_chain -> http_open -> 
do_open (and that's before leaving urllib!).

>>> from urllib.request import urlopen
>>> url = 'http://localhost/ñ.html'
>>> urlopen(url).read()
Traceback (most recent call last):
[...]
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in
position 5: ordinal not in range(128)


If the newbie isn't completely lost by then, how about:
>>> from urllib.parse import quote
>>> urlopen(quote(url)).read()
Traceback (most recent call last):
[...]
ValueError: unknown url type: http%3A//localhost/%C3%B1.html
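
To spell out why the second attempt fails: quote() escapes the ':' after the scheme, so the result no longer looks like an http URL. A minimal sketch of quoting only the path instead, using urlsplit() and the named tuple's _replace(); nothing here is taken from the attached patch.

```python
from urllib.parse import quote, urlsplit

url = "http://localhost/ñ.html"

# Quoting the whole string also escapes the ':' that ends the scheme.
whole = quote(url)

# Quoting only the path component keeps the scheme and host intact.
parts = urlsplit(url)
fixed = parts._replace(path=quote(parts.path)).geturl()

print(whole)
print(fixed)
```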

--
keywords: +patch
nosy: +ajaksu2
Added file: http://bugs.python.org/file12986/non_ascii_path.diff




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-30 Thread Toshio Kuratomi

Toshio Kuratomi <[EMAIL PROTECTED]> added the comment:

Oh, that's cool.  I've been fine with this being a request for a needed
function to quote and unquote full urls rather than a bug in urlopen().

I think IRIs are a distraction here, though.  The RFC for IRIs even
says that specifications that call for URIs and do not mention IRIs
should not take IRIs.  So there's definitely a need for a function to
quote a full URI.




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-30 Thread Bill Janssen

Bill Janssen <[EMAIL PROTECTED]> added the comment:

I'm not concerned about any example inputs.  I was just trying to
explain why this isn't a bug.

On the other hand, the IRI spec (RFC 3987) is another thing we might
try to implement for Python.

--
type:  -> feature request




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-30 Thread Toshio Kuratomi

Toshio Kuratomi <[EMAIL PROTECTED]> added the comment:

The purpose of such a function would be to take something that is not a
valid URI but 1) is a common way of expressing the way to get to the
resource and 2) follows certain rules, and turn that into something that
is a valid URI.  Non-ASCII strings in the path are a good example of
this since there is a well defined method to encode the strings into the
URL if you are given a character encoding to apply to it.

My first, naive thought is that if the input can be parsed by
urlparse(), then there is a very good chance that we have the ability to
escape the string properly.  Looking at the invalid URI that I gave, for
instance, if you additionally specified an encoding for the path element
there's no reason a function couldn't do the escaping.
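
A small sketch of what specifying an encoding for the path element could look like with the existing quote() function, which accepts an encoding argument. The path is the one from the original report; the variable names are made up.

```python
from urllib.parse import quote

path = "/u/½ñ.html"

# The same characters yield different escapes depending on the encoding
# used for the bytes, matching the two differently encoded file names.
as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="latin-1")

print(as_utf8)
print(as_latin1)
```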

What are example inputs that you are concerned about?  I'll see if I can
come up with code that works with them.




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-30 Thread Bill Janssen

Bill Janssen <[EMAIL PROTECTED]> added the comment:

It's not immediately clear to me how an auto-quote function can be
written; as you say (and as the URI spec points out), you have to take
a URL apart before quoting it, and you can't parse an invalid URL,
which is what the input is.

Best to think of this as a difference from 2.x.




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-29 Thread Toshio Kuratomi

Toshio Kuratomi <[EMAIL PROTECTED]> added the comment:

Possibly.  This is a change from python-2.x's urlopen() which escaped
the URL automatically, though.  I can see the case for having the user
call an escape function themselves instead of having urlopen() perform
the escape for them.  However, that function would need to be written.
(The present parse.quote() method only quotes correctly if only the path
component is passed; there's no function to take a full URL and quote it
appropriately.)

Without such a function, a whole lot of code bases will have to reinvent
the wheel creating functions to parse the path out, run it through
urllib.parse.quote() and then pass the result to urllib.request.urlopen().




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-29 Thread Bill Janssen

Bill Janssen <[EMAIL PROTECTED]> added the comment:

As I read RFC 2396,

1.5:  "A URI is a sequence of characters from a very
   limited set, i.e. the letters of the basic Latin alphabet, digits,
   and a few special characters."

2.4:  "Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below."

So your URL string is invalid.  You need to escape the characters properly.

(RFC 2396 is what the HTTP RFC cites as its authority on URLs.)

--
nosy: +janssen




[issue3991] urllib.request.urlopen does not handle non-ASCII characters

2008-09-28 Thread Toshio Kuratomi

New submission from Toshio Kuratomi <[EMAIL PROTECTED]>:

Tested on python-3.0rc1 -- Linux Fedora 9

I wanted to make sure that python3.0 would handle URLs in different
encodings.  So I created two files on an apache server which were named
½ñ.html.  One of the filenames was encoded in utf-8 and the other in
latin-1.  Then I tried the following::

from urllib.request import urlopen
url = 'http://localhost/u/½ñ.html'
urlopen(url.encode('utf-8')).read()

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 350, in open
    req.timeout = timeout
AttributeError: 'bytes' object has no attribute 'timeout'

The same thing happens if I give None for the two optional arguments
(data and timeout).

Next I tried using a raw Unicode string:

>>> urlopen(url).read()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 359, in open
    response = self._open(req, data)
  File "/usr/lib/python3.0/urllib/request.py", line 377, in _open
    '_open', req)
  File "/usr/lib/python3.0/urllib/request.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.0/urllib/request.py", line 1082, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.0/urllib/request.py", line 1068, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python3.0/http/client.py", line 843, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.0/http/client.py", line 860, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.0/http/client.py", line 751, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position
7-8: ordinal not in range(128)

So, in python-3.0rc1, this method is badly broken.
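
The last error can be reproduced without a server. The sketch below assumes the request line had the usual "METHOD selector HTTP/1.1" shape; the exact string that http.client built is not shown in the traceback, so request_line is an assumption.

```python
# Assumed shape of the request line that putrequest() tried to encode.
request_line = "GET /u/½ñ.html HTTP/1.1"

offending = None
span = None
try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # start/end delimit the characters that have no ASCII representation.
    span = (exc.start, exc.end)
    offending = request_line[exc.start:exc.end]

print(span, offending)
```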

--
components: Unicode
messages: 73982
nosy: a.badger
severity: normal
status: open
title: urllib.request.urlopen does not handle non-ASCII characters
versions: Python 3.0
