Re: Internationalized domain names not working with URLopen

2012-06-13 Thread Andrew Berg
On 6/13/2012 1:17 AM, John Nagle wrote:
 What does urllib2 want?  Percent escapes?  Punycode?
Looks like Punycode is the correct answer:
https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

I haven't tried it, though.
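
For reference, a minimal sketch of that conversion using Python's built-in `idna` codec. It is written so that it should run on both Python 2.7 and Python 3; the helper name `host_to_ascii` is made up for illustration and is not a library function.

```python
# Sketch: convert an internationalized hostname to its ASCII (Punycode) form.
# host_to_ascii is a hypothetical helper name, not part of the standard library.

def host_to_ascii(hostname):
    """Return the ASCII-compatible form of a Unicode hostname (no scheme, no path)."""
    # The built-in 'idna' codec converts each dot-separated label separately.
    return hostname.encode('idna').decode('ascii')

# The escapes below spell the test domain from the original question.
host = u'\u043f\u0440\u0438\u043c\u0435\u0440.\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435'
print(host_to_ascii(host))               # xn--e1afmkfd.xn--80akhbyknj4f
print(host_to_ascii(u'www.python.org'))  # plain ASCII names pass through unchanged
```

Only the hostname is passed in here; a full URL with scheme and path would need to be split first.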
-- 
CPython 3.3.0a3 | Windows NT 6.1.7601.17790


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Internationalized domain names not working with URLopen

2012-06-13 Thread Виталий Волков
The answer in this topic should help you solve the issue.

http://stackoverflow.com/questions/8152161/open-persian-url-domains-with-urllib2?answertab=active#tab-top


Regards.

2012/6/13 John Nagle na...@animats.com

 I'm trying to open

 http://пример.испытание http://xn--e1afmkfd.xn--80akhbyknj4f

 with

 urllib2.urlopen(s1)

 in Python 2.7 on Windows 7. This produces a Unicode exception:

 >>> s1
 u'http://\u043f\u0440\u0438\u043c\u0435\u0440.\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435'
 >>> fd = urllib2.urlopen(s1)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\python27\lib\urllib2.py", line 126, in urlopen
     return _opener.open(url, data, timeout)
   File "C:\python27\lib\urllib2.py", line 394, in open
     response = self._open(req, data)
   File "C:\python27\lib\urllib2.py", line 412, in _open
     '_open', req)
   File "C:\python27\lib\urllib2.py", line 372, in _call_chain
     result = func(*args)
   File "C:\python27\lib\urllib2.py", line 1199, in http_open
     return self.do_open(httplib.HTTPConnection, req)
   File "C:\python27\lib\urllib2.py", line 1168, in do_open
     h.request(req.get_method(), req.get_selector(), req.data, headers)
   File "C:\python27\lib\httplib.py", line 955, in request
     self._send_request(method, url, body, headers)
   File "C:\python27\lib\httplib.py", line 988, in _send_request
     self.putheader(hdr, value)
   File "C:\python27\lib\httplib.py", line 935, in putheader
     hdr = '%s: %s' % (header, '\r\n\t'.join([str(v) for v in values]))
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
 

 The HTTP library is trying to put the URL in the header as ASCII.  Why
 isn't urllib2 handling that?

 What does urllib2 want?  Percent escapes?  Punycode?

John Nagle


Re: Internationalized domain names not working with URLopen

2012-06-13 Thread Hemanth H.M
Well, not really! It does not work with '☃.net':

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1116, in do_open
    h = http_class(host, timeout=req.timeout) # will parse host:port
  File "/usr/lib/python2.6/httplib.py", line 661, in __init__
    self._set_hostport(host, port)
  File "/usr/lib/python2.6/httplib.py", line 686, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port:



-- 
*'I am what I am because of who we all are'*
h3manth.com http://www.h3manth.com
*-- Hemanth HM *


Re: Internationalized domain names not working with URLopen

2012-06-13 Thread Hemanth H.M
My bad, it worked. The http:// prefix has to be kept out of the string
that is encoded; only the snowman hostname should go through encode.
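
To illustrate the point, a small sketch that separates the hostname from the scheme before encoding. It assumes Python 3 (on 2.7 the same function lives in the `urlparse` module), and `encode_host_only` is a hypothetical helper name.

```python
# Sketch: encode only the hostname, never the "http://" prefix.
from urllib.parse import urlsplit

def encode_host_only(url):
    """Return (scheme, ascii_host) for a URL with a Unicode hostname."""
    parts = urlsplit(url)
    # parts.hostname excludes the scheme, any port and any user info.
    ascii_host = parts.hostname.encode('idna').decode('ascii')
    return parts.scheme, ascii_host

scheme, host = encode_host_only('http://\u2603.net')   # \u2603 is the snowman
print(scheme + '://' + host)    # http://xn--n3h.net

# Passing the whole URL to the codec instead would treat 'http://\u2603' as the
# first dot-separated label, so the scheme would not survive as a scheme.
```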



Re: Internationalized domain names not working with URLopen

2012-06-13 Thread John Nagle

On 6/12/2012 11:42 PM, Andrew Berg wrote:
 On 6/13/2012 1:17 AM, John Nagle wrote:
  What does urllib2 want?  Percent escapes?  Punycode?
 Looks like Punycode is the correct answer:
 https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

 I haven't tried it, though.

   This is Python bug #9679:

http://bugs.python.org/issue9679

It's been open for years, and the maintainers offer elaborate
excuses for not fixing the problem.

The socket module accepts Unicode domains, as does httplib.
But urllib2, which is a front end to both, is still broken.
It's failing when it constructs the HTTP headers.  Domains
in HTTP headers have to be in punycode.

The code on Stack Overflow doesn't really work right.  Only
the domain part of a URL should be converted to punycode.
Path, port, and query parameters need to be converted to
percent-encoding.  (It is unclear whether urllib2 or httplib
already does this.  The documentation doesn't say.)
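
A hedged sketch of the percent-encoding half of that rule, for the path and query only. It assumes Python 3 and assumes UTF-8 as the byte encoding for the escapes, which the thread does not specify; `quote_path_and_query` is a hypothetical helper name.

```python
# Sketch: percent-encode the non-ASCII parts of a path and a query string.
# Assumption: non-ASCII characters are encoded as UTF-8 before escaping.
from urllib.parse import quote

def quote_path_and_query(path, query):
    """Return (path, query) with non-ASCII and unsafe characters percent-encoded."""
    # safe= lists characters left untouched; keeping '%' avoids
    # double-encoding escapes that are already present.
    return quote(path, safe="/%"), quote(query, safe="=&%")

path, query = quote_path_and_query('/caf\u00e9', 'q=\u00e9&x=1')
print(path)     # /caf%C3%A9
print(query)    # q=%C3%A9&x=1
```

The hostname is deliberately not touched here, since it takes punycode rather than percent escapes.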

While HTTP content can be in various character sets, the
headers are currently required to be ASCII only, since the
headers have to be parsed before the character set of the
content is known.
(http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html)

Here's a workaround, for the domain part only.


#
#   idnaurlworkaround  --  workaround for Python defect 9679
#
import encodings.idna
import urlparse

PYTHONDEFECT9679FIXED = False   # Python defect #9679 - change when fixed

def idnaurlworkaround(url) :
    """
    Convert a URL to a form the currently broken urllib2 will accept.
    Converts the domain to punycode if necessary.
    This is a workaround for Python defect #9679.
    """
    if PYTHONDEFECT9679FIXED :  # if defect fixed
        return(url)             # use unmodified URL
    url = unicode(url)          # force to Unicode
    (scheme, accesshost, path, params,
        query, fragment) = urlparse.urlparse(url)   # parse URL
    if scheme == '' and accesshost == '' and path != '' :   # bare domain
        accesshost = path       # use path as access host
        path = ''               # no path
    labels = accesshost.split('.')  # split domain into sections (labels)
    # convert each label to punycode if necessary
    labels = [encodings.idna.ToASCII(w) for w in labels]
    accesshost = '.'.join(labels)   # reassemble domain
    url = urlparse.urlunparse((scheme, accesshost, path, params,
        query, fragment))       # reassemble url
    return(url)                 # return complete URL with punycode domain
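
One limitation worth noting: the network-location field returned by `urlparse` can also carry a port or user info, and in the function above those would be passed to `ToASCII` as part of a label. A possible variant that converts only the hostname is sketched below. It assumes Python 3, where `urllib.parse` replaces `urlparse`; `idna_netloc` is a hypothetical name, and IPv6 literals are not handled.

```python
# Sketch: convert only the hostname inside the netloc, leaving user info and port alone.
from urllib.parse import urlsplit, urlunsplit

def idna_netloc(url):
    """Return url with its hostname in ASCII form (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.hostname is None:          # no network location: nothing to convert
        return url
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host
    if parts.port is not None:
        netloc = '%s:%d' % (host, parts.port)   # re-attach an explicit port
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ':' + parts.password
        netloc = userinfo + '@' + netloc        # re-attach user info
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(idna_netloc('http://\u2603.net:8080/a?b=1#c'))   # http://xn--n3h.net:8080/a?b=1#c
```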

John Nagle