Re: Internationalized domain names not working with URLopen
On 6/13/2012 1:17 AM, John Nagle wrote:
> What does urllib2 want? Percent escapes? Punycode?

Looks like Punycode is the correct answer:
https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

I haven't tried it, though.

--
CPython 3.3.0a3 | Windows NT 6.1.7601.17790
--
http://mail.python.org/mailman/listinfo/python-list
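For reference, a minimal sketch of the ToASCII and ToUnicode operations that the linked page describes, using the `idna` codec that ships with CPython. It assumes Python 3 (the thread itself concerns urllib2 on Python 2.7), and the variable names are illustrative only.

```python
# ToASCII / ToUnicode round trip with the standard-library "idna" codec.
# Python 3 syntax; the names below are illustrative, not from the thread.
domain = "пример.испытание"

ascii_form = domain.encode("idna")        # ToASCII, applied label by label
unicode_form = ascii_form.decode("idna")  # ToUnicode, back to the original

print(ascii_form.decode("ascii"))
print(unicode_form == domain)
```

The ASCII form is what has to appear wherever only ASCII is allowed, such as an HTTP Host header.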
Re: Internationalized domain names not working with URLopen
The answer in this topic should help you solve the issue:
http://stackoverflow.com/questions/8152161/open-persian-url-domains-with-urllib2?answertab=active#tab-top

Regards.

2012/6/13 John Nagle na...@animats.com
> I'm trying to open
>
>   http://пример.испытание http://xn--e1afmkfd.xn--80akhbyknj4f
>
> with
>
>   urllib2.urlopen(s1)
>
> in Python 2.7 on Windows 7. This produces a Unicode exception:
>
> >>> s1
> u'http://\u043f\u0440\u0438\u043c\u0435\u0440.\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435'
> >>> fd = urllib2.urlopen(s1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\python27\lib\urllib2.py", line 126, in urlopen
>     return _opener.open(url, data, timeout)
>   File "C:\python27\lib\urllib2.py", line 394, in open
>     response = self._open(req, data)
>   File "C:\python27\lib\urllib2.py", line 412, in _open
>     '_open', req)
>   File "C:\python27\lib\urllib2.py", line 372, in _call_chain
>     result = func(*args)
>   File "C:\python27\lib\urllib2.py", line 1199, in http_open
>     return self.do_open(httplib.HTTPConnection, req)
>   File "C:\python27\lib\urllib2.py", line 1168, in do_open
>     h.request(req.get_method(), req.get_selector(), req.data, headers)
>   File "C:\python27\lib\httplib.py", line 955, in request
>     self._send_request(method, url, body, headers)
>   File "C:\python27\lib\httplib.py", line 988, in _send_request
>     self.putheader(hdr, value)
>   File "C:\python27\lib\httplib.py", line 935, in putheader
>     hdr = '%s: %s' % (header, '\r\n\t'.join([str(v) for v in values]))
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>
> The HTTP library is trying to put the URL in the header as ASCII.
> Why isn't urllib2 handling that?
>
> What does urllib2 want? Percent escapes? Punycode?
>
> John Nagle
Re: Internationalized domain names not working with URLopen
Well, not really! It does not work with '☃.net':

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1116, in do_open
    h = http_class(host, timeout=req.timeout) # will parse host:port
  File "/usr/lib/python2.6/httplib.py", line 661, in __init__
    self._set_hostport(host, port)
  File "/usr/lib/python2.6/httplib.py", line 686, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port:

On Wed, Jun 13, 2012 at 12:17 PM, Виталий Волков hash...@gmail.com wrote:
> The answer in this topic should help you solve the issue:
> http://stackoverflow.com/questions/8152161/open-persian-url-domains-with-urllib2?answertab=active#tab-top
>
> Regards.

--
*'I am what I am because of who we all are'*
h3manth.com http://www.h3manth.com
*-- Hemanth HM*
Re: Internationalized domain names not working with URLopen
My bad, it worked. The http:// prefix has to be left off the snowman domain before encoding.

On Wed, Jun 13, 2012 at 9:02 PM, Hemanth H.M hemanth...@gmail.com wrote:
> Well, not really! It does not work with '☃.net':
>
> httplib.InvalidURL: nonnumeric port:

--
*'I am what I am because of who we all are'*
h3manth.com http://www.h3manth.com
*-- Hemanth HM*
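The fix Hemanth describes, encoding only the host name and adding the scheme afterwards, might look like the following sketch. It assumes Python 3 and the standard `idna` codec; `ascii_host` and `snowman_url` are made-up names, and the exact call he used is not shown in the thread.

```python
# Encode only the host name, then prepend the scheme.
# Python 3 syntax; variable names are illustrative, not from the thread.
host = "\u2603.net"  # the snowman domain

ascii_host = host.encode("idna").decode("ascii")  # IDNA applies to host labels only
snowman_url = "http://" + ascii_host              # scheme is added after encoding

print(snowman_url)
```

Passing the whole string, scheme included, through the codec would treat "http://" plus the snowman as a single host label, which is not what the HTTP layer expects.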
Re: Internationalized domain names not working with URLopen
On 6/12/2012 11:42 PM, Andrew Berg wrote:
> On 6/13/2012 1:17 AM, John Nagle wrote:
> > What does urllib2 want? Percent escapes? Punycode?
> Looks like Punycode is the correct answer:
> https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode
>
> I haven't tried it, though.

This is Python bug #9679:

http://bugs.python.org/issue9679

It's been open for years, and the maintainers offer elaborate excuses for not fixing the problem.

The socket module accepts Unicode domains, as does httplib. But urllib2, which is a front end to both, is still broken. It's failing when it constructs the HTTP headers. Domains in HTTP headers have to be in punycode.

The code in stackoverflow doesn't really work right. Only the domain part of a URL should be converted to punycode. Path and query parameters need to be converted to percent-encoding. (Unclear if urllib2 or httplib does this already. The documentation doesn't say.)

While HTTP content can be in various character sets, the headers are currently required to be ASCII only, since the header has to be processed to determine the character encoding. (http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html)

Here's a workaround, for the domain part only.

#
#   idnaurlworkaround  --  workaround for Python defect 9679
#
import encodings.idna
import urlparse

PYTHONDEFECT9679FIXED = False  # Python defect #9679 - change when fixed

def idnaurlworkaround(url):
    """
    Convert a URL to a form the currently broken urllib2 will accept.
    Converts the domain to punycode if necessary.
    This is a workaround for Python defect #9679.
    """
    if PYTHONDEFECT9679FIXED:  # if defect fixed
        return url  # use unmodified URL
    url = unicode(url)  # force to Unicode
    (scheme, accesshost, path, params, query, fragment) = urlparse.urlparse(url)  # parse URL
    if scheme == '' and accesshost == '' and path != '':  # bare domain
        accesshost = path  # use path as access host
        path = ''  # no path
    labels = accesshost.split('.')  # split domain into sections (labels)
    labels = [encodings.idna.ToASCII(w) for w in labels]  # convert each label to punycode if necessary
    accesshost = '.'.join(labels)  # reassemble domain
    url = urlparse.urlunparse((scheme, accesshost, path, params, query, fragment))  # reassemble url
    return url  # return complete URL with punycode domain

John Nagle
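As a rough illustration of the split this message describes (punycode for the host, percent-encoding for path and query), here is a sketch for Python 3's `urllib.parse`. It is not the workaround above and not a fix for issue 9679; `idna_url` is a hypothetical name, and the sketch assumes a plain host with no user:password part and no IPv6 literal.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def idna_url(url):
    """Return url with an IDNA (punycode) host and percent-encoded path/query.

    Hypothetical helper: any user:password@ prefix is dropped and IPv6
    literals are not handled.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host:
        # Only the host goes through IDNA; the port is kept apart.
        host = host.encode("idna").decode("ascii")
    netloc = host
    if parts.port is not None:
        netloc = "%s:%d" % (host, parts.port)
    # "%" stays safe so already-escaped input is not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, netloc, path, query, parts.fragment))


print(idna_url("http://пример.испытание:8080/путь?q=1"))
```

An all-ASCII URL passes through unchanged, so the helper can be applied unconditionally before handing the URL to an HTTP client.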