karl added the comment:

Setting a user-agent string should be possible.
My guess is that the default library has been used by an abusive client (by
mistake or by intent), and the Wikimedia project has decided to blacklist the
client based on user-agent string sniffing.

The block appears to apply to any User-Agent string which matches

"Python-urllib" in UserAgentString

See below:

>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urllib')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 517, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Pythonurllib/3.3')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>> fobj
<http.client.HTTPResponse object at 0x101275850>
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Pyt-honurllib/3.3')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urllib')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Python-urlli')]
>>> fobj = opener.open('http://en.wikipedia.org/robots.txt')
>>> 

Being able to change the header might indeed be a good thing.
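
In the meantime, a possible workaround sketch: fetch robots.txt yourself with
a custom User-Agent and hand the lines to RobotFileParser.parse() instead of
calling read(), which sends the default (blacklisted) header. The name
'MyBot/1.0' below is just a placeholder:

import urllib.request
import urllib.robotparser

url = 'http://en.wikipedia.org/robots.txt'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(url)

# Fetch robots.txt with a custom User-Agent so the request is not
# rejected, then feed the lines to the parser.
req = urllib.request.Request(url, headers={'User-agent': 'MyBot/1.0'})
with urllib.request.urlopen(req) as fobj:
    rp.parse(fobj.read().decode('utf-8').splitlines())

print(rp.can_fetch('MyBot/1.0', 'http://en.wikipedia.org/wiki/Main_Page'))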

----------
nosy: +karlcow
