On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats the entire site as disallowed (as if robots.txt
had said "Disallow: /"). There's a writeup here (a rough workaround
is sketched below my sig):

http://infix.se/2006/05/17/robotparser

It may also be worth mentioning that if you're planning to crawl a
lot of Wikipedia pages, you're probably better off downloading a
database dump instead:

<http://download.wikimedia.org/>

(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).

-- 
filip salomonsson
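PS. If you do want to fetch individual pages, here's a rough sketch of
one way around it, assuming the Python 2 urllib2 and robotparser
modules; the User-Agent string and the example article URL are just
placeholders, not anything Wikipedia specifically asks for. The idea
is to fetch robots.txt yourself with a descriptive agent and hand the
lines to robotparser, so the parser never does its own fetch with the
default urllib agent (which is what gets the 403):

    import urllib2
    import robotparser

    # Fetch robots.txt with a descriptive User-Agent instead of the
    # default "Python-urllib/x.y", which Wikipedia rejects outright.
    req = urllib2.Request(
        "http://en.wikipedia.org/robots.txt",
        headers={"User-Agent": "MyCrawler/0.1 (contact address here)"})
    robots_txt = urllib2.urlopen(req).read()

    # Feed the lines to the parser directly; rp.read() would refetch
    # robots.txt with the default agent and hit the 403 again.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print rp.can_fetch(
        "MyCrawler",
        "http://en.wikipedia.org/wiki/Python_(programming_language)")

Then pass the same User-Agent header on the Request objects for the
pages you actually fetch.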