On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats the entire site as disallowed (as if robots.txt
had said "Disallow: /"). There's a writeup here (a rough workaround
is sketched below my sig):

http://infix.se/2006/05/17/robotparser

It may also be worth mentioning that if you're planning to crawl a
lot of Wikipedia pages, you're probably better off downloading a
database dump instead:

<http://download.wikimedia.org/>

(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).

-- 
filip salomonsson
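PS. If you do want to fetch individual pages, here's a rough sketch of
one way around it, assuming the Python 2 urllib2 and robotparser
modules; the User-Agent string and the example article URL are just
placeholders, not anything Wikipedia specifically asks for. The idea
is to fetch robots.txt yourself with a descriptive agent and hand the
lines to robotparser, so the parser never does its own fetch with the
default urllib agent (which is what gets the 403):

    import urllib2
    import robotparser

    # Fetch robots.txt with a descriptive User-Agent instead of the
    # default "Python-urllib/x.y", which Wikipedia rejects outright.
    req = urllib2.Request(
        "http://en.wikipedia.org/robots.txt",
        headers={"User-Agent": "MyCrawler/0.1 (contact address here)"})
    robots_txt = urllib2.urlopen(req).read()

    # Feed the lines to the parser directly; rp.read() would refetch
    # robots.txt with the default agent and hit the 403 again.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print rp.can_fetch(
        "MyCrawler",
        "http://en.wikipedia.org/wiki/Python_(programming_language)")

Then pass the same User-Agent header on the Request objects for the
pages you actually fetch.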