Filip Salomonsson wrote:
> On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
>> But there's something in there now that robotparser doesn't like.
>> Any ideas?
>
> Wikipedia denies _all_ access for the standard urllib user agent, and
> when the robotparser gets a 401 or 403 response when trying to fetch
> robots.txt, it is equivalent to "Disallow: *".
>
> http://infix.se/2006/05/17/robotparser

That explains it. It's an undocumented feature of "robotparser", as is
the 'errcode' variable. The documentation of "robotparser" is silent on
error handling (can it raise an exception?) and should be updated.
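A minimal sketch of that behaviour, written against the Python 2.x
robotparser module; the "SiteTruth/0.9" agent string and the Wikipedia
URLs are only placeholders, and the second half is just one possible
workaround, not anything the module documents:

    import robotparser
    import urllib2

    # Fetching robots.txt from a site that blocks the default urllib
    # user agent (as Wikipedia did at the time of this thread) makes
    # read() see a 403, and robotparser then disallows everything.
    rp = robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.read()

    # 'errcode' is set by read() but not mentioned in the docs; it
    # holds the HTTP status of the robots.txt fetch (200, 403, ...).
    print getattr(rp, "errcode", None)

    # After a 401/403 this prints False for every URL, regardless of
    # what the real robots.txt says.
    print rp.can_fetch("SiteTruth/0.9", "http://en.wikipedia.org/wiki/Python")

    # Possible workaround: fetch robots.txt yourself under a different
    # User-Agent and hand the lines to parse() instead of calling read().
    req = urllib2.Request("http://en.wikipedia.org/robots.txt",
                          headers={"User-Agent": "SiteTruth/0.9"})
    rp2 = robotparser.RobotFileParser()
    rp2.parse(urllib2.urlopen(req).read().splitlines())
    # Now can_fetch() reflects the actual rules in robots.txt for that
    # agent string, whatever they happen to be.
    print rp2.can_fetch("SiteTruth/0.9", "http://en.wikipedia.org/wiki/Python")

The workaround only changes how robots.txt is fetched; the rules it
contains are still parsed and honoured as usual.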
> It could also be worth mentioning that if you were planning on
> crawling a lot of Wikipedia pages, you may be better off downloading
> the whole thing instead: <http://download.wikimedia.org/>
> (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
> wiki markup to HTML).

This is for SiteTruth, the site rating system (see "sitetruth.com"), and
we never look at more than 21 pages per site. We're looking for the name
and address of the business behind the web site, and if we can't find it
after checking the 20 most obvious places, it's either not there or not
"prominently disclosed".

				John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list