In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> > 
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when the robotparser gets a 401 or 403 response when trying to fetch
> > robots.txt, it is equivalent to "Disallow: *".
> > 
> > http://infix.se/2006/05/17/robotparser
> 
>      That explains it.  It's an undocumented feature of "robotparser",
> as is the 'errcode' variable.  The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
> updated.

Hi John,
Robotparser is probably following the never-approved RFC for robots.txt, 
which is the closest thing there is to a standard. It says, "On server 
response indicating access restrictions (HTTP Status Code 401 or 403) a 
robot should regard access to the site completely restricted."
http://www.robotstxt.org/wc/norobots-rfc.html
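
For anyone who wants to spider Wikipedia anyway, here's a minimal sketch 
of one workaround (written with Python 3 module names, urllib.robotparser 
and urllib.request, rather than the old robotparser/urllib2 this thread 
is about; the agent string and URLs are just placeholders): fetch 
robots.txt yourself with a real User-Agent header and hand the text to 
the parser via parse(), so the stock urllib agent string never triggers 
the 403 in the first place.

    import urllib.request
    import urllib.robotparser

    AGENT = "MyCrawler/1.0 (+http://example.com/bot)"  # hypothetical crawler identity

    # Fetch robots.txt ourselves with an explicit User-Agent so the server
    # doesn't reject us for looking like the stock urllib client.
    req = urllib.request.Request(
        "http://en.wikipedia.org/robots.txt",
        headers={"User-Agent": AGENT},
    )
    with urllib.request.urlopen(req) as resp:
        robots_lines = resp.read().decode("utf-8", errors="replace").splitlines()

    # Hand the text to the standard parser instead of letting read() do the fetch.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)

    print(rp.can_fetch(AGENT, "http://en.wikipedia.org/wiki/Python_(programming_language)"))

The same idea works with the 2007-era standard library: build the request 
with urllib2.Request(url, headers={...}) and feed the lines to 
robotparser.RobotFileParser().parse().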

If you're interested, I have a replacement for the robotparser module 
that works a little better (IMHO) and that you might also find better 
documented. I'm using it in production code:
http://nikitathespider.com/python/rerp/

Happy spidering

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more