Zee,

> My sysadm refuses to change the robots.txt citing the following reason:
> 
> The moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl the site.
> 
> Are you saying there is no way to configure nutch to ignore robots.txt?

We had a similar situation.

We modified the parse-html plugin, adding a configurable flag
that controls whether robots.txt is honored.  Works great.
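A minimal sketch of that kind of gate, assuming a hypothetical
property name ("parser.ignore.robots") and class name -- these are
illustrative, not actual Nutch APIs:

```java
import java.util.Properties;

public class RobotsGate {
    private final boolean ignoreRobots;

    public RobotsGate(Properties conf) {
        // Default is false: honor robots.txt unless explicitly overridden.
        // The property name here is hypothetical, not a real Nutch key.
        this.ignoreRobots = Boolean.parseBoolean(
            conf.getProperty("parser.ignore.robots", "false"));
    }

    /** True if the URL may be fetched, given the robots.txt verdict. */
    public boolean isAllowed(boolean robotsPermits) {
        return ignoreRobots || robotsPermits;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("parser.ignore.robots", "true");
        RobotsGate gate = new RobotsGate(conf);
        System.out.println(gate.isAllowed(false)); // flag set, so allowed
    }
}
```

The point is that the default stays polite: only an explicit
configuration change disables the robots.txt check.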

JohnM

-- 
john mendenhall
j...@surfutopia.net
surf utopia
internet services
