Zee,

> My sysadm refuses to change the robots.txt citing the following reason:
> 
> The moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl the site.
> 
> Are you saying there is no way to configure nutch to ignore robots.txt?

We had a similar situation.

We modified the parse-html plugin, adding a configurable flag
that controls whether robots.txt is honored.  Works great.
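A minimal sketch of that kind of gate, assuming a hypothetical
property name ("parser.ignore.robots") and class name -- these are
illustrative, not actual Nutch APIs:

```java
import java.util.Properties;

public class RobotsGate {
    private final boolean ignoreRobots;

    public RobotsGate(Properties conf) {
        // Default is false: honor robots.txt unless explicitly overridden.
        // The property name here is hypothetical, not a real Nutch key.
        this.ignoreRobots = Boolean.parseBoolean(
            conf.getProperty("parser.ignore.robots", "false"));
    }

    /** True if the URL may be fetched, given the robots.txt verdict. */
    public boolean isAllowed(boolean robotsPermits) {
        return ignoreRobots || robotsPermits;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("parser.ignore.robots", "true");
        RobotsGate gate = new RobotsGate(conf);
        System.out.println(gate.isAllowed(false)); // flag set, so allowed
    }
}
```

The point is that the default stays polite: only an explicit
configuration change disables the robots.txt check.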

JohnM

-- 
john mendenhall
j...@surfutopia.net
surf utopia
internet services
