Hi, I want to crawl a website that denies access to all crawlers. The website is our own, so there is no issue with crawling it, but our sysadmin doesn't want to change robots.txt, for fear that once we allow one crawler we will attract many others impersonating it.
Is it possible to configure Nutch to ignore robots.txt? I set the Protocol.CHECK_ROBOTS property to false in nutch-site.xml (snippet below), but that doesn't seem to help. Any clues?
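
For reference, this is what I added to nutch-site.xml; I'm assuming here that Protocol.CHECK_ROBOTS maps to the protocol.plugin.check.robots key, so please correct me if that's wrong:

  <property>
    <!-- assuming this is the configuration key behind Protocol.CHECK_ROBOTS -->
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
  </property>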

Thanks, Zee