Hi, I want to crawl a website that denies access to all crawlers. The website is our own, so there is no issue with crawling it, but our sysadmin doesn't want to change robots.txt, for fear that once we allow one crawler we will attract many others impersonating it.
Is it possible to configure Nutch to ignore robots.txt? I set the Protocol.CHECK_ROBOTS property to false in nutch-site.xml (snippet below), but that doesn't seem to help. Any clues?
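
For reference, this is what I added to nutch-site.xml; I'm assuming here that Protocol.CHECK_ROBOTS maps to the protocol.plugin.check.robots key, so please correct me if that's wrong:

  <property>
    <!-- assuming this is the configuration key behind Protocol.CHECK_ROBOTS -->
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
  </property>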

Thanks, Zee