Hi,

I am trying to set the whitelist property in nutch-site.xml
as below:
<property>
  <name>robot.rules.whitelist</name>
  <value>test.org</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
  </description>
</property>

However, when I inspect the crawl data, I see that the files still have not been
crawled; their status is "blocked by robots.txt".

In hadoop.log I see "robots.txt whitelist not configured".

Is there anything else that needs to be done?

Regards
Girish