Hi,

I am trying to set the whitelist property in nutch-site.xml
as below:
<property>
  <name>robot.rules.whitelist</name>
  <value>test.org</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
  </description>
</property>

However, when I inspect the crawl data, I see that the files still have not been
crawled; their status is "blocked by robots.txt".

In hadoop.log I see "robots.txt whitelist not configured".

Is there anything else that needs to be done?

Regards
Girish