Hi, I am trying to set the whitelist property in nutch-site.xml as below:

  <property>
    <name>robot.rules.whitelist</name>
    <value>test.org</value>
    <description>Comma separated list of hostnames or IP addresses to
    ignore robot rules parsing for.</description>
  </property>
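For completeness, here is a sketch of the same property under the http.-prefixed key name — I am not sure which key my Nutch version actually reads, so treat http.robot.rules.whitelist as an assumption on my part:

```xml
<!-- Assumption: some Nutch versions may expect the key
     http.robot.rules.whitelist rather than robot.rules.whitelist. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>test.org</value>
  <description>Comma separated list of hostnames or IP addresses to
  ignore robot rules parsing for.</description>
</property>
```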
However, when I look at the crawl data, I still see that the files have not been crawled and that they have a status like "blocked by robots.txt". In hadoop.log I see "robots.txt whitelist not configured". Is there anything else that needs to be done?

Regards,
Girish