Re: Webcast : Apache Nutch on EMR

2015-09-26 Thread Julien Nioche
Hi Lewis
> What are your thoughts about making this part of the scrolling banner on the homepage?
A bit OTT I think. I need to dig up my Wiki credentials and add the video + blog entry on the documentation page.
> I think it is great.
Thanks mate
Julien
-- *Open Source Solutions

Re: Regarding whitelist for robots.txt

2015-09-26 Thread Sebastian Nagel
Hi Girish,
> in the hadoop.log i see "robots.txt whitelist not configured"
This means that the property is somehow not set properly. Shouldn't it be "http.robot.rules.whitelist", see below? Also make sure that the modified nutch-site.xml is deployed. If you modify it in conf/ you have to run

Regarding whitelist for robots.txt

2015-09-26 Thread Girish Rao
Hi, I am trying to set the whitelist property in nutch-site.xml as below:
<property>
  <name>robot.rules.whitelist</name>
  <value>test.org</value>
  <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description>
</property>
However, when I look at the crawl data, I still see that the files have not been crawled and they

Re: Regarding whitelist for robots.txt

2015-09-26 Thread Girish Rao
Hi Sebastian, Thanks! I had copied that from the wiki located at https://wiki.apache.org/nutch/WhiteListRobots . Once I changed it to http.robot.rules.whitelist, I see in the logs that test.org is whitelisted. However, on
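
For reference, a minimal sketch of how the corrected property from this thread might look in nutch-site.xml. The property name http.robot.rules.whitelist, the value test.org, and the description text come from the messages above. The surrounding <configuration> wrapper is the usual Hadoop-style layout and is an assumption here. The rest of the file is omitted.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the <configuration> wrapper is assumed; other properties are omitted. -->
<configuration>
  <property>
    <!-- Corrected name from Sebastian's reply; the name without the "http." prefix was not picked up. -->
    <name>http.robot.rules.whitelist</name>
    <!-- Hostname used in the original question. -->
    <value>test.org</value>
    <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for.</description>
  </property>
</configuration>
```

As Sebastian notes, the modified nutch-site.xml also has to be deployed before the change takes effect. According to the thread, the "robots.txt whitelist not configured" message in hadoop.log indicates that the property is not being read.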