Hi, I'm using Nutch 1.9 with Solr 4.9.1. I am trying to extract news articles. Nutch works for some sites, but for others the fetch fails with HTTP 403. This is the output when I run parsechecker:
bin/nutch parsechecker -dumpText http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
fetching: http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977
Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

With bin/crawl I get:

fetch of http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977 failed with: Http code=403, url=http://www.dnaindia.com/india/report-bjp-leader-files-complaint-against-aap-for-defaming-his-party-2073977

The regex filter entry I added for this site:

+^http://([a-z0-9]*\.)*dnaindia.com

nutch-default.xml has this default value:

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

Anything I am missing? For what reason am I still getting a failed fetch?

Ankit
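One thing worth noting: the http.robots.403.allow property quoted above only governs what happens when /robots.txt itself returns 403; in the log above, the 403 is returned for the article URL, so that setting does not apply. A common cause of per-page 403s is the server rejecting the crawler's User-Agent. A minimal sketch of overriding the agent string in conf/nutch-site.xml (the value "MyCrawler" and the email address are placeholders, not values from the original post):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Identify the crawler to the remote server; some sites
       return 403 for an empty or unfamiliar agent string. -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <!-- Optional contact info appended to the agent string. -->
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.com</value>
  </property>
</configuration>
```

This is only a diagnostic sketch: if the site still returns 403 with a descriptive agent set, the block is likely deliberate (IP- or bot-based) rather than a Nutch configuration issue.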

