Crawling sites with 403 Forbidden robots.txt
--------------------------------------------
Key: NUTCH-56
URL: http://issues.apache.org/jira/browse/NUTCH-56
Project: Nutch
Type: Improvement
Components: fetcher
Reporter: Andy Liu
Priority: Minor
Attachments: robots_403.patch
If a 403 (Forbidden) error is encountered when fetching a site's robots.txt
file, Nutch does not crawl any pages from that site. This behavior is
consistent with the recommendation of the robots exclusion protocol draft.
However, Google does crawl sites that behave this way, because most
webmasters of such sites are unaware of robots.txt conventions and do want
their sites to be crawled.
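The lenient policy this issue proposes could be sketched as follows. This is a minimal illustration, not the attached robots_403.patch or Nutch's actual fetcher API; the class and method names here (RobotsFetchPolicy, allowAll) are hypothetical:

```java
// Hedged sketch (not the actual patch): decide whether a whole site may be
// crawled based on the HTTP status returned when fetching /robots.txt.
public class RobotsFetchPolicy {

    /** Returns true if the entire site should be treated as crawlable. */
    public static boolean allowAll(int robotsStatus) {
        if (robotsStatus == 200) {
            // robots.txt exists; its rules must be parsed (not shown here),
            // so we cannot blanket-allow the site.
            return false;
        }
        // 404: no robots.txt, so crawling everything is uncontroversial.
        // 403: the strict reading treats the site as fully restricted; the
        // lenient behavior proposed in this issue (and used by Google) is
        // to crawl anyway, since such servers rarely intend to block robots.
        return robotsStatus == 404 || robotsStatus == 403;
    }

    public static void main(String[] args) {
        System.out.println(allowAll(403)); // lenient policy: crawl the site
        System.out.println(allowAll(404));
        System.out.println(allowAll(200));
    }
}
```

Under the strict policy Nutch currently implements, the 403 branch would instead return false.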