Crawling sites with 403 Forbidden robots.txt
--------------------------------------------

         Key: NUTCH-56
         URL: http://issues.apache.org/jira/browse/NUTCH-56
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Reporter: Andy Liu
    Priority: Minor
 Attachments: robots_403.patch

If a 403 (Forbidden) response is returned when Nutch requests a site's 
robots.txt file, Nutch does not crawl any pages from that site.  This 
behavior is consistent with the RFC recommendation for the robot exclusion 
protocol, which treats an inaccessible robots.txt as "deny all".  

However, Google does crawl such sites, because most webmasters whose servers 
return a 403 for robots.txt are simply unaware of robots.txt conventions and 
do want their site to be crawled.
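The proposed improvement amounts to changing which robots.txt status codes map 
to "allow all" versus "deny all".  The following is a minimal sketch of that 
decision, not the attached patch itself; the class and method names are 
hypothetical and the actual Nutch fetcher code differs:

```java
// Hypothetical sketch of the robots.txt status-code policy discussed in
// this issue; RobotsPolicy and allowAllOnStatus are illustrative names,
// not part of Nutch.
public class RobotsPolicy {

    /**
     * Decide whether a site may be crawled without restrictions, based on
     * the HTTP status code received for its /robots.txt.
     */
    public static boolean allowAllOnStatus(int status) {
        if (status == 200) {
            // robots.txt exists: parse it and obey its rules instead.
            return false;
        }
        if (status == 404) {
            // No robots.txt: conventionally treated as "allow all".
            return true;
        }
        if (status == 403) {
            // Pre-patch behavior: "deny all".  With the change proposed
            // here, a 403 is treated like a missing robots.txt, since most
            // such sites do want to be crawled.
            return true;
        }
        // Other errors (e.g. 5xx): stay conservative and do not crawl.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(allowAllOnStatus(403)); // true with this change
        System.out.println(allowAllOnStatus(500)); // false
    }
}
```

With the pre-patch behavior, the 403 branch would return false; the patch 
attached to this issue moves that case into the "allow all" category.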

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
