Sebastian Nagel created NUTCH-2573:
--------------------------------------

             Summary: Suspend crawling if robots.txt fails to fetch with 5xx status
                 Key: NUTCH-2573
                 URL: https://issues.apache.org/jira/browse/NUTCH-2573
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.14
            Reporter: Sebastian Nagel
             Fix For: 1.15


Fetcher should optionally (by default) suspend crawling for a configurable interval when fetching the robots.txt fails with a server error (HTTP status code 5xx, esp. 503), following [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes] (a rough configuration sketch follows the quote):
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow" 
of crawling. The request is retried until a non-server-error HTTP result code 
is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
retrying. To temporarily suspend crawling, it is recommended to serve a 503 
HTTP result code. Handling of a permanent server error is undefined.??
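As a rough sketch of how the option could be wired up via the Hadoop/Nutch configuration: the property names below (e.g. {{fetcher.robots.defer.visits}} and {{fetcher.robots.defer.visits.delay}}) are only placeholders to illustrate the idea, not decided yet.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DeferVisitsConfig {
  // Hypothetical property names, just to illustrate the idea.
  boolean deferVisits;   // suspend crawling on robots.txt 5xx (default: true)
  long deferVisitsDelay; // suspension interval in milliseconds

  public DeferVisitsConfig(Configuration conf) {
    deferVisits = conf.getBoolean("fetcher.robots.defer.visits", true);
    deferVisitsDelay = conf.getLong("fetcher.robots.defer.visits.delay", 60000L);
  }
}
{code}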

Crawler-commons robots rules already provide [isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--] to store this information (set from RobotRulesParser).
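A minimal sketch (not a patch) of how a fetcher thread could use this flag to decide whether and for how long to suspend a host's queue; the delay value would come from the configuration sketched above:

{code:java}
import crawlercommons.robots.BaseRobotRules;

public class RobotsDeferCheck {

  /**
   * Returns the number of milliseconds the fetcher should suspend the
   * queue of this host, or 0 if crawling may continue.
   * RobotRulesParser would call rules.setDeferVisits(true) when the
   * robots.txt fetch ended with a 5xx status (esp. 503).
   */
  public static long getSuspendInterval(BaseRobotRules rules, long deferVisitsDelay) {
    if (rules != null && rules.isDeferVisits()) {
      return deferVisitsDelay;
    }
    return 0L;
  }
}
{code}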



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
