[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477667#comment-17477667 ]
Hudson commented on NUTCH-2573:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #71 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/71/])
NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status (#724) (github: [https://github.com/apache/nutch/commit/f691baebc3c04c08ea500f4767e2decb88c30c70])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueues.java
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java

> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
>                 Key: NUTCH-2573
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2573
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> Fetcher should optionally (by default) suspend crawling for a configurable
> interval when fetching the robots.txt fails with a server error (HTTP status
> code 5xx, esp. 503), following [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow"
> of crawling. The request is retried until a non-server-error HTTP result code
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent
> retrying. To temporarily suspend crawling, it is recommended to serve a 503
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information (must be set from RobotRulesParser).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
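As a minimal sketch of the 5xx policy described above (not the actual NUTCH-2573 patch; class and method names are invented for illustration), the decision the fetcher has to make per robots.txt response could look like this:

```java
// Hypothetical sketch only -- illustrates the policy in the issue
// description, not the code committed in NUTCH-2573.
public class RobotsFetchPolicy {

    /**
     * Per Google's robots.txt spec and the draft robots.txt RFC:
     * a 5xx response to the robots.txt request is a temporary error,
     * so the crawler should treat the host as "full disallow" and
     * defer visits until a retry gets a non-5xx result.
     */
    public static boolean shouldDeferVisits(int robotsHttpStatus) {
        return robotsHttpStatus >= 500 && robotsHttpStatus < 600;
    }

    public static void main(String[] args) {
        // 503 Service Unavailable: suspend crawling of the host, retry later
        System.out.println(shouldDeferVisits(503)); // true
        // 404 Not Found: missing robots.txt means "allow all", no deferral
        System.out.println(shouldDeferVisits(404)); // false
    }
}
```

In crawler-commons this flag would be stored on the parsed rules (`BaseRobotRules.setDeferVisits(true)`, queried via `isDeferVisits()`), so the fetcher queues can suspend a host without re-fetching robots.txt on every item.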