[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778123#comment-17778123 ]
Hudson commented on NUTCH-2990:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #132 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/132/])
NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 (#779) (github: [https://github.com/apache/nutch/commit/ecdd19dbdd4424bf9b9bce206f23992140ee43fe])
* (edit) conf/nutch-default.xml
* (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java

> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching robots.txt, whereas RFC 9309
> recommends following at least five redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and, if so, try to
> read it from the cache.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
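
For illustration, the redirect handling described in the issue could be sketched roughly as below. This is not the code committed for NUTCH-2990: the class `RobotsRedirectSketch`, the `Response` type, the `fetcher` function, and the cache keyed by scheme and authority are all hypothetical names assumed for this sketch, and "different host" is interpreted here as different from the host of the initial request.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; not the Nutch implementation. */
class RobotsRedirectSketch {

  /** Hypothetical minimal HTTP response used by this sketch. */
  static final class Response {
    final int status;
    final String location; // redirect target, null unless 3xx
    final String body;     // robots.txt content, null unless 200

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  /** RFC 9309, section 2.3.1.2: follow at least five consecutive redirects. */
  static final int MAX_REDIRECTS = 5;

  /** Assumed cache key: lowercased scheme plus authority (host and optional port). */
  static String cacheKey(URI u) {
    return (u.getScheme() + "://" + u.getAuthority()).toLowerCase(Locale.ROOT);
  }

  /**
   * Returns the robots.txt content that applies to the initial URL,
   * or null if it is treated as unavailable.
   */
  static String fetchRobots(URI start, Function<URI, Response> fetcher,
      Map<String, String> cache) {
    URI current = start;
    for (int redirects = 0; ; redirects++) {
      Response r = fetcher.apply(current);
      if (r.status == 200) {
        return r.body; // rules apply in the context of the initial authority
      }
      boolean isRedirect = r.status >= 300 && r.status < 400 && r.location != null;
      if (!isRedirect || redirects == MAX_REDIRECTS) {
        return null; // not a redirect, or more than five consecutive redirects
      }
      URI target = current.resolve(r.location);
      boolean otherHost = !String.valueOf(target.getHost())
          .equalsIgnoreCase(String.valueOf(start.getHost()));
      // A redirect to another host's /robots.txt may already be cached.
      if ("/robots.txt".equals(target.getPath()) && otherHost) {
        String cached = cache.get(cacheKey(target));
        if (cached != null) {
          return cached;
        }
      }
      current = target;
    }
  }
}
```

With this structure, a chain of exactly five redirects still yields the final robots.txt, while a sixth consecutive redirect makes the sketch report the file as unavailable.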