[ https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754710#comment-17754710 ]
ASF GitHub Bot commented on NUTCH-2996:
---------------------------------------

sebastian-nagel opened a new pull request, #766:
URL: https://github.com/apache/nutch/pull/766

   Note: because NUTCH-2996 requires the upgrade to crawler-commons 1.4, all changes of NUTCH-2995 are included. Only bc5326d contains the changes for NUTCH-2996.

   - split and lowercase agent names (if multiple) at configuration time and pass as collection to SimpleRobotRulesParser
   - update RobotRulesParser command-line help
   - update unit tests to use new API


> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> --------------------------------------------------------------------
>
>                 Key: NUTCH-2996
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2996
>             Project: Nutch
>          Issue Type: Improvement
>          Components: robots
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser)
> introduces a new [API entry point to parse the robots.txt
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word
> user-agent product tokens, without the need to tokenize a (comma-separated)
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line]
> only if the new API method is used



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
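
The first item in the pull request's change list (splitting and lowercasing the agent names once, at configuration time) might look roughly like the sketch below. This is not the code from the pull request: the class name `AgentNames`, the method `parse`, and the assumption that the configured value is a comma-separated string are all hypothetical and only illustrate the idea, using the Java standard library alone.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Nutch or crawler-commons. */
final class AgentNames {

  private AgentNames() {}

  /**
   * Splits a comma-separated list of agent names, trims and lowercases each
   * entry, and drops empty entries and duplicates (first occurrence wins).
   */
  static Set<String> parse(String configured) {
    Set<String> names = new LinkedHashSet<>();
    if (configured == null) {
      return names;
    }
    for (String name : configured.split(",")) {
      // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
      String token = name.trim().toLowerCase(Locale.ROOT);
      if (!token.isEmpty()) {
        names.add(token);
      }
    }
    return names;
  }
}
```

A collection prepared this way, once per configuration, would then be handed to the `parseContent(String, byte[], String, Collection)` method linked in the issue description for every robots.txt, so the list of agent names does not have to be tokenized again each time.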