[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000972#comment-14000972 ]
Hudson commented on NUTCH-1752:
-------------------------------

FAILURE: Integrated in Nutch-nutchgora #1015 (See [https://builds.apache.org/job/Nutch-nutchgora/1015/])
NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1594071)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java

> cache robots.txt rules per protocol:host:port
> ---------------------------------------------
>
>                 Key: NUTCH-1752
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1752
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.8, 2.2.1
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1752-v1.patch, NUTCH-1752-v2.patch
>
>
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host"
> (before NUTCH-1031, caching was per "host" only). The caching should be per
> "protocol:host:port", because a request to a different port may deliver a
> different {{robots.txt}}.
> Applying robots.txt rules to a combination of host, protocol, and port is
> common practice:
> [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not
> mention this explicitly (it could be derived from the examples), but others do:
> * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and
> port needs its own robots.txt file"
> * [Google webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
> "The directives listed in the robots.txt file apply only to the host,
> protocol and port number where the file is hosted."

--
This message was sent by Atlassian JIRA
(v6.2#6252)
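
The "protocol:host:port" cache key described in the issue could be sketched roughly as follows. This is only an illustration, not the code committed to HttpRobotRulesParser: the class RobotsCacheKey and its cacheKey methods are hypothetical names, and falling back to the protocol's default port when a URL carries no explicit port is an assumption made here so that http://example.com/ and http://example.com:80/ share one cache entry.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Nutch. */
final class RobotsCacheKey {

  private RobotsCacheKey() {}

  /** Builds a "protocol:host:port" key for looking up cached robots.txt rules. */
  static String cacheKey(URL url) {
    // Protocol and host are case-insensitive, so normalize both.
    String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
    String host = url.getHost().toLowerCase(Locale.ROOT);
    // getPort() returns -1 when the URL has no explicit port.
    // Assumption: fall back to the protocol default (80 for http, 443 for https).
    int port = url.getPort();
    if (port == -1) {
      port = url.getDefaultPort();
    }
    return protocol + ":" + host + ":" + port;
  }

  /** Convenience overload that parses the URL string first. */
  static String cacheKey(String url) {
    try {
      return cacheKey(URI.create(url).toURL());
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
  }

  public static void main(String[] args) {
    // Same host on two ports yields two distinct keys, hence two cache entries.
    System.out.println(cacheKey("http://example.com/robots.txt"));
    System.out.println(cacheKey("http://example.com:8080/robots.txt"));
    System.out.println(cacheKey("https://example.com/robots.txt"));
  }
}
```

With a key of this shape, rules fetched from port 8080 are never applied to requests on port 80, which is the behavior the issue asks for.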