[ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330858 ]
Doug Cutting commented on NUTCH-98:
-----------------------------------

Where is there a specification of robots.txt that defines how 'allow' and 'disallow' lines interact? I can't even find anything that specifies the semantics of 'allow' lines at all!

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
>
> User-agent: *
> Disallow: /
> Allow: /rss
>
> The problem is that the isAllowed function takes the first rule that matches
> and incorrectly decides that URLs starting with "/rss" are Disallowed. The
> correct algorithm is to take the *longest* rule that matches. I will attach
> a patch that fixes this.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
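For illustration, here is a minimal sketch of the longest-match selection the reporter describes: instead of returning the verdict of the first rule whose prefix matches the path, pick the rule with the longest matching prefix. The class and method names below are hypothetical and do not reflect Nutch's actual RobotRulesParser API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of longest-match robots.txt rule selection;
// not Nutch's actual RobotRulesParser implementation.
public class LongestMatchRobotRules {

    // Maps a path prefix to its verdict: true = Allow, false = Disallow.
    private final Map<String, Boolean> rules = new LinkedHashMap<>();

    public void addRule(String prefix, boolean allowed) {
        rules.put(prefix, allowed);
    }

    // Take the longest matching prefix, not the first one encountered.
    // With no matching rule, the path is allowed by default.
    public boolean isAllowed(String path) {
        String best = null;
        for (String prefix : rules.keySet()) {
            if (path.startsWith(prefix)
                    && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null || rules.get(best);
    }

    public static void main(String[] args) {
        LongestMatchRobotRules r = new LongestMatchRobotRules();
        // Disallow: /
        // Allow: /rss
        r.addRule("/", false);
        r.addRule("/rss", true);
        System.out.println(r.isAllowed("/rss/feed.xml")); // true: "/rss" is the longest match
        System.out.println(r.isAllowed("/private.html")); // false: only "/" matches
    }
}
```

With first-match semantics, "/rss/feed.xml" would hit the "Disallow: /" rule first and be rejected; longest-match lets the more specific "Allow: /rss" win.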