[ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330858 ]

Doug Cutting commented on NUTCH-98:
-----------------------------------

Where is there a specification of robots.txt that defines how 'allow' and 
'disallow' lines interact?  I can't even find anything that specifies the 
semantics of 'allow' lines at all!

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches 
> and incorrectly decides that URLs starting with "/rss" are Disallowed.  The 
> correct algorithm is to take the *longest* rule that matches.  I will attach 
> a patch that fixes this.
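The longest-match selection the report describes can be sketched as below. This is a hypothetical illustration, not Nutch's actual RobotRulesParser (all names here are invented), and, per Doug's comment above, the original 1994/1996 robots.txt drafts did not define 'Allow' at all; longest-match is the convention later adopted by some crawlers.

```java
import java.util.List;

/** Hypothetical sketch of longest-match robots.txt rule selection. */
public class LongestMatchRobotRules {
    /** One 'Allow:' or 'Disallow:' line, reduced to a flag and a path prefix. */
    record Rule(boolean allow, String prefix) {}

    /**
     * Among all rules whose prefix matches the path, the longest prefix
     * wins; a path matched by no rule is allowed by default.
     */
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)
                    && (best == null || r.prefix.length() > best.prefix.length())) {
                best = r;
            }
        }
        return best == null || best.allow();
    }

    public static void main(String[] args) {
        // The example from the report: Disallow: /  then  Allow: /rss
        List<Rule> rules = List.of(
            new Rule(false, "/"),
            new Rule(true,  "/rss")
        );
        System.out.println(isAllowed(rules, "/rss/feed.xml")); // true: "/rss" is longer than "/"
        System.out.println(isAllowed(rules, "/private"));      // false: only "/" matches
    }
}
```

Under first-match semantics the "Disallow: /" line would shadow "Allow: /rss"; longest-match makes rule order irrelevant, which is the behavior the attached patch argues for.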

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
