[ https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1715:
-------------------------------

    Description: 
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines the agents from both 'http.agent.name' and 'http.robots.agents' and 
appends a wildcard (i.e. '*') at the end. This string is passed to 
crawler-commons while parsing the rules. The wildcard matches the 
(User-agent: *) group in the robots file if that group comes before any other 
matching rule, resulting in an allowed URL being robots-denied.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
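
The effect described above can be shown with a minimal, self-contained sketch. This is not the actual Nutch or crawler-commons code; the class, record, and method names below are hypothetical, and the model only assumes first-match group selection: groups are scanned in file order and the first group whose User-agent matches any configured agent name wins.

```java
import java.util.List;

public class WildcardMatchSketch {
    // Hypothetical, simplified model of a robots.txt group:
    // its User-agent value and whether it allows everything.
    record Group(String agent, boolean allowAll) {}

    // First-match selection: scan groups in file order and return the
    // first one whose User-agent equals any of the configured agent names.
    static Group selectGroup(List<Group> groups, List<String> agents) {
        for (Group g : groups) {
            for (String a : agents) {
                if (g.agent().equalsIgnoreCase(a)) {
                    return g;          // first matching group wins
                }
            }
        }
        return null;                   // no group matched
    }

    public static void main(String[] args) {
        // robots.txt with the catch-all group listed first:
        //   User-agent: *      / Disallow: /
        //   User-agent: mybot  / Allow: /
        List<Group> groups = List.of(
            new Group("*", false),
            new Group("mybot", true));

        // Agent list as the bug report describes Nutch building it:
        // configured agents plus a trailing '*'.
        Group chosen = selectGroup(groups, List.of("mybot", "*"));
        System.out.println(chosen.agent());   // prints "*": catch-all group wins

        // Without the appended wildcard, the bot-specific group is selected.
        chosen = selectGroup(groups, List.of("mybot"));
        System.out.println(chosen.agent());   // prints "mybot"
    }
}
```

With the trailing '*' in the agent list, the restrictive catch-all group is chosen whenever it appears before the bot-specific group, which is exactly how an allowed URL ends up robots-denied.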

  was:
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines the agents from both 'http.agent.name' and 'http.robots.agents' and 
appends a wildcard '*' at the end. This string is passed to crawler-commons 
while parsing the rules. The wildcard '*' added at the end gets matched with 
the first rule in the robots file, and thus results in URLs being robots-denied 
while the robots.txt actually allows them.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E


> RobotRulesParser adds additional '*' to the robots name
> -------------------------------------------------------
>
>                 Key: NUTCH-1715
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1715
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines the agents from both 'http.agent.name' and 'http.robots.agents' 
> and appends a wildcard (i.e. '*') at the end. This string is passed to 
> crawler-commons while parsing the rules. The wildcard matches the 
> (User-agent: *) group in the robots file if that group comes before any other 
> matching rule, resulting in an allowed URL being robots-denied.
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)