Tejas Patil created NUTCH-1715:
----------------------------------

             Summary: RobotRulesParser adds additional '*' to the robots name
                 Key: NUTCH-1715
                 URL: https://issues.apache.org/jira/browse/NUTCH-1715
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 2.2.1, 1.7
            Reporter: Tejas Patil
            Assignee: Tejas Patil
             Fix For: 2.3, 1.8


In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
it combines the agents from both 'http.agent.name' and 'http.robots.agents' 
and then appends a wildcard (*) at the end. This string is passed to 
crawler-commons when parsing the rules. The trailing wildcard (*) gets matched 
with the first rule in the robots file, so URLs are reported as robots denied 
even though robots.txt actually allows them.
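
To illustrate the effect, here is a minimal, self-contained Java sketch. It is 
not the Nutch or crawler-commons code: the class AgentStringSketch, the helpers 
buildAgentString and firstMatchingGroup, the agent names, and the sample 
robots.txt layout (a "User-agent: *" group listed before a "User-agent: mybot" 
group) are all assumptions made for illustration. The matcher is a deliberately 
simplified stand-in that picks the first group whose name appears in the agent 
string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AgentStringSketch {

    // Hypothetical helper: joins the value of http.agent.name with the
    // comma-separated names from http.robots.agents, optionally appending "*".
    public static String buildAgentString(String agentName, String robotsAgents,
                                          boolean appendWildcard) {
        List<String> agents = new ArrayList<>();
        agents.add(agentName.trim());
        for (String a : robotsAgents.split(",")) {
            String name = a.trim();
            if (!name.isEmpty() && !agents.contains(name)) {
                agents.add(name);
            }
        }
        if (appendWildcard && !agents.contains("*")) {
            agents.add("*"); // the extra wildcard described in this issue
        }
        return String.join(",", agents);
    }

    // Simplified stand-in for a robots.txt parser (an assumption, not the
    // crawler-commons algorithm): returns the index of the first User-agent
    // group whose name equals any name in the agent string, or -1 if none.
    public static int firstMatchingGroup(List<String> groupAgents, String agentString) {
        List<String> names = Arrays.asList(agentString.toLowerCase().split(","));
        for (int i = 0; i < groupAgents.size(); i++) {
            if (names.contains(groupAgents.get(i).toLowerCase())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed robots.txt: group 0 is "User-agent: *" (Disallow: /),
        // group 1 is "User-agent: mybot" (Allow: /).
        List<String> groups = Arrays.asList("*", "mybot");

        String withWildcard = buildAgentString("mybot", "mybot,otherbot", true);
        String withoutWildcard = buildAgentString("mybot", "mybot,otherbot", false);

        System.out.println(withWildcard + " -> group "
            + firstMatchingGroup(groups, withWildcard));
        System.out.println(withoutWildcard + " -> group "
            + firstMatchingGroup(groups, withoutWildcard));
    }
}
```

Under these assumptions, the agent string with the trailing wildcard selects 
the generic "*" group (and its Disallow), while the string without it selects 
the "mybot" group that allows the URL.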

This issue was reported by [~markus17]. The discussion on nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
