[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
-----------------------------------

    Attachment: fixNpeRobotRuleSet.patch

Fixes the NPE in RobotRuleSet that occurs when an empty RuleSet is used.

> if a 404 for a robots.txt is returned no page is fetched at all from the host
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-298
>          URL: http://issues.apache.org/jira/browse/NUTCH-298
>      Project: Nutch
>         Type: Bug
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the
> robots.txt.
> If the HTTP response code is neither 200 nor 403 but, for example, 404, we do
> "robotRules = EMPTY_RULES;" (line: 402).
> EMPTY_RULES is a RobotRuleSet created with the default constructor, so
> tmpEntries and entries are null and never change.
> If we now try to fetch a page from a host that uses EMPTY_RULES, we call
> isAllowed in the RobotRuleSet.
> In this case an NPE is thrown at this line:
> if (entries == null) {
> entries = new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove the other null checks
> and initializations.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
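The failure described in the issue above, and the proposed fix, can be illustrated with a minimal, self-contained sketch. This is not the Nutch source and not the content of fixNpeRobotRuleSet.patch: the class names RobotRuleSetSketch, LazyRuleSet, EagerRuleSet and Entry, the addPrefix method, and the prefix-matching logic are all hypothetical stand-ins. Only the field names tmpEntries and entries and the method name isAllowed are taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the reported problem; not the Nutch code.
public class RobotRuleSetSketch {

    /** Stand-in for a rule entry: a path prefix and whether it is allowed. */
    public static final class Entry {
        final String prefix;
        final boolean allowed;

        Entry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    /** Models the reported behavior: tmpEntries stays null until a rule is added. */
    public static class LazyRuleSet {
        private List<Entry> tmpEntries; // null after the default constructor
        private Entry[] entries;        // null until the first isAllowed call

        public void addPrefix(String prefix, boolean allowed) {
            if (tmpEntries == null) {
                tmpEntries = new ArrayList<>();
            }
            tmpEntries.add(new Entry(prefix, allowed));
            entries = null; // force a rebuild on the next lookup
        }

        public boolean isAllowed(String path) {
            if (entries == null) {
                // With an empty rule set tmpEntries is still null here,
                // so tmpEntries.size() throws a NullPointerException.
                entries = tmpEntries.toArray(new Entry[tmpEntries.size()]);
            }
            for (Entry e : entries) {
                if (path.startsWith(e.prefix)) {
                    return e.allowed;
                }
            }
            return true; // no matching rule: allowed
        }
    }

    /** Models the proposed fix: tmpEntries is initialized by default. */
    public static class EagerRuleSet {
        private final List<Entry> tmpEntries = new ArrayList<>();
        private Entry[] entries;

        public void addPrefix(String prefix, boolean allowed) {
            tmpEntries.add(new Entry(prefix, allowed));
            entries = null;
        }

        public boolean isAllowed(String path) {
            if (entries == null) {
                // Safe for an empty rule set: this yields a zero-length array.
                entries = tmpEntries.toArray(new Entry[tmpEntries.size()]);
            }
            for (Entry e : entries) {
                if (path.startsWith(e.prefix)) {
                    return e.allowed;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        try {
            new LazyRuleSet().isAllowed("/index.html");
            System.out.println("lazy: no exception");
        } catch (NullPointerException e) {
            System.out.println("lazy: NullPointerException on empty rule set");
        }
        System.out.println("eager: allowed = " + new EagerRuleSet().isAllowed("/index.html"));
    }
}
```

In this sketch the eager variant treats an empty rule set as "everything allowed", which matches the intent that a host whose robots.txt returns 404 should still be fetched. Whether the real patch makes the field final or removes the same null checks is not stated in the report.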