[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
-----------------------------------
Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a
404 for a robots.txt is returned no page is fetched at all from the host)
Sorry, wrong description.
> if a 404 for a robots.txt is returned a NPE is thrown
> -----------------------------------------------------
>
> Key: NUTCH-298
> URL: http://issues.apache.org/jira/browse/NUTCH-298
> Project: Nutch
> Type: Bug
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
> Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the
> robots.txt.
> If the HTTP response code is not 200 or 403 but, for example, 404, we do
> "robotRules = EMPTY_RULES;" (line: 402).
> EMPTY_RULES is a RobotRuleSet created with the default constructor.
> tmpEntries and entries are null and are never changed.
> If we now try to fetch a page from that host, EMPTY_RULES is used
> and we call isAllowed in the RobotRuleSet.
> In this case an NPE is thrown in this line:
> if (entries == null) {
> entries= new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove the other null checks
> and initialisations.
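
The proposed fix could look roughly like the sketch below. This is only an illustration and not the content of fixNpeRobotRuleSet.patch: the class RobotRuleSetSketch, the method addPrefix, and the simplified RobotsEntry are hypothetical stand-ins, and the prefix matching is an assumption made to keep the example small. The point it shows is that eagerly initialising tmpEntries lets a rule set built with the default constructor (as EMPTY_RULES is) pass through isAllowed without a null check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for Nutch's RobotRuleSet; not the real class. */
class RobotRuleSetSketch {

    /** One path-prefix rule (simplified for illustration). */
    static final class RobotsEntry {
        final String prefix;
        final boolean allowed;

        RobotsEntry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // Initialised by default, so a rule set created with the default
    // constructor never holds a null list.
    private final List<RobotsEntry> tmpEntries = new ArrayList<>();

    // Lazily built array view of tmpEntries; null means "needs rebuilding".
    private RobotsEntry[] entries;

    /** Hypothetical helper for adding a rule; invalidates the cached array. */
    public void addPrefix(String prefix, boolean allowed) {
        tmpEntries.add(new RobotsEntry(prefix, allowed));
        entries = null;
    }

    /** Returns true unless the first matching prefix rule disallows the path. */
    public boolean isAllowed(String path) {
        if (entries == null) {
            // Safe even for an empty rule set: tmpEntries is never null.
            entries = tmpEntries.toArray(new RobotsEntry[0]);
        }
        for (RobotsEntry entry : entries) {
            if (path.startsWith(entry.prefix)) {
                return entry.allowed;
            }
        }
        return true; // no rule matched, so the path is allowed
    }

    public static void main(String[] args) {
        // Mirrors the EMPTY_RULES case: default constructor, no rules added.
        RobotRuleSetSketch emptyRules = new RobotRuleSetSketch();
        System.out.println(emptyRules.isAllowed("/index.html"));
    }
}
```

With the list initialised up front, the empty rule set simply yields a zero-length array and allows every path, which matches the intent of using EMPTY_RULES when robots.txt returns a 404.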