[
http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365446 ]
M.Oliver Scheele commented on NUTCH-205:
----------------------------------------
Thanks for the comment.
I'm using the standard properties in my configuration (which shouldn't be
an improper setup, since they are the defaults ;)):
fetcher.threads.per.host=1
http.max.delays=3
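For reference, those are the stock values; here is a minimal sketch of how
fetcher-side code would read them (the NutchConf class and its getInt()
signature are assumptions based on the 0.7-era configuration API):
import org.apache.nutch.util.NutchConf;

public class ConfigPeek {
  public static void main(String[] args) {
    // Read the two properties with their stock defaults.
    NutchConf conf = NutchConf.get();
    int threadsPerHost = conf.getInt("fetcher.threads.per.host", 1);
    int maxDelays = conf.getInt("http.max.delays", 3);
    System.out.println("threads.per.host=" + threadsPerHost
        + ", max.delays=" + maxDelays);
  }
}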
Here is my deeper analysis/debugging, which should convince you that this
may indeed be a bug:
In the class Http.java of the protocol-http plugin (line 125):
if (delays == MAX_DELAYS)
throw new RetryLater(url, "Exceeded http.max.delays: retry later.");
That's where the code ends up, and that's totally OK: the current request
is stopped and the RetryLater exception is thrown.
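For context, that check sits inside the per-host blocking loop; roughly
like the following (a paraphrased sketch, not the verbatim 0.7 source;
waitForHost and tryToBlockHost are hypothetical names for illustration):
// Paraphrased sketch of the delay logic around Http.java line 125.
private void waitForHost(java.net.URL url)
    throws RetryLater, InterruptedException {
  int delays = 0;
  while (true) {
    long sleepTime = tryToBlockHost(url); // hypothetical: 0 if host is free
    if (sleepTime == 0)
      return; // host reserved, go fetch
    if (delays == MAX_DELAYS) // MAX_DELAYS mirrors http.max.delays
      throw new RetryLater(url, "Exceeded http.max.delays: retry later.");
    Thread.sleep(sleepTime); // back off, then try again
    delays++;
  }
}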
The problem occurs when running the UpdateDatabaseTool (line 119):
} else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
    page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo); // retry later
} else {
  pageGone(fo); // give up: page is gone
}
Here, fo.getProtocolStatus().getCode() is not equal to ProtocolStatus.RETRY,
so pageGone(fo) is called instead.
Summary:
The RetryLater exception isn't translated into the correct protocol status
of the FetcherOutput object.
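A fix therefore belongs in the fetcher: when RetryLater is caught, the
FetcherOutput should be written with ProtocolStatus.RETRY so that the
branch above matches. A minimal sketch of the intended mapping (the catch
site, setProtocolStatus(), and the ProtocolStatus constructor are
assumptions about the 0.7 API, not verified signatures):
try {
  // ... fetch the page via the protocol plugin ...
} catch (RetryLater e) {
  // Record RETRY so UpdateDatabaseTool takes the pageRetry() branch
  // instead of falling through to pageGone().
  fetcherOutput.setProtocolStatus(
      new ProtocolStatus(ProtocolStatus.RETRY, e.getMessage()));
}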
> Wrong 'fetch date' for non available pages
> ------------------------------------------
>
> Key: NUTCH-205
> URL: http://issues.apache.org/jira/browse/NUTCH-205
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.7, 0.7.1
> Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
> Reporter: M.Oliver Scheele
>
> Web pages that couldn't be fetched because of a time-out won't be
> refetched anymore.
> The next fetch date in the web-db is set to Long.MAX_VALUE.
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
> 60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
> failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
> Exceeded http.max.delays: retry later.
> That seems to be OK and indicates some network problems.
> The problem is that the entry in the web-db shows the following:
> Page 4: Version: 4
> URL: http://www.test-domain.de/crawl_html/page_2.html
> ID: b360ec931855b0420776909bd96557c0
> Next fetch: Sun Aug 17 07:12:55 CET 292278994
> Retries since fetch: 0
> Retry interval: 0 days
> The 'Next fetch' date is set to the year 292278994.
> I probably won't live to see that refetch. ;)
> A page that couldn't be crawled because of network problems
> should be refetched with the next crawl (i.e., set the next fetch date
> to the current time + 1h).
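> In code, that rescheduling would amount to something like the following
> sketch (setNextFetchTime() is assumed from the Page API of this version):
> // Sketch: reschedule one hour ahead instead of Long.MAX_VALUE.
> long oneHour = 60L * 60L * 1000L; // milliseconds
> page.setNextFetchTime(System.currentTimeMillis() + oneHour);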
> Possible bug fix:
> ----------------------------
> When updating the web-db, the method updateForSegment() in
> UpdateDatabaseTool always sets the fetch date to Long.MAX_VALUE
> for any (unknown) exception that occurred during fetching.
> The RETRY status is not always set correctly.
> Change the following lines:
> } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>     page.getRetriesSinceFetch() < MAX_RETRIES) {
>   pageRetry(fo); // retry later
> } else {
>   pageGone(fo); // give up: page is gone
> }
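> A sketch of one possible change (untested; ProtocolStatus.EXCEPTION is
> assumed to exist in this version): also route exception results into
> pageRetry() while the retry budget allows:
> } else if ((fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY ||
>     fo.getProtocolStatus().getCode() == ProtocolStatus.EXCEPTION) &&
>     page.getRetriesSinceFetch() < MAX_RETRIES) {
>   pageRetry(fo); // retry later
> } else {
>   pageGone(fo); // give up: page is gone
> }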
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers