Sebastian Nagel created NUTCH-3177:
--------------------------------------
Summary: Fetcher to report idle threads not as hung threads
Key: NUTCH-3177
URL: https://issues.apache.org/jira/browse/NUTCH-3177
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.22
Reporter: Sebastian Nagel
Fix For: 1.23
If no URL is fetched during half of the MapReduce task timeout, Fetcher
shuts down to prevent the fetcher map task from failing because of missing
progress. Before the shut-down, Fetcher reports the remaining FetcherThreads as
"hung threads" together with the URL each thread is fetching. This is meant to
help debug the URLs / pages causing timeouts. For the reporting, the field
{{reprUrl}} of FetcherThread is used. However, the field is not reset after a
fetch is done. As a consequence, the reported URL is not necessarily one whose
fetch is still in progress. It might be the URL that was fetched last, while
the thread is now idle and waiting for the next fetch item to be ready. This
happens if there are still fetch queues, but with long delays because of a
robots.txt Crawl-delay or a longer delay because of the exponential back-off.
FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle
FetcherThreads should not be reported as hung.
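
The proposed behaviour could look roughly like the following minimal sketch. It
is not the Nutch implementation: the classes {{Worker}} and
{{HungThreadReportSketch}}, the method {{reportHung}}, and the use of
{{null}} to mark an idle thread are assumptions made for illustration. Only the
field name {{reprUrl}} is taken from FetcherThread.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical stand-ins, not the actual Nutch classes. */
final class HungThreadReportSketch {

  /** Hypothetical, simplified stand-in for FetcherThread. */
  static class Worker {
    final String name;
    // URL of the fetch in progress, or null while the worker is idle (assumption).
    private volatile String reprUrl;

    Worker(String name) {
      this.name = name;
    }

    /** Runs one fetch and always clears reprUrl afterwards. */
    void fetch(String url, Runnable fetchAction) {
      reprUrl = url;
      try {
        fetchAction.run();
      } finally {
        // The reset described in this issue: once the fetch is finished
        // (successfully or not), the worker no longer "owns" a URL.
        reprUrl = null;
      }
    }

    String getReprUrl() {
      return reprUrl;
    }
  }

  /** Reports only workers with a fetch in progress; idle workers are skipped. */
  static List<String> reportHung(List<Worker> workers) {
    List<String> report = new ArrayList<>();
    for (Worker w : workers) {
      String url = w.getReprUrl(); // read once, the field may change concurrently
      if (url != null) {
        report.add("Thread " + w.name + " hung while fetching " + url);
      }
    }
    return report;
  }
}
```

With this shape, a worker that has finished its last fetch and is waiting for
the next fetch item has {{reprUrl == null}} and does not show up in the report,
while a worker blocked inside {{fetch}} is still listed with its URL.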