Sebastian Nagel created NUTCH-3177:
--------------------------------------

             Summary: Fetcher to report idle threads not as hung threads
                 Key: NUTCH-3177
                 URL: https://issues.apache.org/jira/browse/NUTCH-3177
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.22
            Reporter: Sebastian Nagel
             Fix For: 1.23


If no URL is fetched during half of the MapReduce task timeout, Fetcher shuts 
down to prevent the fetcher map task from failing because of missing progress. 
Before the shutdown, Fetcher reports the remaining FetcherThreads as "hung 
threads" together with the URL being fetched. This is meant to help debug the 
URLs / pages causing timeouts. The report uses the field {{reprUrl}} of 
FetcherThread. However, the field is not reset after a fetch is done. As a 
consequence, the reported URL is not necessarily one whose fetch is in 
progress. It might be the URL that was fetched last, while the thread is now 
idle and waiting for the next fetch item to be ready. This happens if there are 
still fetch queues, but with long delays because of a robots.txt Crawl-delay or 
a longer delay because of the exponential back-off.

FetcherThread should reset the {{reprUrl}} once a fetch is finished. Idle 
FetcherThreads shouldn't be reported as hung.
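
The proposed behaviour could look roughly like the sketch below. It is only an 
illustration, not the actual Nutch code: the classes SketchFetcherThread and 
HungThreadReport and the methods fetch() and hungUrls() are hypothetical 
stand-ins, and only the field name reprUrl is taken from FetcherThread. The 
sketch assumes that a null reprUrl is an acceptable way to mark a thread as idle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FetcherThread; not the actual Nutch class.
class SketchFetcherThread {
    // URL of the fetch in progress, or null while the thread is idle (assumption).
    private volatile String reprUrl;

    // Runs one fetch; doFetch stands in for the real protocol call.
    void fetch(String url, Runnable doFetch) {
        reprUrl = url;
        try {
            doFetch.run();
        } finally {
            // Reset once the fetch is finished, whether it succeeded or failed,
            // so an idle thread no longer points at the URL it fetched last.
            reprUrl = null;
        }
    }

    String getReprUrl() {
        return reprUrl;
    }
}

// Hypothetical helper that collects what would be reported before the shutdown.
class HungThreadReport {
    // Returns the URLs of threads that are in the middle of a fetch;
    // idle threads (reprUrl == null) are skipped.
    static List<String> hungUrls(List<SketchFetcherThread> threads) {
        List<String> urls = new ArrayList<>();
        for (SketchFetcherThread t : threads) {
            String url = t.getReprUrl();
            if (url != null) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

Resetting the field in a finally block means the URL is also cleared when the 
fetch throws, so only threads with a fetch in progress would show up in the 
"hung threads" report.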



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
