Fetcher incorrectly reports task progress to tasktracker resulting in skipped 
URLs
----------------------------------------------------------------------------------

                 Key: NUTCH-331
                 URL: http://issues.apache.org/jira/browse/NUTCH-331
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8-dev, 0.9-dev
            Reporter: Andrzej Bialecki 
            Priority: Critical
             Fix For: 0.8-dev, 0.9-dev


Each Fetcher task starts multiple FetcherThreads, which consume the input 
fetchlist. These threads may block for a long time after being started and 
after reading their input fetchlist entries, due to "politeness" settings. 
However, the map-reduce framework considers the task as complete when all input 
data is read.

This causes the tasktracker to incorrectly assume that task processing is 
complete (because the task progress is 1.0, since all input has been consumed), 
whereas many URLs from the fetchlist may still be waiting to be fetched in 
blocked threads. The more threads are used, the more apparent this problem 
becomes, because the final number of fetched pages may fall short of the target 
number by as many as (numThreads * numMapTasks) entries.

The end result is that only part of the fetchlist is fetched, because Fetcher 
map tasks are stopped once their progress reaches 1.0.
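
The shortfall described above can be illustrated with a small toy model. This 
is only a sketch, not Nutch or Hadoop code: the function names, the tick-based 
timing, and the politeness_ticks parameter are all assumptions made for 
illustration. It models one map task whose threads each take a fetchlist entry 
and then wait, and it stops the task as soon as all input has been read.

```python
def simulate_map_task(num_urls, num_threads, politeness_ticks=3):
    """Toy model of one Fetcher map task (hypothetical, not Nutch code).

    Returns (fetched, in_flight) at the moment all input has been read,
    i.e. when input-based progress reaches 1.0.
    """
    if num_threads < 1 or politeness_ticks < 1:
        raise ValueError("num_threads and politeness_ticks must be >= 1")

    read = 0       # fetchlist entries handed to threads so far
    fetched = 0    # entries whose fetch has actually completed
    # Remaining wait per thread; None means the thread is idle.
    remaining = [None] * num_threads

    while read < num_urls:
        # Idle threads each take the next fetchlist entry.
        for i in range(num_threads):
            if remaining[i] is None and read < num_urls:
                remaining[i] = politeness_ticks
                read += 1
        if read == num_urls:
            break  # progress == 1.0: the task is treated as complete here
        # Otherwise one tick passes; threads whose wait expires complete.
        for i in range(num_threads):
            if remaining[i] is not None:
                remaining[i] -= 1
                if remaining[i] == 0:
                    remaining[i] = None
                    fetched += 1

    in_flight = sum(1 for r in remaining if r is not None)
    return fetched, in_flight


def worst_case_shortfall(num_threads, num_map_tasks):
    """Upper bound quoted in the report: numThreads * numMapTasks."""
    return num_threads * num_map_tasks
```

Under these assumptions, simulate_map_task(8, 4) returns (4, 4): progress is 
1.0, yet four entries are still held by blocked threads and would be skipped. 
With three such map tasks, worst_case_shortfall(4, 3) gives 12 entries.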

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
