[ http://issues.apache.org/jira/browse/NUTCH-331?page=comments#action_12452194 ]

Doğacan Güney commented on NUTCH-331:
-------------------------------------

You obviously know about this a lot more than I do, but looking at the fetcher code I can't see how this is possible. The Javadoc for MapRunnable.run() says: "Called to execute mapping. Mapping is complete when this returns." Fetcher.run() (which implements MapRunnable.run()) returns only when activeThreads is zero (or when the fetcher has not made a request for a very long time, which seems irrelevant here).

Unless there is an exception, activeThreads is only decremented when an individual FetcherThread tries to get a new URL from the fetchlist and finds that there is no more input (which means that particular FetcherThread is not currently fetching anything). So, AFAICS, task progress can be 1.0, but the fetcher will still wait until all threads finish fetching and decrement activeThreads.

> Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-331
>                 URL: http://issues.apache.org/jira/browse/NUTCH-331
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9.0
>            Reporter: Andrzej Bialecki
>            Priority: Critical
>             Fix For: 0.9.0
>
>
> Each Fetcher task starts multiple FetcherThreads, which consume the input
> fetchlist. These threads may block for a long time after being started and
> after reading their input fetchlist entries, due to "politeness" settings.
> However, the map-reduce framework considers the task as complete when all
> input data is read.
> This causes the tasktracker to incorrectly assume that task processing is
> complete (because the task progress is 1.0, since all input has been
> consumed), whereas many URLs from the fetchlist may still be waiting for
> fetching, in blocked threads. The more threads are used, the more apparent
> this problem is, because the final number of fetched pages may be short of the
> target number by as many as (numThreads * numMapTasks) entries.
> The final result of this is that only a part of the fetchlist is fetched,
> because Fetcher map tasks are stopped when their progress is 1.0.
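
To make the argument in the comment concrete, here is a minimal, self-contained Java sketch of the wait loop as I read it. It is not the actual Nutch Fetcher code: the class FetcherWaitSketch, its methods, the in-memory queue standing in for the fetchlist, and the sleep standing in for the politeness delay are all hypothetical. The only thing borrowed from the discussion is the idea that run() returns once activeThreads reaches zero, and that a thread gives up its slot only after it finds the input exhausted (for brevity the sketch also decrements on interruption).

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the Fetcher; not the real Nutch class.
class FetcherWaitSketch {
    // Worker threads that have not yet seen the end of the input.
    private final AtomicInteger activeThreads = new AtomicInteger(0);
    // URLs whose simulated fetch has completed.
    private final AtomicInteger fetched = new AtomicInteger(0);
    // In-memory stand-in for the fetchlist.
    private final Queue<String> fetchlist = new ConcurrentLinkedQueue<>();

    public FetcherWaitSketch(int numUrls) {
        for (int i = 0; i < numUrls; i++) {
            fetchlist.add("http://example.com/page" + i);
        }
    }

    // Body of one worker: take a URL, block for the "politeness" delay,
    // count it as fetched, and repeat until no input is left.
    private void fetcherThread(long politenessDelayMs) {
        try {
            while (fetchlist.poll() != null) {
                // The input may already be fully consumed at this point,
                // yet this URL is still waiting to be fetched.
                Thread.sleep(politenessDelayMs);
                fetched.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Reached only after this thread has nothing left to fetch.
            activeThreads.decrementAndGet();
        }
    }

    // Analogue of run(): returns only once every worker has finished.
    public int run(int numThreads, long politenessDelayMs) {
        // Set the count before starting workers so the loop below
        // cannot observe zero prematurely.
        activeThreads.set(numThreads);
        for (int i = 0; i < numThreads; i++) {
            new Thread(() -> fetcherThread(politenessDelayMs)).start();
        }
        while (activeThreads.get() > 0) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        FetcherWaitSketch sketch = new FetcherWaitSketch(20);
        int count = sketch.run(4, 50);
        System.out.println("fetched " + count + " of 20");
    }
}
```

In this sketch the queue is drained well before the last sleep finishes, which corresponds to progress reaching 1.0, yet run() still returns the full count because it waits on activeThreads rather than on input consumption. If the framework stops the task from outside as soon as progress hits 1.0, this loop would never get the chance to finish, and up to one URL per thread could be lost, which is consistent with the (numThreads * numMapTasks) bound quoted above.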
