Sebastian Nagel created NUTCH-2652:
--------------------------------------

Summary: Fetcher launches more fetch tasks than fetch lists
Key: NUTCH-2652
URL: https://issues.apache.org/jira/browse/NUTCH-2652
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.15
Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master.
Seen for the first time just now, although this setup has been running with Nutch 1.15 for two months. However, the constraints that cause inputs to be split may change from run to run.
Reporter: Sebastian Nagel
Fix For: 1.16

Fetcher may launch more fetcher tasks than there are fetch lists:

{noformat}
18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
{noformat}

One design principle of Nutch as a MapReduce-based crawler is that, to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list may then send requests to the same host/domain/IP in parallel.

See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
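
To illustrate how 128 input paths can end up as more than 128 splits, here is a minimal sketch of the split-size arithmetic used by Hadoop's FileInputFormat, modeled in Python for brevity. The block size and the fetch list sizes are hypothetical, picked only so that the total matches the 187 splits in the log above; they are not the sizes from the affected job. The function names are illustrative and not part of Hadoop or Nutch.

```python
# Simplified model of how Hadoop's FileInputFormat counts input splits.
# All sizes below are hypothetical and chosen for illustration only.

SPLIT_SLOP = 1.1  # slack factor: a last chunk up to 10% larger is not split off


def compute_split_size(block_size, min_size, max_size):
    """Split size is the block size, clamped to [min_size, max_size]."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_length, split_size, splittable=True):
    """Number of splits produced for one input file (one fetch list)."""
    if file_length == 0 or not splittable:
        return 1  # empty or non-splittable files yield exactly one split
    splits = 0
    remaining = file_length
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining != 0:
        splits += 1  # the tail of the file becomes the last split
    return splits


MB = 1024 * 1024
block_size = 128 * MB  # assumed HDFS block size
split_size = compute_split_size(block_size, min_size=1, max_size=2**63 - 1)

# Hypothetical fetch lists: 69 fit into one split, 59 exceed 1.1 blocks.
fetch_lists = [100 * MB] * 69 + [150 * MB] * 59

total_default = sum(count_splits(n, split_size) for n in fetch_lists)
total_unsplit = sum(
    count_splits(n, split_size, splittable=False) for n in fetch_lists
)

print(len(fetch_lists), total_default, total_unsplit)
```

In this model, every fetch list larger than 1.1 times the split size is cut into several splits, so the number of fetcher tasks exceeds the number of fetch lists. Treating fetch lists as non-splittable restores one task per fetch list, which is the invariant the politeness guarantee depends on.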