Hi Joseph,

you're right, the mapper does not do much: all potentially heavy
computations in the index or scoring filters are run in the reduce step.

> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284

There are 5 billion records passed through the map step:

  Map input records=5115370813
  Map output records=5115370813
  Reduce input records=5115370813
  Reduce output records=2401924

That would mean that either your segment contains a large number of
"unindexable" documents or crawldb and/or linkdb are quite large.
In the latter case, you could try not to use them for indexing.
LinkDb is optional since long, for the CrawlDb there is
  https://issues.apache.org/jira/browse/NUTCH-2184

Sebastian

On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> Hi folks,
>
> I'm in the process of indexing a large number of docs using Nutch 1.11 and
> the indexer-elastic plugin. I've observed slow indexing performance and
> narrowed it down to the map phase and first part of the reduce phase taking
> 80% of the total runtime per segment. Here are some statistics:
>
> - Average segment contains around 2.4M "indexable" URLs, meaning
> successfully fetched and parsed.
> - Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2
> machines.
> - Time to index 2.4M URLs (one segment): around 3.25 hours.
> - Actual time spent sending docs to Elasticsearch: .75 hours
> - No additional indexing options specified, i.e. *not* filtering or
> normalizing URLs, etc.
>
> This means that for every segment, 2.5 hours is spent splitting inputs,
> shuffling, whatever else Hadoop does, and only about 40 minutes is spent
> actually sending docs to ES. From another perspective, that means we can
> expect an indexing rate of 1000 docs/sec, but the effective rate is only 200
> docs/sec.
>
> I fully understand Nutch's indexer code, so I know that it actually does
> very little in both the Map and Reduce phase (the map phase does almost
> nothing since I'm not filtering/normalizing URLs), so my best guess is that
> there's just a ton of Hadoop overhead. Is it possible to optimize this?
>
> I've included below a link to a gist containing job output and counters for
> a single segment, hoping that it will provide some hints. For example, is it
> normal that indexing segments of this size requires > 5000 input splits? I
> imagine that's far too many Map tasks.
>
> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>
> Thanks for taking a look,
> Joe
>
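
For reference, the numbers in this thread can be checked with a few lines
of arithmetic. The sketch below is only an illustration: the counter values
and timings are copied from the messages above, while the variable names are
made up for this example and are not part of Nutch or Hadoop.

```python
# Counter values copied from the job output quoted in this thread.
map_input_records = 5_115_370_813
reduce_output_records = 2_401_924

# Timings per segment as reported in the original message (hours).
total_hours = 3.25          # whole indexing job
elasticsearch_hours = 0.75  # time spent sending docs to Elasticsearch

# How many records pass through map/shuffle for each document
# that is finally written to the index.
records_per_indexed_doc = map_input_records / reduce_output_records

# Indexing rates in documents per second.
effective_rate = reduce_output_records / (total_hours * 3600)
elasticsearch_rate = reduce_output_records / (elasticsearch_hours * 3600)

print(f"records shuffled per indexed doc: {records_per_indexed_doc:.0f}")
print(f"effective rate: {effective_rate:.0f} docs/sec")
print(f"Elasticsearch-only rate: {elasticsearch_rate:.0f} docs/sec")
```

With these inputs, roughly two thousand records are read and shuffled for
every document that ends up in the index, which fits the suggestion that
most of the runtime goes into moving CrawlDb and/or LinkDb records rather
than into the indexing itself. The two rates come out close to the 200 and
1000 docs/sec estimated in the original message.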