Re: improving distributed indexing performance

Sebastian Nagel Mon, 13 Jun 2016 10:35:23 -0700

Hi Joseph,

you're right, the mapper does not do much: all potentially
heavy computations in the indexing or scoring filters are run
in the reduce step.


> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
More than 5 billion records are passed through the map step:
 Map input records=5115370813
 Map output records=5115370813
 Reduce input records=5115370813
 Reduce output records=2401924

That would mean that either your segment contains a large number of
"unindexable" documents, or the CrawlDb and/or LinkDb are quite large.
In the latter case, you could try not using them for indexing.
The LinkDb has been optional for a long time; for the CrawlDb there is
  https://issues.apache.org/jira/browse/NUTCH-2184
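
As a rough sanity check on the counters above, here is a minimal Python
sketch (the variable names are only illustrative) that compares the number
of records entering the job with the number of documents actually indexed.
It comes out at roughly two thousand input records per indexed document:

```python
# Counter values copied from the job output quoted above.
map_input_records = 5_115_370_813
reduce_output_records = 2_401_924

# Fraction of the records entering the job that end up as indexed documents.
indexed_fraction = reduce_output_records / map_input_records

# Number of input records read for each document that is actually indexed.
records_per_doc = map_input_records / reduce_output_records

print(f"indexed fraction: {indexed_fraction:.4%}")
print(f"input records per indexed document: {records_per_doc:.0f}")
```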

Sebastian

On 06/13/2016 06:55 PM, Joseph Naegele wrote:
> Hi folks,
> 
> I'm in the process of indexing a large number of docs using Nutch 1.11 and
> the indexer-elastic plugin. I've observed slow indexing performance and
> narrowed it down to the map phase and first part of the reduce phase taking
> 80% of the total runtime per segment. Here are some statistics:
> 
> - Average segment contains around 2.4M "indexable" URLs, meaning
> successfully fetched and parsed.
> - Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2
> machines.
> - Time to index 2.4M URLs (one segment): around 3.25 hours.
> - Actual time spent sending docs to Elasticsearch: .75 hours
> - No additional indexing options specified, i.e. *not* filtering or
> normalizing URLs, etc.
> 
> This means that for every segment, 2.5 hours is spent splitting inputs,
> shuffling, whatever else Hadoop does, and only about 45 minutes is spent
> actually sending docs to ES. From another perspective, that means we can
> expect an indexing rate of 1000 docs/sec, but the effective rate is only 200
> docs/sec.
> 
> I fully understand Nutch's indexer code, so I know that it actually does
> very little in both the Map and Reduce phase (the map phase does almost
> nothing since I'm not filtering/normalizing URLs), so my best guess is that
> there's just a ton of Hadoop overhead. Is it possible to optimize this?
> 
> I've included below a link to a gist containing job output and counters for
> a single segment, hoping that it will provide some hints. For example, is it
> normal that indexing segments of this size requires > 5000 input splits? I
> imagine that's far too many Map tasks.
> 
> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
> 
> Thanks for taking a look,
> Joe
>
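
The rates in the quoted message can be reproduced from its own figures.
This is only a sketch, assuming exactly 2.4M documents per segment (the
message says "around 2.4M"), so the results are approximate:

```python
# Figures taken from the quoted message; docs_per_segment is rounded there.
docs_per_segment = 2_400_000
total_hours = 3.25   # total time to index one segment
es_hours = 0.75      # time spent sending docs to Elasticsearch

# Time spent in the map phase, shuffle and other work before sending.
overhead_hours = total_hours - es_hours

# Rate while documents are actually being sent to Elasticsearch.
send_rate = docs_per_segment / (es_hours * 3600)

# Effective rate over the whole job.
effective_rate = docs_per_segment / (total_hours * 3600)

print(f"overhead: {overhead_hours} h")
print(f"send rate: {send_rate:.0f} docs/sec")
print(f"effective rate: {effective_rate:.0f} docs/sec")
```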
