> 1. A related question: Is it possible to index without a crawldb but still 
> use LinkRank scores?

ElasticSearch and Solr nowadays support field-level (partial) updates. Given that
the LinkRank calculation is computationally expensive and is not rerun after every
fetched segment, it may be better to push the scores from the CrawlDb into the index
in a separate, later update step.  Afaik, this still needs to be implemented in Nutch.
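
For illustration, such an update could be sent as a partial-document update against
the _update endpoint, roughly like below (index name, type and document id are just
placeholders, and "boost" is assumed to be the field holding the score):

  POST /nutch/doc/<doc-id>/_update
  {
    "doc": { "boost": 0.0123 }
  }

What's missing is a job that reads the scores from the CrawlDb and emits one such
update per indexed document.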

> 2. ... I should use an "identity scoring filter", that just reads the score 
> from the CrawlDatum,

That's done by using the plugin "scoring-link", which simply passes the scores read
from the CrawlDb through to wherever they are needed (including the indexing step).
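
For example, in nutch-site.xml you'd list scoring-link instead of scoring-opic in
plugin.includes (the value below is only a sketch; keep whatever other plugins you
already have enabled, e.g. indexer-elastic):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-link|urlnormalizer-(pass|regex|basic)</value>
  </property>

This assumes the LinkRank scores have already been written back to the CrawlDb by the
WebGraph ScoreUpdater.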

On 06/13/2016 10:15 PM, Joseph Naegele wrote:
> Sebastian,
> 
> Thanks! That explains a lot. We're computing LinkRank and I don't specify a 
> LinkDB to the indexer. Our CrawlDB is very large however, so yes I'm very 
> interested in NUTCH-2184. I'm planning to finish helping with 
> https://github.com/apache/nutch/pull/95.
> 
> 1. A related question: Is it possible to index without a crawldb but still 
> use LinkRank scores? This is exactly what I need to do.
> 
> 2. On a similar note, I believe there is another issue related to indexing 
> with LinkRank scores. If no scoring plugins are configured, then 
> IndexerMapReduce sets each document's "boost" value to 1.0f. I was under the 
> impression I shouldn't use a scoring filter when computing LinkRank (see 
> http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), but in 
> reality I should use an "identity scoring filter", that just reads the score 
> from the CrawlDatum, correct? (If so, then I've answered question 1: crawldb 
> *is* necessary for indexing with LinkRank)
> 
> Thanks,
> Joe
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
> Sent: Monday, June 13, 2016 13:35
> To: user@nutch.apache.org
> Subject: Re: improving distributed indexing performance
> 
> Hi Joseph,
> 
> you're right, the mapper does not do much; all potentially heavy computations 
> in the indexing or scoring filters are run in the reduce step.
> 
>> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
> There are 5 billion records passed through the map step:
>  Map input records=5115370813
>  Map output records=5115370813
>  Reduce input records=5115370813
>  Reduce output records=2401924
> 
> That would mean that either your segment contains a large number of 
> "unindexable" documents or crawldb and/or linkdb are quite large.
> In the latter case, you could try not to use them for indexing.
> The LinkDb has been optional for quite some time; for the CrawlDb there is
>   https://issues.apache.org/jira/browse/NUTCH-2184
> 
> Sebastian
> 
> On 06/13/2016 06:55 PM, Joseph Naegele wrote:
>> Hi folks,
>>
>> I'm in the process of indexing a large number of docs using Nutch 1.11 
>> and the indexer-elastic plugin. I've observed slow indexing 
>> performance and narrowed it down to the map phase and first part of 
>> the reduce phase taking 80% of the total runtime per segment. Here are some 
>> statistics:
>>
>> - Average segment contains around 2.4M "indexable" URLs, meaning 
>> successfully fetched and parsed.
>> - Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2 
>> machines.
>> - Time to index 2.4M URLs (one segment): around 3.25 hours.
>> - Actual time spent sending docs to Elasticsearch: .75 hours
>> - No additional indexing options specified, i.e. *not* filtering or 
>> normalizing URLs, etc.
>>
>> This means that for every segment, 2.5 hours is spent splitting 
>> inputs, shuffling, whatever else Hadoop does, and only about 40 
>> minutes is spent actually sending docs to ES. From another 
>> perspective, that means we can expect an indexing rate of 1000 
>> docs/sec, but the effective rate is only 200 docs/sec.
>>
>> I fully understand Nutch's indexer code, so I know that it actually 
>> does very little in both the Map and Reduce phase (the map phase does 
>> almost nothing since I'm not filtering/normalizing URLs), so my best 
>> guess is that there's just a ton of Hadoop overhead. Is it possible to 
>> optimize this?
>>
>> I've included below a link to a gist containing job output and 
>> counters for a single segment, hoping that it will provide some hints. 
>> For example, is it normal that indexing segments of this size requires >5000
>> input splits? I imagine that's far too many Map tasks.
>>
>> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>>
>> Thanks for taking a look,
>> Joe
>>
> 
> 
