> 1. A related question: Is it possible to index without a crawldb but still
> use LinkRank scores?

Elasticsearch and Solr nowadays support field-level updates. Given that
the LinkRank calculation is computationally expensive and is not re-run
after a segment has been fetched, it may be better to update the scores
later from the CrawlDb. Afaik, this still needs to be implemented.
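
Such an update could look roughly like the following (an untested
sketch; the index name, type and document id are placeholders, and it
assumes the score ends up in a field named "boost" -- adapt to whatever
your indexing filters actually write):

  # overwrite only the score field of an already indexed document
  curl -XPOST 'http://localhost:9200/nutch/doc/<doc-id>/_update' -d '
  {
    "doc" : { "boost" : 0.85 }
  }'

Solr provides the same via atomic updates (a "set" on a single field).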
> 2. ... I should use an "identity scoring filter", that just reads the score
> from the CrawlDatum,

That's done by using the plugin "scoring-link", which only passes the
scores on from the CrawlDb where needed (including at the indexing step).
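
To enable it, swap the scoring plugin in "plugin.includes", e.g. in
conf/nutch-site.xml (a sketch only; adapt the rest of the list to the
plugins you actually use):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-link|urlnormalizer-(pass|regex|basic)</value>
  </property>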
On 06/13/2016 10:15 PM, Joseph Naegele wrote:
> Sebastian,
>
> Thanks! That explains a lot. We're computing LinkRank and I don't specify a
> LinkDB to the indexer. Our CrawlDB is very large however, so yes I'm very
> interested in NUTCH-2184. I'm planning to finish helping with
> https://github.com/apache/nutch/pull/95.
>
> 1. A related question: Is it possible to index without a crawldb but still
> use LinkRank scores? This is exactly what I need to do.
>
> 2. On a similar note, I believe there is another issue related to indexing
> with LinkRank scores. If no scoring plugins are configured, then
> IndexerMapReduce sets each document's "boost" value to 1.0f. I was under the
> impression I shouldn't use a scoring filter when computing LinkRank (see
> http://www.mail-archive.com/user%40nutch.apache.org/msg14309.html), but in
> reality I should use an "identity scoring filter", that just reads the score
> from the CrawlDatum, correct? (If so, then I've answered question 1: crawldb
> *is* necessary for indexing with LinkRank)
>
> Thanks,
> Joe
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Monday, June 13, 2016 13:35
> To: user@nutch.apache.org
> Subject: Re: improving distributed indexing performance
>
> Hi Joseph,
>
> you're right, the mapper does not do much; all potentially heavy computations
> in the index or scoring filters are run in the reduce step.
>
>> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>
> There are 5 billion records passed through the map step:
>
>   Map input records=5115370813
>   Map output records=5115370813
>   Reduce input records=5115370813
>   Reduce output records=2401924
>
> That would mean that either your segment contains a large number of
> "unindexable" documents or crawldb and/or linkdb are quite large.
> In the latter case, you could try not to use them for indexing.
> LinkDb has been optional for a long time; for the CrawlDb there is
> https://issues.apache.org/jira/browse/NUTCH-2184
>
> Sebastian
>
> On 06/13/2016 06:55 PM, Joseph Naegele wrote:
>> Hi folks,
>>
>> I'm in the process of indexing a large number of docs using Nutch 1.11
>> and the indexer-elastic plugin. I've observed slow indexing
>> performance and narrowed it down to the map phase and the first part of
>> the reduce phase taking 80% of the total runtime per segment. Here are
>> some statistics:
>>
>> - Average segment contains around 2.4M "indexable" URLs, meaning
>>   successfully fetched and parsed.
>> - Using a 9-datanode Hadoop cluster running on 4 CPU, 16 GB RAM EC2
>>   machines.
>> - Time to index 2.4M URLs (one segment): around 3.25 hours.
>> - Actual time spent sending docs to Elasticsearch: 0.75 hours.
>> - No additional indexing options specified, i.e. *not* filtering or
>>   normalizing URLs, etc.
>>
>> This means that for every segment, 2.5 hours is spent splitting
>> inputs, shuffling, and whatever else Hadoop does, and only about 40
>> minutes is spent actually sending docs to ES. From another
>> perspective, that means we could expect an indexing rate of 1000
>> docs/sec, but the effective rate is only 200 docs/sec.
>>
>> I fully understand Nutch's indexer code, so I know that it actually
>> does very little in both the Map and Reduce phases (the map phase does
>> almost nothing since I'm not filtering/normalizing URLs), so my best
>> guess is that there's just a ton of Hadoop overhead. Is it possible to
>> optimize this?
>>
>> I've included below a link to a gist containing job output and
>> counters for a single segment, hoping that it will provide some hints.
>> For example, is it normal that indexing segments of this size requires
>> >5000 input splits? I imagine that's far too many Map tasks.
>>
>> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
>>
>> Thanks for taking a look,
>> Joe
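
Regarding the >5000 input splits mentioned above: the number of map
tasks can usually be reduced by raising the minimum split size. This is
a generic Hadoop 2.x setting, not Nutch-specific; the value and paths
below are placeholders to adapt (untested):

  bin/nutch index \
    -D mapreduce.input.fileinputformat.split.minsize=536870912 \
    crawl/crawldb crawl/segments/20160613123456

Fewer, larger splits mean fewer map tasks and less per-task startup
overhead, at the cost of coarser parallelism.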