Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more complete version, still mostly untested. This should
make searches faster. We'll see how good the results are...
This includes a patch to Lucene to make it easier to write hit
collectors that collect TopDocs.
I'll test this on a 38M document index tomorrow.
I tested it on a 5 million document index.
The original index is considered the "baseline", i.e. it provides the
normative values for scoring and ranking. Its results are compared to
results from the optimized index, recording the scores and positions of
hits in each. Finally, these two hit lists are matched, and relative
differences in scoring and ranking are calculated.
At the end, I calculate the top10, top50 and top100 measures, defined as
the percentage of the top-N hits from the optimized index which match
the top-N hits from the baseline index. Ideally, all these measures
should be 100%, i.e. all top-N hits from the optimized index should
match the corresponding top-N hits from the baseline index.
One variable which greatly affects both the recall and the performance
is the maximum number of hits considered by the TopDocCollector. In my
tests I used values between 1,000 and 500,000 (the latter representing
1/10th of the full index in my case).
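The reason MAX_HITS trades recall for speed is that the sorted index
stores documents in descending order of boost, so a collector that
stops after MAX_HITS postings still sees the highest-boost candidates
first. A minimal sketch of that early-termination idea follows; the
class and method names are illustrative, not the actual Nutch/Lucene
API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hit collector with a MAX_HITS cutoff.
// Because documents arrive in descending boost order in the sorted
// index, stopping early still keeps the most promising candidates.
class LimitedCollector {
    private final int maxHits;
    private final List<Integer> docs = new ArrayList<>();

    LimitedCollector(int maxHits) {
        this.maxHits = maxHits;
    }

    // Returns false once MAX_HITS documents have been collected,
    // signalling the search loop to terminate early.
    boolean collect(int doc) {
        if (docs.size() >= maxHits) return false;
        docs.add(doc);
        return true;
    }

    List<Integer> hits() {
        return docs;
    }
}
```

With a small MAX_HITS the loop touches only a fraction of the postings,
which is where the reported 40-fold speedup comes from; the cost is that
low-boost documents which would have scored well on the query terms are
never considered.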
Now, the results. I collected all test results in a spreadsheet
(OpenDocument or PDF format), you can download it from:
http://www.getopt.org/nutch/20051214/nutchPerf.ods
http://www.getopt.org/nutch/20051214/nutchPerf.pdf
For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e.
queries which executed in e.g. 500 ms now executed in 10-20 ms
(perfRate=40). As intuition suggests, performance drops as MAX_HITS
increases, until it reaches more or less the original level (perfRate=1)
at MAX_HITS=300,000 (for a 5 million document index). Beyond that,
increasing MAX_HITS actually worsens the performance (perfRate << 1),
which can be explained by the fact that the standard HitCollector
doesn't collect as many documents if they score too low.
* Single-term Nutch queries (i.e. those which do not produce Lucene
PhraseQueries) yield relatively good topN values, even for relatively
small values of MAX_HITS; however, MAX_HITS=1000 yields topN=0% across
the board. The minimum useful value for my index was MAX_HITS=10000
(perfRate=30), which yields a quite acceptable top10=90%, but less
acceptable top50 and top100. Please see the spreadsheet for details.
* Two-term Nutch queries result in complex Lucene BooleanQueries over
many index fields, including PhraseQueries. These fared much worse
than single-term queries: the topN values stayed very low until
MAX_HITS was increased to large values, at which point all topN-s
suddenly flipped into the 80-90% range.
I also noticed that the values of topN depended strongly on the document
frequency of terms in the query. For a two-term query, where both terms
have average document frequency, the topN values start from ~50% for low
MAX_HITS. For a two-term query where one of the terms has a very high
document frequency, the topN values start from 0% for low MAX_HITS. See
the spreadsheet for details.
Conclusions: more work is needed... ;-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com