Doug Cutting wrote:

Andrzej Bialecki wrote:

Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the smae queries on the original and on the sorted index.


Here's a more complete version, still mostly untested. This should make searches faster. We'll see how much good the results are...

This includes a patch to Lucene to make it easier to write hit collectors that collect TopDocs.

I'll test this on a 38M document index tomorrow.


I tested it on a 5 mln index.

The original index is considered the "baseline", i.e. it represents normative values for scoring and ranking. These results are compared to results from the optimized index, and scores and positions of hits are also recorded. Finally, these two lists are matched, and relative differences in scoring and ranking are calculated.

At the end, I calculate the top10 %, top50% and top100%, defined as a percent of the top-N hits from the optimized index, which match the top-N hits from the baseline index. Ideally, all these measures should be 100%, i.e. all top-N hits from the optimized index should match corresponding top-N hits from the baseline index.

One variable which affects greatly both the recall and the performance is the maximum number of hits considered by the TopDocCollector. In my tests I used values between 1,000 up to 500,000 (which represents 1/10th of the full index in my case).

Now, the results. I collected all test results in a spreadsheet (OpenDocument or PDF format), you can download it from:

   http://www.getopt.org/nutch/20051214/nutchPerf.ods
   http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. queries, which executed in e.g. 500 ms now executed in 10-20ms (perfRate=40). Following the intuition, performance drops as we increase MAX_HITS, until it reaches a more or less original values (perfRate=1) for MAX_HITS=300000 (for a 5 mln doc index). After that, increasing MAX_HITS actually worsens the performance (perfRate << 1) - which can be explained by the fact that the standard HitCollector doesn't collect as many documents, if they score too low.

* Single-term Nutch queries (i.e. which do not produce Lucene PhraseQueries) yield relatively good values of topN, even for relatively small values of MAX_HITS - however, MAX_HITS=1000 yields all topN=0%. The minimum useful value for my index was MAX_HITS=10000 (perfRate=30), and this yields quite acceptable top10=90%, but less acceptable top50 and top100. Please see the spreadsheet for details.

* Two-term Nutch queries result in complex Lucene BooleanQueries over many index fields, includng also PhraseQueries. These fared much worse than single-term queries: actually, the topN values were very low until MAX_HITS was increased to large values, and then all of a sudden all topN-s flipped into the 80-90% ranges.

I also noticed that the values of topN depended strongly on the document frequency of terms in the query. For a two-term query, where both terms have average document frequency, the topN values start from ~50% for low MAX_HITS. For a two-term query where one of the terms has a very high document frequency, the topN values start from 0% for low MAX_HITS. See the spreadsheet for details.

Conclusions: more work is needed... ;-)

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Reply via email to