Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more complete version, still mostly untested. This should
make searches faster. We'll see how good the results are...
This includes a patch to Lucene to make it easier to write hit
collectors that collect TopDocs.
I'll test this on a 38M document index tomorrow.
I tested it on a 5 million document index.
The original index is considered the "baseline", i.e. it provides the
normative values for scoring and ranking. Its results are compared to
results from the optimized index, recording the scores and positions of
hits in each. Finally, these two hit lists are matched, and relative
differences in scoring and ranking are calculated.
At the end, I calculate the top10, top50 and top100 measures, defined as
the percentage of the top-N hits from the optimized index which match
the top-N hits from the baseline index. Ideally, all these measures
should be 100%, i.e. all top-N hits from the optimized index should
match the corresponding top-N hits from the baseline index.
One variable which greatly affects both the recall and the performance
is the maximum number of hits considered by the TopDocCollector. In my
tests I used values between 1,000 and 500,000 (the latter representing
1/10th of the full index in my case).
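The reason MAX_HITS trades recall for speed is that the sorted index
stores documents in descending order of boost, so a collector that
stops after MAX_HITS postings still sees the highest-boost candidates
first. A minimal sketch of that early-termination idea follows; the
class and method names are illustrative, not the actual Nutch/Lucene
API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hit collector with a MAX_HITS cutoff.
// Because documents arrive in descending boost order in the sorted
// index, stopping early still keeps the most promising candidates.
class LimitedCollector {
    private final int maxHits;
    private final List<Integer> docs = new ArrayList<>();

    LimitedCollector(int maxHits) {
        this.maxHits = maxHits;
    }

    // Returns false once MAX_HITS documents have been collected,
    // signalling the search loop to terminate early.
    boolean collect(int doc) {
        if (docs.size() >= maxHits) return false;
        docs.add(doc);
        return true;
    }

    List<Integer> hits() {
        return docs;
    }
}
```

With a small MAX_HITS the loop touches only a fraction of the postings,
which is where the reported 40-fold speedup comes from; the cost is that
low-boost documents which would have scored well on the query terms are
never considered.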
Now, the results. I collected all test results in a spreadsheet
(OpenDocument or PDF format), you can download it from:
http://www.getopt.org/nutch/20051214/nutchPerf.ods
http://www.getopt.org/nutch/20051214/nutchPerf.pdf
For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e.
queries which executed in e.g. 500 ms now executed in 10-20 ms
(perfRate=40). As intuition suggests, performance drops as MAX_HITS
increases, until it reaches more or less the original level (perfRate=1)
at MAX_HITS=300,000 (for a 5 million document index). Beyond that,
increasing MAX_HITS actually worsens the performance (perfRate << 1),
which can be explained by the fact that the standard HitCollector
doesn't collect as many documents if they score too low.
* Single-term Nutch queries (i.e. those which do not produce Lucene
PhraseQueries) yield relatively good topN values, even for relatively
small values of MAX_HITS; however, MAX_HITS=1000 yields topN=0% across
the board. The minimum useful value for my index was MAX_HITS=10000
(perfRate=30), which yields a quite acceptable top10=90%, but less
acceptable top50 and top100. Please see the spreadsheet for details.
* Two-term Nutch queries result in complex Lucene BooleanQueries over
many index fields, including PhraseQueries. These fared much worse
than single-term queries: the topN values stayed very low until
MAX_HITS was increased to large values, at which point all topN-s
suddenly flipped into the 80-90% range.
I also noticed that the values of topN depended strongly on the document
frequency of terms in the query. For a two-term query, where both terms
have average document frequency, the topN values start from ~50% for low
MAX_HITS. For a two-term query where one of the terms has a very high
document frequency, the topN values start from 0% for low MAX_HITS. See
the spreadsheet for details.
Conclusions: more work is needed... ;-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com