Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Doug Cutting Thu, 15 Dec 2005 10:00:37 -0800

Andrzej Bialecki wrote:

Doug Cutting wrote:
The graph just shows that they differ, not how much better or worse they are, since the baseline is not perfect. When the top-10 is 50% different, are those 5 different hits markedly worse matches to your eye than the five they've displaced, or are they comparable? That's what really matters.
Hmm. I'm not sure I agree with this. Your reasoning would be true if we were changing the ranking formula. But the goal IMHO with these patches is to return equally complete results, using the same ranking formula.

But we should not assume that the ranking formula is perfect. Imagine a case where the high-order bits of the score are correct and the low-order bits are random. Then an optimization which changes local orderings does not actually affect result quality.

I specifically avoided using normalized scores, instead using the absolute scores in TopDocs. And the absolute scores in both cases are exactly the same, for those results that are present.
What is wrong is that some results that should be there (judging by the ranking) are simply missing. So, it's about the recall, and the baseline index gives the best estimate.

Yes, this optimization, by definition, hurts recall. The only question is whether it substantially hurts relevance at, e.g., 10 hits. If the top-10 are identical then the answer is easy: no, it does not. But if they differ, we can only answer this by looking at results. Chances are they're worse, but how much? Radically? Slightly? Noticeably?
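
One rough way to put a number on "how much" before eyeballing the hits is to compare the two top-10 lists directly: what fraction of the baseline hits survive, and how far the scores of the replacement hits fall below those of the hits they displaced. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical (not part of Lucene or Nutch), it takes plain doc-id and score arrays rather than TopDocs, and it assumes doc ids are unique within each list and that absolute scores are comparable across the two indexes.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for comparing two top-k result lists. */
final class TopKCompare {

    private static Set<Integer> toSet(int[] docs) {
        Set<Integer> set = new HashSet<>();
        for (int doc : docs) {
            set.add(doc);
        }
        return set;
    }

    /** Fraction of baseline doc ids that also appear in the optimized list. */
    static double overlap(int[] baseDocs, int[] optDocs) {
        if (baseDocs.length == 0) {
            return 1.0; // nothing to lose
        }
        Set<Integer> inOpt = toSet(optDocs);
        int common = 0;
        for (int doc : baseDocs) {
            if (inOpt.contains(doc)) {
                common++;
            }
        }
        return (double) common / baseDocs.length;
    }

    /**
     * Mean score of the baseline hits that were displaced, minus the mean
     * score of the hits that replaced them. Returns 0 when either group is empty.
     */
    static double meanScoreDrop(int[] baseDocs, float[] baseScores,
                                int[] optDocs, float[] optScores) {
        Set<Integer> inBase = toSet(baseDocs);
        Set<Integer> inOpt = toSet(optDocs);
        double displacedSum = 0.0;
        double replacementSum = 0.0;
        int displaced = 0;
        int replacements = 0;
        for (int i = 0; i < baseDocs.length; i++) {
            if (!inOpt.contains(baseDocs[i])) {
                displacedSum += baseScores[i];
                displaced++;
            }
        }
        for (int i = 0; i < optDocs.length; i++) {
            if (!inBase.contains(optDocs[i])) {
                replacementSum += optScores[i];
                replacements++;
            }
        }
        if (displaced == 0 || replacements == 0) {
            return 0.0;
        }
        return displacedSum / displaced - replacementSum / replacements;
    }
}
```

A small drop relative to the spread of scores in the top-10 would suggest the replacements are comparable; a large one would suggest they are markedly worse. Neither number replaces looking at the actual results.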

What part of Nutch are you trying to avoid? Perhaps you could try measuring your Lucene-only benchmark against a Nutch-based one. If they don't differ markedly then you can simply use Nutch, which makes it a stronger benchmark. If they differ, then we should figure out why.
Again, I don't see it this way. Nutch results will always be worse than pure Lucene, because of the added layers. If I can't improve the performance in Lucene code (which takes > 85% time for every query) then no matter how well optimized Nutch code is it won't get far.

But we're mostly modifying Nutch's use of Lucene, not modifying Lucene. So measuring Lucene alone won't tell you everything, and you'll keep having to port Nutch stuff. If you want to, e.g., replay a large query log to measure average performance, then you'll need things like auto-filterization, n-grams, query plugins, etc., no?

In several installations I use smaller values of slop (around 20-40). But this is motivated by better quality matches, not by performance, so I didn't test for this...
But that's a great reason to test for it! If lower slop can improve result quality, then we should certainly see if it also makes optimizations easier.
I forgot to mention this - the tests I ran already used the smaller values: the slop was set to 20.

Are they different if the slop is Integer.MAX_VALUE? It would be really good to determine what causes results to diverge, whether it is multiple terms (probably not), phrases (probably) and/or slop (perhaps). Chances are that the divergence is bad, that results are adversely affected, and that we need to try to fix it. But to do so we'll need to understand it.

That's another advantage of using Lucene directly in this script - you can provide any query structure on the command-line without changing the code in Nutch.

But that just means that we should set the SLOP constant in BasicQueryFilter.java from a configuration property, and permit the setting of configuration properties from the command line, no?
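
Roughly, that could look like the sketch below. It is not the actual Nutch code: java.util.Properties stands in for Nutch's configuration class, the property name "query.phrase.slop" and the class SlopConfig are made up for illustration, and the Integer.MAX_VALUE default is an assumption.

```java
import java.util.Properties;

/** Hypothetical sketch: read the phrase slop from a property with command-line overrides. */
final class SlopConfig {

    /** Hypothetical property name; not an existing Nutch property. */
    static final String SLOP_PROPERTY = "query.phrase.slop";

    /** Assumed default when the property is not set. */
    static final int DEFAULT_SLOP = Integer.MAX_VALUE;

    /** Copies every "name=value" argument into props; other arguments are ignored. */
    static void applyOverrides(Properties props, String[] args) {
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq > 0) {
                props.setProperty(arg.substring(0, eq), arg.substring(eq + 1));
            }
        }
    }

    /** Returns the configured slop, or DEFAULT_SLOP if the property is absent. */
    static int getSlop(Properties props) {
        String value = props.getProperty(SLOP_PROPERTY);
        return value == null ? DEFAULT_SLOP : Integer.parseInt(value.trim());
    }
}
```

With something like this, a benchmark run could pass query.phrase.slop=20 on the command line and the query filter would pick it up via getSlop instead of a hard-coded constant.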


Doug
