Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Andrzej Bialecki Thu, 15 Dec 2005 01:54:04 -0800

Doug Cutting wrote:

Andrzej Bialecki wrote:
I tested it on a 5 mln index.
Thanks, this is great data!

Can you please tell a bit more about the experiments?  In particular:
. How were scores assigned to pages? Link analysis? log(number ofincoming links) or OPIC?



log()

 . How were the queries generated?  From a log or randomly?

Queries have been picked up manually, to test the worst performing casesfrom a real query log.

 . How many queries did you test with?

A couple, not many. This of course needs to be improved, but I didn'twant to spend too much time until we polish the code.

 . When results differed greatly, did they look a lot worse?



Yes. E.g. see the differences for MAX_HITS=10000

My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.
For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e.queries, which executed in e.g. 500 ms now executed in 10-20ms(perfRate=40). Following the intuition, performance drops as weincrease MAX_HITS, until it reaches a more or less original values(perfRate=1) for MAX_HITS=300000 (for a 5 mln doc index). After that,increasing MAX_HITS actually worsens the performance (perfRate << 1)- which can be explained by the fact that the standard HitCollectordoesn't collect as many documents, if they score too low.
This doesn't make sense to me. It should never be slower. We're notactually keeping track of any more hits, only stopping earlier.



Yes, that's puzzling... Please read below.

* Two-term Nutch queries result in complex Lucene BooleanQueries overmany index fields, includng also PhraseQueries. These fared muchworse than single-term queries: actually, the topN values were verylow until MAX_HITS was increased to large values, and then all of asudden all topN-s flipped into the 80-90% ranges.
It would be interesting to try altering the generated query, to see ifit is the phrases or simply multiple terms which cause problems. Todo this, one could hack the query-basic plugin, or simply alter queryboost parameters. This

I actually forgot to write that I don't use any of Nutch code. Early onI decided to eliminate this part in order to get first the rawperformance from Lucene - but still using the Lucene queriescorresponding to translated Nutch queries.

Please see the attached scripts. You will need to put bsh and lucene onyour classpath, as well as compile and add the LimitingCollector classes(it was not possible to implement this inside bsh scripts, due tovagaries of class overloading in BeanShell). You need to customize thepaths to both indexes inside the script. Then run:

java -cp bsh.jar:lucene.jar:LimitedCollector.jar bsh.Interpretertest-opt.sh "+url:information" 10000

The script will iterate 10 times for a warm-up, and then report thetiming, and matching, and topN results. You will get an HTML table withthe first 100 results.

would help us figure out where the optimization is failing. Suel usedmulti-term queries, but not phrases, so we expect that the phrases arecausing the problem, but it would be good to see for certain. We'vealso never tuned Nutch's phrase matching, so it's also possible thatwe may sometimes over-emphasize the phrase component in scores. Forexample, a slop of 10 might give better results and/or be moreamenable to this optimization.

In several installations I use smaller values of slop (around 20-40).But this is motivated by better quality matches, not by performance, soI didn't test for this...

I also noticed that the values of topN depended strongly on thedocument frequency of terms in the query. For a two-term query, whereboth terms have average document frequency, the topN values startfrom ~50% for low MAX_HITS. For a two-term query where one of theterms has a very high document frequency, the topN values start from0% for low MAX_HITS. See the spreadsheet for details.
Were these actually useful queries? For example, I would not beconcerned if results differed greatly for a query like 'to be', sincethat's not a very useful query. Try searching for 'the the' on Google.

No, these were not stop-words. "High DF" is around 0.3, i.e. the termoccurs in 30% documents; but it's still a valid and useful term.


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

import org.apache.lucene.search.*;

public class LimitedCollector extends TopDocCollector {
    private int maxHits;
    
    public LimitedCollector(int numHits, int maxHits) {
      super(maxHits);
      this.maxHits = maxHits;
    }

    public void collect(int doc, float score) {
      if (getTotalHits() >= maxHits)
        throw new LimitExceeded(doc);
      super.collect((int)doc, (float)score);
    }
    public static class LimitExceeded extends RuntimeException {
      public int maxDoc;
      public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }    
    }
}

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import java.util.*;

class RankEntry {
        public float score1;
        public float score2;
        public int hitid1;
        public int hitid2;
        public String expl1;
        public String expl2;
}

// Location of the original index
index1 = "/home/andrzej/segments/index";
// Location of the optimized index
index2 = "/home/andrzej/segments/index-sorted";
// Analyzer to use. Since we create Lucene queries ourself, we use the simplest
a = new SimpleAnalyzer();

ir1 = IndexReader.open(index1);
ir2 = IndexReader.open(index2);
is1 = new IndexSearcher(ir1);
is2 = new IndexSearcher(ir2);
qp = new QueryParser("content", a);
query = qp.parse(bsh.args[0]);
int count = Integer.parseInt(bsh.args[1]);
long start, delta, delta1;
TopDocs topdocs1, topdocs2;
HashMap res = new HashMap();
for (int i = 0; i < 10; i++) {
        start = System.currentTimeMillis();
        topdocs1 = is1.search(query, null, 100);
        delta = System.currentTimeMillis() - start;
        start = System.currentTimeMillis();
        LimitedCollector collector = new LimitedCollector(100, count);
        LimitedCollector.LimitExceeded exc = null;
        try {
                is2.search(query, null, collector);
        } catch (Throwable t) {
                exc = (LimitedCollector.LimitExceeded)t;
        }
        topdocs2 = collector.topDocs();
        if (exc != null) {
                topdocs2.totalHits = (int)(topdocs2.totalHits * (is2.maxDoc() / 
(float)exc.maxDoc));
        }
        delta1 = System.currentTimeMillis() - start;
        System.err.println(i + ". search: orig=" + delta + " optim=" + delta1);
}
        System.out.println("<html><body><table border=1 cellspacing=0 
style='border: 1px gray solid;'>");
        System.out.println("<tr><td colspan=3 bgcolor=#eeeeff>" + i + ". 
search: orig=" + delta + " optim=" + delta1 + "</td></tr>");
//}
        System.out.println("<tr><td colspan=3>total orig: " + ir1.numDocs() + " 
optim: " + ir2.numDocs()+ "</td></tr>");
        System.out.println("<tr><td colspan=3>hits orig: " + topdocs1.totalHits 
+ " optim: " + topdocs2.totalHits + "</td></tr>");
        start = System.currentTimeMillis();
        // get the first 100 hits from searcher1
        for (j = 0; j < 100; j++) {
                sdoc1 = topdocs1.scoreDocs[j];
                id1 = sdoc1.doc;
                doc1 = ir1.document(id1);
                score1 = sdoc1.score;
                key1 = doc1.get("docNo");
                key1 += "-" + doc1.get("segment");
                RankEntry re = (RankEntry)res.get(key1);
                if (re == null) {
                        re = new RankEntry();
                        res.put(key1, re);
                }
                re.hitid1 = j + 1; // 0 is special, means "no match"
                re.score1 = score1;
                re.expl1 = is1.explain(query, id1).toHtml();
        }
        // get the first 100 hits from searcher2, and try to match
        for (j = 0; j < 100; j++) {
                sdoc2 = topdocs2.scoreDocs[j];
                id2 = sdoc2.doc;
                doc2 = ir2.document(id2);
                score2 = sdoc2.score;
                key2 = doc2.get("docNo");
                key2 += "-" + doc2.get("segment");
                RankEntry re = (RankEntry)res.get(key2);
                //if (re == null) {
                //      re = new RankEntry();
                //      res.put(key2, re);
                //}
                if (re == null) continue;
                re.hitid2 = j + 1; // 0 is special
                re.score2 = score2;
                re.expl2 = is2.explain(query, id2).toHtml();
        }
        
System.out.println("<tr><th>No.</th><th>ORIGINAL</th><th>OPTIMIZED</th></tr>");
        int top10, top50, top100;
        top10 = 0; top50 = 0; top100 = 0;
        // this could be done by sorting by re.hitid1, but I'm lazy...
        for (int i = 0; i < 100; i++) {
                sdoc1 = topdocs1.scoreDocs[i];
                id1 = sdoc1.doc;
                doc1 = ir1.document(id1);
                key1 = doc1.get("docNo");
                key1 += "-" + doc1.get("segment");
                RankEntry re = (RankEntry)res.get(key1);
                System.out.println("<tr valign=top><td>" + i + "<br/>" + key1 + 
"</td><td>score=" + re.score1 + " pos=" + re.hitid1 + "</td><td>");
                if (re.hitid2 == 0) {
                        System.out.println("---");
                } else {
                        System.out.println("pos=" + re.hitid2 + " score=" + 
re.score2 + "<br/>dpos=" + (re.hitid2 - re.hitid1) + " dscore=" + (re.score2 - 
re.score1));
                        if (i < 10) top10++;
                        if (i < 50) top50++;
                        if (i < 100) top100++;
                }
                System.out.println("</td></tr>");
        }
        System.out.println("<tr><th colspan=3 bgcolor=#eeeeff>TOP-N</th></tr>");
        System.out.println("<tr>");
        System.out.println("<td>top10: " + (top10 * 100 / 10) + " %</td>");
        System.out.println("<td>top50: " + (top50 * 100 / 50) + " %</td>");
        System.out.println("<td>top100: " + (top100 * 100 / 100) + " %</td>");
        System.out.println("</tr>");
/*
        System.out.println("<tr><th colspan=3 
bgcolor=#eeeeff>EXPLANATIONS</th></tr>");
        
System.out.println("<tr><th>No.</th><th>ORIGINAL</th><th>OPTIMIZED</th></tr>");
        for (int i = 0; i < 10; i++) {
                sdoc1 = topdocs1.scoreDocs[i];
                id1 = sdoc1.doc;
                doc1 = ir1.document(id1);
                key1 = doc1.get("docNo");
                key1 += "-" + doc1.get("segment");
                RankEntry re = (RankEntry)res.get(key1);
                System.out.println("<tr valign=top><td>" + i + "<br/>" + key1 + 
"</td><td>score=" + re.score1 + " pos=" + re.hitid1 + "<br/>" + re.expl1 + 
"</td><td>");
                if (re.hitid2 == 0) {
                        System.out.println("---");
                } else {
                        System.out.println("pos=" + re.hitid2 + " score=" + 
re.score2 + "<br/>" + re.expl2 + "<br/>dpos=" + (re.hitid2 - re.hitid1) + " 
dscore=" + (re.score2 - re.score1));
                }
                System.out.println("</td></tr>");
        }
*/
        System.out.println("</table></body></html>");
is1.close();
is2.close();
ir1.close();
ir2.close();

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Reply via email to