The BooleanScorer2 code in the svn trunk looks like it has been entirely rewritten over the code that this submission is based upon (which was the 1.4.3 stuff). The techniques used may still be applicable, but would apply within one or more of the specialized scorers instead: DisjunctionScorer, ReqOptScorer, etc. Karl
Paul Elschot <[EMAIL PROTECTED]> wrote: On Sunday 22 May 2005 03:09, Karl Wright wrote: > I've been looking at the BooleanScorer code in 1.4.3 and realized that it has several problems. These are: > > 1) It does things in chunks of 1024 document ids. This means it executes in a time that depends on the number of indexed documents. > 2) Finding the subscorer with the lowest document id scales linearly with the number of scorers (corresponding to clauses in the Boolean query) > 3) It does not implement the skipTo() method, because its technique of doiing 1024 document id's at a time interferes with this. This makes it impossible to use a BooleanScorer within a Conjunction Scorer. > > I've attached a rewritten BooleanScorer which solves these problems. It basically uses a btree to keep the individual subscorers, and it removes subscorers that have reached the end of their documents. It thus removes the dependency on the number of documents indexed, and it performs in O(log(number of clauses)) instead of O(number of clauses). Great. For disjunctions the BooleanScorer in the svn trunk implements skipTo and it uses a priority queue (a heap) so the base of the log is 2. Is the btree faster than that? I had a short look at the code of addToTree, and the btree seems to be unbalanced, so it could theoretically degrade to linear performance, but in practice it could be good. Did you try it with a PrefixQuery or a RangeQuery that expands to a lot of terms? One way to find out about performance of various boolean scorers is to start from the TestDisjunctionPerf1.java here: http://issues.apache.org/bugzilla/show_bug.cgi?id=34193 This might involve some class renaming in order to be able to run under the same class loader. Working in chunks as the 1.4.3 scorer does still has the best performance afaik. The svn trunk also has a test case TestBoolean2 in the test package org.apache.lucene.search. I hope the btree version passes this, it caught a few initial mistakes in the current BooleanScorer2. Still, I think there is a shortage of test cases for the boolean query and scorers. The version posted here: http://issues.apache.org/bugzilla/show_bug.cgi?id=34154 splits off the coordination computations in case they are not needed, which might affect performance also. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com