I tried a variant of the submitted code with another test case that I am using 
for a somewhat different class of problem, which had 15,000 terms, and it 
passed that with much better performance than the current BooleanQuery code.  I 
have not tried it with RangeQuery or PrefixQuery however.
 
The tree is indeed unbalanced; I could not come up with any reasonable scenario 
where it could systematically cause a linearity.  If it did, the performance 
would be comparable to the current BooleanScorer, so it's still an improvement.
 
The performance numbers in this case for a single test are not the only measure 
of the improvement.  It is quite possible that this code will be even slightly 
slower on small test cases.  However, if you put 10 million documents into the 
system, this code will definitely be an improvement in that it does not loop in 
any amount that depends on the number of indexed documents.
 
Karl
 
 


Paul Elschot <[EMAIL PROTECTED]> wrote:
On Sunday 22 May 2005 03:09, Karl Wright wrote:
> I've been looking at the BooleanScorer code in 1.4.3 and realized that it 
has several problems. These are:
> 
> 1) It does things in chunks of 1024 document ids. This means it executes in 
a time that depends on the number of indexed documents.
> 2) Finding the subscorer with the lowest document id scales linearly with 
the number of scorers (corresponding to clauses in the Boolean query)
> 3) It does not implement the skipTo() method, because its technique of 
doiing 1024 document id's at a time interferes with this. This makes it 
impossible to use a BooleanScorer within a Conjunction Scorer.
> 
> I've attached a rewritten BooleanScorer which solves these problems. It 
basically uses a btree to keep the individual subscorers, and it removes 
subscorers that have reached the end of their documents. It thus removes the 
dependency on the number of documents indexed, and it performs in 
O(log(number of clauses)) instead of O(number of clauses).

Great. For disjunctions the BooleanScorer in the svn trunk implements
skipTo and it uses a priority queue (a heap) so the base of the log is 2.
Is the btree faster than that?
I had a short look at the code of addToTree, and the btree seems to be
unbalanced, so it could theoretically degrade to linear performance,
but in practice it could be good.
Did you try it with a PrefixQuery or a RangeQuery that expands to a lot
of terms?

One way to find out about performance of various boolean scorers
is to start from the TestDisjunctionPerf1.java here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34193
This might involve some class renaming in order to be able to
run under the same class loader.

Working in chunks as the 1.4.3 scorer does still has the best
performance afaik.

The svn trunk also has a test case TestBoolean2 in the test
package org.apache.lucene.search.
I hope the btree version passes this, it caught a few initial mistakes
in the current BooleanScorer2.
Still, I think there is a shortage of test cases for the boolean query and
scorers.

The version posted here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34154
splits off the coordination computations in case they are not
needed, which might affect performance also.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Reply via email to