I tried a variant of the submitted code with another test case that I am using for a somewhat different class of problem, which had 15,000 terms, and it passed that with much better performance than the current BooleanQuery code. I have not tried it with RangeQuery or PrefixQuery however. The tree is indeed unbalanced; I could not come up with any reasonable scenario where it could systematically cause a linearity. If it did, the performance would be comparable to the current BooleanScorer, so it's still an improvement. The performance numbers in this case for a single test are not the only measure of the improvement. It is quite possible that this code will be even slightly slower on small test cases. However, if you put 10 million documents into the system, this code will definitely be an improvement in that it does not loop in any amount that depends on the number of indexed documents. Karl
Paul Elschot <[EMAIL PROTECTED]> wrote: On Sunday 22 May 2005 03:09, Karl Wright wrote: > I've been looking at the BooleanScorer code in 1.4.3 and realized that it has several problems. These are: > > 1) It does things in chunks of 1024 document ids. This means it executes in a time that depends on the number of indexed documents. > 2) Finding the subscorer with the lowest document id scales linearly with the number of scorers (corresponding to clauses in the Boolean query) > 3) It does not implement the skipTo() method, because its technique of doiing 1024 document id's at a time interferes with this. This makes it impossible to use a BooleanScorer within a Conjunction Scorer. > > I've attached a rewritten BooleanScorer which solves these problems. It basically uses a btree to keep the individual subscorers, and it removes subscorers that have reached the end of their documents. It thus removes the dependency on the number of documents indexed, and it performs in O(log(number of clauses)) instead of O(number of clauses). Great. For disjunctions the BooleanScorer in the svn trunk implements skipTo and it uses a priority queue (a heap) so the base of the log is 2. Is the btree faster than that? I had a short look at the code of addToTree, and the btree seems to be unbalanced, so it could theoretically degrade to linear performance, but in practice it could be good. Did you try it with a PrefixQuery or a RangeQuery that expands to a lot of terms? One way to find out about performance of various boolean scorers is to start from the TestDisjunctionPerf1.java here: http://issues.apache.org/bugzilla/show_bug.cgi?id=34193 This might involve some class renaming in order to be able to run under the same class loader. Working in chunks as the 1.4.3 scorer does still has the best performance afaik. The svn trunk also has a test case TestBoolean2 in the test package org.apache.lucene.search. I hope the btree version passes this, it caught a few initial mistakes in the current BooleanScorer2. Still, I think there is a shortage of test cases for the boolean query and scorers. The version posted here: http://issues.apache.org/bugzilla/show_bug.cgi?id=34154 splits off the coordination computations in case they are not needed, which might affect performance also. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com