[jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch

Stefan Pohl (JIRA) Mon, 26 Nov 2012 13:51:00 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504158#comment-13504158
 ]


Stefan Pohl commented on LUCENE-4571:
-------------------------------------

Mikhail, Robert, thank you for following up on this and I'm happy to provide 
more details on ideas of how to use the minimumMatch constraint to speed up the 
execution of disjunctive queries.

Currently, one heap is used in disjunctive scorers to keep track of the next 
docid (next candidate), and only then these candidates are ruled out if below 
minimum number of terms / percentage of terms. If there is only one stop word 
or otherwise high docfreq term in the query, this will produce a huge number of 
candidates that only occur in very few query terms and only afterwards will be 
ruled out by the constraint again. As Robert pointed out, this leads to long 
processing times to attain the next valid candidate. Using BooleanScorer will 
probably only give improvement by a factor (in some cases), but also this 
scorers would still generate all these useless candidates.
It would be much more efficient to actively use the constraint as part of query 
processing and employ inverted list skipping (leap-frogging), which makes 
conjunctive queries so efficient, also for these kind of queries.

An efficient implementation could look like the following:
Assume at least 5 out of 20 query terms have to match and imagine the 
terms/inverted lists to be sorted by the next docid. Then, the first potential 
candidate that satisfies the constraint would be the docid at position 5 and 
all of the inverted list at position 0-4 can be safely advanced and have to 
match to that docid in order to generate a real candidate worth scoring. The 
order, in which to try advancing terms 0-4 should probably be guided by the 
sparsity of the lists (guided by either docfreq, if available, or a heuristic 
such as the distance to the next docid), that is, inverted list 4 should first 
be advanced to the docid, then list 3,2,1, if previous advances are successful. 
Otherwise, the algorithm can start all over and try the next docid at 5th 
position. This leads to effective leap-frogging on the most sparse lists within 
the constraint which rules out matches not satisfying the constraint, most of 
the time without even touching the very high-frequency terms. Improvements all 
depend on the type of queries and number of docs per index.

There are different implementations possible to achieve such a behavior:
1) Using the current heap and pull out the first 5 lists to get to the next 
candidate docid to which the other 4 then have to be advanced. This probably 
leads to more heap removal and addition operations than necessary with the next 
approaches.
2) Always keep the first 5 lists fully sorted and use the heap to keep track of 
the next docid within lists 6-20. This invariant should simplify implementation 
and be competitive for small minimumMatch.
3) Use two heaps: A max-heap to keep track of the largest docid within the 
first 5 inverted lists and a min-heap to keep track of the smallest docid 
within lists 6-20. This approach looks most efficient.

These implementations look like generalizations to pure disjunctive queries, 
but if there should be any impact they could be implemented as specializations 
used only for queries with a minimum match constraint.
                
> speedup disjunction with minShouldMatch 
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole 
> disjunction, and verifies minShouldMatch condition [on every 
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
>   public int nextDoc() throws IOException {
>     assert doc != NO_MORE_DOCS;
>     while(true) {
>       while (subScorers[0].docID() == doc) {
>         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>           heapAdjust(0);
>         } else {
>           heapRemoveRoot();
>           if (numScorers < minimumNrMatchers) {
>             return doc = NO_MORE_DOCS;
>           }
>         }
>       }
>       afterNext();
>       if (nrMatchers >= minimumNrMatchers) {
>         break;
>       }
>     }
>     
>     return doc;
>   }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the 
> heap first, and then push them back advancing behind that top doc. For me the 
> question no.1 is there a performance test for minShouldMatch constrained 
> disjunction. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4571) speedup disjunction with minShouldMatch

Reply via email to