[
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504158#comment-13504158
]
Stefan Pohl commented on LUCENE-4571:
-------------------------------------
Mikhail, Robert, thank you for following up on this and I'm happy to provide
more details on ideas of how to use the minimumMatch constraint to speed up the
execution of disjunctive queries.
Currently, one heap is used in disjunctive scorers to keep track of the next
docid (next candidate), and only then these candidates are ruled out if below
minimum number of terms / percentage of terms. If there is only one stop word
or otherwise high docfreq term in the query, this will produce a huge number of
candidates that only occur in very few query terms and only afterwards will be
ruled out by the constraint again. As Robert pointed out, this leads to long
processing times to attain the next valid candidate. Using BooleanScorer will
probably only give improvement by a factor (in some cases), but also this
scorers would still generate all these useless candidates.
It would be much more efficient to actively use the constraint as part of query
processing and employ inverted list skipping (leap-frogging), which makes
conjunctive queries so efficient, also for these kind of queries.
An efficient implementation could look like the following:
Assume at least 5 out of 20 query terms have to match and imagine the
terms/inverted lists to be sorted by the next docid. Then, the first potential
candidate that satisfies the constraint would be the docid at position 5 and
all of the inverted list at position 0-4 can be safely advanced and have to
match to that docid in order to generate a real candidate worth scoring. The
order, in which to try advancing terms 0-4 should probably be guided by the
sparsity of the lists (guided by either docfreq, if available, or a heuristic
such as the distance to the next docid), that is, inverted list 4 should first
be advanced to the docid, then list 3,2,1, if previous advances are successful.
Otherwise, the algorithm can start all over and try the next docid at 5th
position. This leads to effective leap-frogging on the most sparse lists within
the constraint which rules out matches not satisfying the constraint, most of
the time without even touching the very high-frequency terms. Improvements all
depend on the type of queries and number of docs per index.
There are different implementations possible to achieve such a behavior:
1) Using the current heap and pull out the first 5 lists to get to the next
candidate docid to which the other 4 then have to be advanced. This probably
leads to more heap removal and addition operations than necessary with the next
approaches.
2) Always keep the first 5 lists fully sorted and use the heap to keep track of
the next docid within lists 6-20. This invariant should simplify implementation
and be competitive for small minimumMatch.
3) Use two heaps: A max-heap to keep track of the largest docid within the
first 5 inverted lists and a min-heap to keep track of the smallest docid
within lists 6-20. This approach looks most efficient.
These implementations look like generalizations to pure disjunctive queries,
but if there should be any impact they could be implemented as specializations
used only for queries with a minimum match constraint.
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
> Key: LUCENE-4571
> URL: https://issues.apache.org/jira/browse/LUCENE-4571
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.1
> Reporter: Mikhail Khludnev
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole
> disjunction, and verifies minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
> assert doc != NO_MORE_DOCS;
> while(true) {
> while (subScorers[0].docID() == doc) {
> if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
> heapAdjust(0);
> } else {
> heapRemoveRoot();
> if (numScorers < minimumNrMatchers) {
> return doc = NO_MORE_DOCS;
> }
> }
> }
> afterNext();
> if (nrMatchers >= minimumNrMatchers) {
> break;
> }
> }
>
> return doc;
> }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the
> heap first, and then push them back advancing behind that top doc. For me the
> question no.1 is there a performance test for minShouldMatch constrained
> disjunction.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]