[ https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Pohl updated LUCENE-4571:
--------------------------------

    Attachment: LUCENE-4571.patch

Robert, thank you for the excellent feedback. I haven't looked at BS in detail for a while, and everything you say sounds very reasonable.

My statement "improvement by a factor" was under the assumption that BS, like BS2, would generate all OR-candidates and only afterwards screen many of them out again due to the minimum-match constraint. If that's the case, and we assume BS is faster than BS2 by some factor, then it will be faster by that same factor on a larger collection, whereas an optimized BS2 might become faster than BS because it does not generate the many useless candidates that arise for queries with only one super-common term (whose frequency grows proportionally to collection size) and a minimum-match constraint of at least 2. I hope this makes sense now, but of course my assumption might be wrong :)

In fact, I thought quite a bit about different approaches to implementing a minimum-match-optimized version of BS2 and converged on implementation approach 1), because the others add further expensive operations/overhead once you get down to the details.

The attached patch contains a full drop-in replacement for DisjunctionSumScorer in DisjunctionSumScorerMM.java and changes the corresponding references within BooleanScorer2. All existing tests pass.

Since this scorer is supposed to work with arbitrary subclauses (not only TermQuery), I chose an implementation that dynamically orders the first mm-1 subscorers by their next docid, thereby exploiting local within-inverted-list docid distributions. Fixing the mm-1 subscorers based on their document frequency/sparseness estimate could be better (fewer heap operations, but no exploitation of within-list docid distributions), but that estimate is currently only available for TermQuery clauses and would therefore limit the applicability of the implementation.
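To illustrate, the skipping idea can be modeled on plain sorted int[] posting lists (a toy stand-in for subscorers, not the patch's actual code): instead of enumerating every OR-candidate, take the mm-th smallest current docid as the candidate and advance the lagging iterators up to it, so docs that cannot possibly reach mm matches are never fully evaluated.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a min-should-match disjunction over sorted docid lists.
// Each int[] stands in for a subscorer's postings; NO_MORE_DOCS mirrors
// Lucene's exhaustion sentinel.
public class MinShouldMatchSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Returns every doc matched by at least mm of the given sorted lists,
    // skipping candidates that cannot reach mm instead of enumerating the
    // full disjunction.
    static List<Integer> matches(int[][] lists, int mm) {
        int n = lists.length;
        int[] pos = new int[n];          // cursor into each list
        List<Integer> out = new ArrayList<>();
        while (true) {
            // Current docid of each sub-iterator (sentinel when exhausted).
            int[] cur = new int[n];
            for (int i = 0; i < n; i++) {
                cur[i] = pos[i] < lists[i].length ? lists[i][pos[i]] : NO_MORE_DOCS;
            }
            int[] sorted = cur.clone();
            Arrays.sort(sorted);
            // The mm-th smallest docid is the earliest doc that can still
            // collect mm matches; everything before it is screened out.
            int candidate = sorted[mm - 1];
            if (candidate == NO_MORE_DOCS) break;
            int matched = 0;
            for (int i = 0; i < n; i++) {
                // Advance lagging iterators up to the candidate.
                while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate) pos[i]++;
                if (pos[i] < lists[i].length && lists[i][pos[i]] == candidate) matched++;
            }
            if (matched >= mm) out.add(candidate);
            // Step past the candidate on every list positioned on it.
            for (int i = 0; i < n; i++) {
                if (pos[i] < lists[i].length && lists[i][pos[i]] == candidate) pos[i]++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] lists = { {1, 3, 5, 9}, {3, 5, 7}, {5, 9, 11} };
        System.out.println(matches(lists, 2)); // prints [3, 5, 9]
    }
}
{code}

The real patch keeps the subscorers in a heap and rebalances after advancing, which avoids the full sort done here; this sketch only demonstrates why a single super-common term with mm >= 2 no longer forces a candidate per posting.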
Having an API to determine the sparseness of a subscorer has already come up in other tickets, and many other Scorer implementations could similarly benefit from its availability. I do, however, share your concern about intermingling too many aspects within one Scorer, making it overly complex and probably less amenable to VM optimizations (e.g. less tight loops). That is why I implemented it in a separate class: you could go ahead and remove the mm-constraint from DisjunctionSumScorer and use either implementation depending on the given query.

This implementation could also still be benchmarked for response time over representative queries of interest and different mm-constraint settings (luceneutil?). I wouldn't be surprised if my implementation is not as quick as DisjunctionSumScorer when mm=1, but I have anecdotal evidence that it does a great job (speedups of 2-3x) for higher mm (and longer queries). I don't quite see why the attached implementation would do more heap operations, but I agree that it could be slower due to longer/more complex loops, a few more if statements, etc.

Hope this contribution helps.
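The "use either implementation depending on the given query" idea could be sketched as a small dispatch step at scorer-construction time. Everything below is illustrative, not the patch's or Lucene's actual API; the mm == numClauses branch (rewriting to a conjunction) is an extra assumption of mine, not part of the patch.

{code}
// Hypothetical dispatch sketch: keep the plain disjunction scorer free of
// the minimum-match constraint and pick an implementation per query.
public class ScorerDispatch {
    static String pickScorer(int numClauses, int minShouldMatch) {
        // mm <= 1 degenerates to a plain disjunction; the tight
        // DisjunctionSumScorer loop should stay fastest there.
        if (minShouldMatch <= 1) return "DisjunctionSumScorer";
        // mm == numClauses is effectively a conjunction; a conjunction
        // scorer could skip even more aggressively (assumption).
        if (minShouldMatch >= numClauses) return "ConjunctionScorer";
        // Otherwise use the min-should-match-optimized variant.
        return "DisjunctionSumScorerMM";
    }

    public static void main(String[] args) {
        System.out.println(pickScorer(4, 1)); // prints DisjunctionSumScorer
        System.out.println(pickScorer(4, 2)); // prints DisjunctionSumScorerMM
        System.out.println(pickScorer(4, 4)); // prints ConjunctionScorer
    }
}
{code}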
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-4571.patch
>
>
> Even when minShouldMatch is supplied to DisjunctionSumScorer, it enumerates the whole
> disjunction and verifies the minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
>   assert doc != NO_MORE_DOCS;
>   while (true) {
>     while (subScorers[0].docID() == doc) {
>       if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>         heapAdjust(0);
>       } else {
>         heapRemoveRoot();
>         if (numScorers < minimumNrMatchers) {
>           return doc = NO_MORE_DOCS;
>         }
>       }
>     }
>     afterNext();
>     if (nrMatchers >= minimumNrMatchers) {
>       break;
>     }
>   }
>   return doc;
> }
> {code}
> [~spo] proposes (as far as I understand it) to first pop nrMatchers-1 scorers from the
> heap, and then push them back after advancing them beyond that top doc. For me,
> question no. 1 is: is there a performance test for minShouldMatch-constrained
> disjunction?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira