[ https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612467#comment-13612467 ]
Stefan Pohl commented on LUCENE-4872:
-------------------------------------

Thanks, Mike, this behaves as expected.

Now we have a sense of the trade-off we'd be making if we agree on the current model. It is still a hard decision though, entailing questions like:
- Does it matter that queries that are slow anyway got 2-3 times slower?
- Are those queries representative of what users do?

A few suggestions for a better model that maybe go beyond the scope of this ticket:

A very conservative usage rule for MSMSumScorer would be to use it only if the constraint is at least one higher than the number of high-freq terms; then it will always "kick butt" and we'd get most of the bang from this scorer without any slow-downs. But we'd miss out on many cases where it would be faster, and those might be the ones that users actually hit in practice. It is also not clear (to me :-) what 'high-freq' means. If at all, this should be seen relative to the highest-freq subclause.

More generally, it seems to me the problem we're trying to solve here is identical to computing a cost. If the cost returned by Scorers correlates with execution time, then we could simply call the cost() method on BS and MSMSumScorer and use MSMSumScorer if its cost is significantly below the former (assuming there are no side-effects in doing these calls). So we'd defer the problem to the individual Scorers, which splits the problem up into smaller subproblems, and the Scorers themselves know best about their implementation and behavior.

To make accurate decisions, we probably have to extend the cost-API to return more detailed information to base decision rules on, e.g. an upper bound, a lower bound (to be able to make conservative/speculative decisions) and an estimate, for both the number of returned docs *and* a runtime-correlated cost (in some unit). For instance, MSMSumScorer's overall cost depends on both of the latter and can be split up into the following 2 stages:

1) Candidate generation = heap-based merge of a clause subset, i.e. the same as for DisjSumScorer, but on a clause subset:
   time to generate all docs from the subScorers: correlates with the sum over the costs of the #clauses-(mm-1) least-costly subScorers
   # candidates = [max(...), min(sum(...), maxdoc)], where ... can be either an upper bound, a lower bound or an estimate in between of the #candidates returned by the #clauses-(mm-1) subScorers
   Even for TermScorer, the definitions of these two measures are not identical due to the min(..., maxdoc).

2) Full scoring of candidates:
   time to advance() and decode postings: (mm-1) * # candidates

The costs would still have to be weighted by the relative overhead of 1) heap-merging, 2) advance() + early-stopping; not sure if constants are enough here.

While the scope of this topic seems large (modelling all scorers), I currently don't see a simpler way to make this work reliably for arbitrarily structured queries; think of MSM(subtree1, Disj(MSM(Conj(...)))).

> BooleanWeight should decide how to execute minNrShouldMatch
> -----------------------------------------------------------
>
>                 Key: LUCENE-4872
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4872
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/search
>            Reporter: Robert Muir
>             Fix For: 5.0, 4.3
>
>         Attachments: crazyMinShouldMatch.tasks
>
>
> LUCENE-4571 adds a dedicated document-at-time scorer for minNrShouldMatch
> which can use advance() behind the scenes.
> In cases where you have some really common terms and some rare ones this can
> be a huge performance improvement.
> On the other hand BooleanScorer might still be faster in some cases.
> We should think about what the logic should be here: one simple thing to do
> is to always use the new scorer when minShouldMatch is set: thats where i'm
> leaning.
> But maybe we could have a smarter heuristic too, perhaps based on cost()

--
This message is automatically generated by JIRA.
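As a rough illustration of the two-stage cost model described in the comment above, here is a minimal Python sketch. It is not Lucene code: the function names, the weights `w_merge` and `w_advance`, and the `margin` threshold are all hypothetical, and the BooleanScorer cost is simply assumed to be the sum of all clause costs, which the thread does not state. Stage 2 uses the upper bound on the number of candidates, i.e. the conservative variant.

```python
def msm_cost_estimate(sub_costs, mm, max_doc, w_merge=1.0, w_advance=1.0):
    """Hypothetical two-stage cost estimate for a min-should-match scorer.

    sub_costs: estimated number of docs for each optional clause
    mm:        minimum number of clauses that must match
    max_doc:   number of documents in the index
    Returns (weighted cost, (lower bound, upper bound) on #candidates).
    """
    if not 1 <= mm <= len(sub_costs):
        raise ValueError("mm must be between 1 and the number of clauses")

    # Stage 1: candidates come from a heap-merge of the
    # #clauses-(mm-1) least-costly sub-scorers.
    head = sorted(sub_costs)[: len(sub_costs) - (mm - 1)]
    merge_cost = sum(head)

    # #candidates lies in [max(...), min(sum(...), maxdoc)].
    lower = max(head)
    upper = min(sum(head), max_doc)

    # Stage 2: every candidate is checked via advance() on the
    # remaining mm-1 clauses (upper bound used here: conservative).
    advance_cost = (mm - 1) * upper

    return w_merge * merge_cost + w_advance * advance_cost, (lower, upper)


def prefer_msm(sub_costs, mm, max_doc, margin=0.5):
    """Pick the MSM scorer only if its cost is well below the baseline.

    ASSUMPTION: the baseline (BooleanScorer) cost is approximated as the
    sum of all clause costs; 'margin' is an arbitrary placeholder.
    """
    bs_cost = sum(sub_costs)
    msm_cost, _ = msm_cost_estimate(sub_costs, mm, max_doc)
    return msm_cost < margin * bs_cost


if __name__ == "__main__":
    costs = [1_000_000, 800_000, 50, 20]  # two common terms, two rare ones
    print(msm_cost_estimate(costs, mm=3, max_doc=2_000_000))
    print(prefer_msm(costs, mm=3, max_doc=2_000_000))
    print(prefer_msm(costs, mm=1, max_doc=2_000_000))
```

With two very common and two rare clauses, mm=3 leaves only the two rare clauses for candidate generation, so the estimate is tiny compared with the baseline. With mm=1 the sketch degenerates to a plain disjunction over all clauses and offers no advantage.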