[
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Pohl updated LUCENE-4571:
--------------------------------
Attachment: LUCENE-4571.patch
With your nice test, I was quickly able to track it down (a >= instead of a ==
fixed the issue). This was also present in the previous implementation, but
there it was not wrong. New patch attached.
There are still some assertions failing, but for now I'd attribute them to
problems with the test (are return statements is missing when actual==null?),
because these assertions only fail within the baseline/expected impl.
Lucenebench doesn't complain anymore, even when checking scores.
Instead of re-implementing all of the previous basic implementation, do you
think it would be possible to rather re-use a normal DisjunctionSumScorer and
wrap it with a Scorer that discards candidates that don't match the
mm-constraint? This way we would get correct scoring for free. E.g.
{noformat}
public int nextDoc() throws IOException {
int docid;
while (true) {
docid = wrappedScorer.nextDoc();
if (docid==NO_MORE_DOCS)
break;
// count how many (optional) subscorers match on the same docid,
and check against constraint
int cntOpt = 0;
for (ChildScorer cs : this.subscorers) {
if (cs.child.docID() == docid &&
"SHOULD".equalsIgnoreCase(cs.relationship))
cntOpt++;
}
if (cntOpt>=mm)
break;
}
return docid;
}
{noformat}
I ran the perf test locally on the 1M doc index and got quite nice numbers. As
we anticipated, the cost-API is very beneficial and the now added per-candidate
early-stopping (less advance()-calls at the expense of additional
if-statements) roughly doubles the speedups :)
{noformat}
TaskQPS baseline StdDevQPS my_modified_version
StdDev Pct diff
Low4MinShouldMatch0 36.41 (3.1%) 35.18 (5.1%)
-3.4% ( -11% - 5%)
Low3MinShouldMatch0 21.31 (6.6%) 21.18 (3.9%)
-0.6% ( -10% - 10%)
Low2MinShouldMatch0 14.54 (7.4%) 14.45 (3.3%)
-0.6% ( -10% - 10%)
HighMinShouldMatch0 9.85 (9.1%) 9.82 (2.8%)
-0.3% ( -11% - 12%)
Low1MinShouldMatch0 11.68 (8.3%) 11.65 (3.1%)
-0.3% ( -10% - 12%)
HighMinShouldMatch2 10.12 (9.4%) 10.95 (2.8%)
8.2% ( -3% - 22%)
Low1MinShouldMatch2 12.08 (8.6%) 13.60 (3.2%)
12.5% ( 0% - 26%)
Low2MinShouldMatch2 15.23 (7.8%) 18.22 (4.2%)
19.6% ( 7% - 34%)
HighMinShouldMatch3 10.33 (9.2%) 13.21 (3.2%)
27.9% ( 14% - 44%)
Low3MinShouldMatch2 22.72 (7.3%) 30.85 (5.8%)
35.8% ( 21% - 52%)
Low1MinShouldMatch3 12.49 (8.3%) 17.92 (4.4%)
43.5% ( 28% - 61%)
HighMinShouldMatch4 10.55 (9.0%) 18.29 (4.8%)
73.3% ( 54% - 95%)
Low2MinShouldMatch3 16.08 (7.2%) 30.38 (6.8%)
88.9% ( 69% - 110%)
Low1MinShouldMatch4 12.84 (7.9%) 32.68 (9.6%)
154.6% ( 126% - 186%)
Low4MinShouldMatch2 45.39 (3.2%) 222.71 (16.0%)
390.6% ( 360% - 423%)
Low3MinShouldMatch3 24.70 (6.2%) 215.56 (29.8%)
772.9% ( 693% - 862%)
Low4MinShouldMatch3 45.71 (3.1%) 636.23 (53.4%)
1292.0% (1198% - 1390%)
Low2MinShouldMatch4 16.48 (7.1%) 257.47 (47.6%)
1462.7% (1314% - 1633%)
Low4MinShouldMatch4 45.81 (3.3%) 811.82 (60.4%)
1672.1% (1557% - 1794%)
Low3MinShouldMatch4 24.87 (6.8%) 768.67 (119.4%)
2990.2% (2681% - 3344%)
{noformat}
Note that these numbers are not directly comparable to the others in this
ticket, but they give evidence that the new implementation is never (as far as
this one query generalizes) slower than the previous one.
> speedup disjunction with minShouldMatch
> ----------------------------------------
>
> Key: LUCENE-4571
> URL: https://issues.apache.org/jira/browse/LUCENE-4571
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 4.1
> Reporter: Mikhail Khludnev
> Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch,
> LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch
>
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole
> disjunction, and verifies minShouldMatch condition [on every
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
> public int nextDoc() throws IOException {
> assert doc != NO_MORE_DOCS;
> while(true) {
> while (subScorers[0].docID() == doc) {
> if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
> heapAdjust(0);
> } else {
> heapRemoveRoot();
> if (numScorers < minimumNrMatchers) {
> return doc = NO_MORE_DOCS;
> }
> }
> }
> afterNext();
> if (nrMatchers >= minimumNrMatchers) {
> break;
> }
> }
>
> return doc;
> }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the
> heap first, and then push them back advancing behind that top doc. For me the
> question no.1 is there a performance test for minShouldMatch constrained
> disjunction.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]