[ 
https://issues.apache.org/jira/browse/LUCENE-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Pohl updated LUCENE-4571:
--------------------------------

    Attachment: LUCENE-4571.patch

With your nice test, I was quickly able to track it down (a >= instead of a == 
fixed the issue). This was also present in the previous implementation, but 
there it was not wrong. New patch attached.
There are still some assertions failing, but for now I'd attribute them to 
problems with the test (are return statements is missing when actual==null?), 
because these assertions only fail within the baseline/expected impl. 
Lucenebench doesn't complain anymore, even when checking scores.

Instead of re-implementing all of the previous basic implementation, do you 
think it would be possible to rather re-use a normal DisjunctionSumScorer and 
wrap it with a Scorer that discards candidates that don't match the 
mm-constraint? This way we would get correct scoring for free. E.g.
{noformat}
        public int nextDoc() throws IOException {
          int docid;
          while (true) {
            docid = wrappedScorer.nextDoc();
            if (docid==NO_MORE_DOCS)
              break;
            // count how many (optional) subscorers match on the same docid, 
and check against constraint
            int cntOpt = 0;
            for (ChildScorer cs : this.subscorers) {
              if (cs.child.docID() == docid && 
"SHOULD".equalsIgnoreCase(cs.relationship))
                cntOpt++;
            }
            if (cntOpt>=mm)
              break;
          }          
          return docid;
        }
{noformat}

I ran the perf test locally on the 1M doc index and got quite nice numbers. As 
we anticipated, the cost-API is very beneficial and the now added per-candidate 
early-stopping (less advance()-calls at the expense of additional 
if-statements) roughly doubles the speedups :)
{noformat}
                    TaskQPS baseline      StdDevQPS my_modified_version      
StdDev                Pct diff
     Low4MinShouldMatch0       36.41      (3.1%)       35.18      (5.1%)   
-3.4% ( -11% -    5%)
     Low3MinShouldMatch0       21.31      (6.6%)       21.18      (3.9%)   
-0.6% ( -10% -   10%)
     Low2MinShouldMatch0       14.54      (7.4%)       14.45      (3.3%)   
-0.6% ( -10% -   10%)
     HighMinShouldMatch0        9.85      (9.1%)        9.82      (2.8%)   
-0.3% ( -11% -   12%)
     Low1MinShouldMatch0       11.68      (8.3%)       11.65      (3.1%)   
-0.3% ( -10% -   12%)
     HighMinShouldMatch2       10.12      (9.4%)       10.95      (2.8%)    
8.2% (  -3% -   22%)
     Low1MinShouldMatch2       12.08      (8.6%)       13.60      (3.2%)   
12.5% (   0% -   26%)
     Low2MinShouldMatch2       15.23      (7.8%)       18.22      (4.2%)   
19.6% (   7% -   34%)
     HighMinShouldMatch3       10.33      (9.2%)       13.21      (3.2%)   
27.9% (  14% -   44%)
     Low3MinShouldMatch2       22.72      (7.3%)       30.85      (5.8%)   
35.8% (  21% -   52%)
     Low1MinShouldMatch3       12.49      (8.3%)       17.92      (4.4%)   
43.5% (  28% -   61%)
     HighMinShouldMatch4       10.55      (9.0%)       18.29      (4.8%)   
73.3% (  54% -   95%)
     Low2MinShouldMatch3       16.08      (7.2%)       30.38      (6.8%)   
88.9% (  69% -  110%)
     Low1MinShouldMatch4       12.84      (7.9%)       32.68      (9.6%)  
154.6% ( 126% -  186%)
     Low4MinShouldMatch2       45.39      (3.2%)      222.71     (16.0%)  
390.6% ( 360% -  423%)
     Low3MinShouldMatch3       24.70      (6.2%)      215.56     (29.8%)  
772.9% ( 693% -  862%)
     Low4MinShouldMatch3       45.71      (3.1%)      636.23     (53.4%) 
1292.0% (1198% - 1390%)
     Low2MinShouldMatch4       16.48      (7.1%)      257.47     (47.6%) 
1462.7% (1314% - 1633%)
     Low4MinShouldMatch4       45.81      (3.3%)      811.82     (60.4%) 
1672.1% (1557% - 1794%)
     Low3MinShouldMatch4       24.87      (6.8%)      768.67    (119.4%) 
2990.2% (2681% - 3344%)
{noformat}
Note that these numbers are not directly comparable to the others in this 
ticket, but they give evidence that the new implementation is never (as far as 
this one query generalizes) slower than the previous one.
                
> speedup disjunction with minShouldMatch 
> ----------------------------------------
>
>                 Key: LUCENE-4571
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4571
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.1
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch, 
> LUCENE-4571.patch, LUCENE-4571.patch, LUCENE-4571.patch
>
>
> even minShouldMatch is supplied to DisjunctionSumScorer it enumerates whole 
> disjunction, and verifies minShouldMatch condition [on every 
> doc|https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java#L70]:
> {code}
>   public int nextDoc() throws IOException {
>     assert doc != NO_MORE_DOCS;
>     while(true) {
>       while (subScorers[0].docID() == doc) {
>         if (subScorers[0].nextDoc() != NO_MORE_DOCS) {
>           heapAdjust(0);
>         } else {
>           heapRemoveRoot();
>           if (numScorers < minimumNrMatchers) {
>             return doc = NO_MORE_DOCS;
>           }
>         }
>       }
>       afterNext();
>       if (nrMatchers >= minimumNrMatchers) {
>         break;
>       }
>     }
>     
>     return doc;
>   }
> {code}
> [~spo] proposes (as well as I get it) to pop nrMatchers-1 scorers from the 
> heap first, and then push them back advancing behind that top doc. For me the 
> question no.1 is there a performance test for minShouldMatch constrained 
> disjunction. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to