[jira] [Updated] (LUCENE-7641) Speed up point ranges that match most documents

Adrien Grand (JIRA) Wed, 18 Jan 2017 02:35:48 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-7641:
---------------------------------
    Attachment: LUCENE-7641.patch

Here is a patch that does not compute the inverse cost and folds in the two 
ideas that I mentioned above to make point count estimation more accurate.

I also ran a benchmark with luceneutil10m:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
                  Fuzzy1      190.46      (7.7%)      176.23     (12.6%)   
-7.5% ( -25% -   13%)
                 Respell      277.90      (3.6%)      275.11      (3.0%)   
-1.0% (  -7% -    5%)
             MedSpanNear      150.45      (2.5%)      149.27      (3.1%)   
-0.8% (  -6% -    4%)
                Wildcard      112.61      (6.4%)      111.76      (5.9%)   
-0.8% ( -12% -   12%)
           OrNotHighHigh       60.90      (3.6%)       60.51      (4.3%)   
-0.6% (  -8% -    7%)
              AndHighLow     1168.30      (3.9%)     1160.93      (3.8%)   
-0.6% (  -8% -    7%)
               OrHighMed       41.35      (5.1%)       41.16      (5.2%)   
-0.4% ( -10% -   10%)
            OrHighNotMed       69.18      (3.9%)       68.94      (4.1%)   
-0.3% (  -8% -    8%)
           OrHighNotHigh       47.52      (3.8%)       47.39      (4.1%)   
-0.3% (  -7% -    7%)
                 LowTerm      691.05      (4.1%)      689.34      (4.5%)   
-0.2% (  -8% -    8%)
        HighSloppyPhrase       20.35      (3.1%)       20.31      (3.1%)   
-0.2% (  -6% -    6%)
            HighSpanNear       27.30      (2.0%)       27.25      (1.8%)   
-0.2% (  -3% -    3%)
               LowPhrase      101.02      (2.2%)      100.92      (2.8%)   
-0.1% (  -4% -    4%)
                  Fuzzy2       68.31      (5.8%)       68.25      (7.1%)   
-0.1% ( -12% -   13%)
             LowSpanNear      170.99      (2.7%)      170.87      (3.4%)   
-0.1% (  -6% -    6%)
         LowSloppyPhrase       55.34      (1.9%)       55.30      (2.1%)   
-0.1% (  -4% -    4%)
               OrHighLow       84.81      (3.8%)       84.81      (4.1%)    
0.0% (  -7% -    8%)
              AndHighMed      294.21      (2.3%)      294.28      (2.5%)    
0.0% (  -4% -    4%)
                 Prefix3       32.27      (8.8%)       32.28      (8.0%)    
0.0% ( -15% -   18%)
   HighTermDayOfYearSort       65.88      (2.4%)       65.92      (2.8%)    
0.1% (  -5% -    5%)
              HighPhrase       16.67      (2.1%)       16.70      (1.8%)    
0.1% (  -3% -    4%)
               MedPhrase       82.73      (3.2%)       82.86      (3.0%)    
0.2% (  -5% -    6%)
         MedSloppyPhrase       42.22      (2.1%)       42.29      (2.3%)    
0.2% (  -4% -    4%)
            OrNotHighLow      889.62      (3.3%)      891.74      (4.9%)    
0.2% (  -7% -    8%)
              OrHighHigh       32.35      (5.2%)       32.43      (5.2%)    
0.3% (  -9% -   11%)
            OrNotHighMed      158.41      (3.3%)      158.83      (3.6%)    
0.3% (  -6% -    7%)
             AndHighHigh       47.84      (1.4%)       47.98      (1.8%)    
0.3% (  -2% -    3%)
                 MedTerm      240.59      (4.1%)      241.51      (3.6%)    
0.4% (  -7% -    8%)
            OrHighNotLow      111.19      (4.7%)      111.65      (5.3%)    
0.4% (  -9% -   10%)
                HighTerm      100.65      (4.3%)      101.63      (4.6%)    
1.0% (  -7% -   10%)
       HighTermMonthSort      140.07     (11.1%)      141.91     (10.5%)    
1.3% ( -18% -   25%)
                IntNRQ10       76.64      (9.9%)       80.71      (2.8%)    
5.3% (  -6% -   19%)
                IntNRQ50       17.16     (11.4%)       18.11      (2.7%)    
5.6% (  -7% -   22%)
                IntNRQ25       33.52     (11.0%)       35.61      (3.0%)    
6.2% (  -6% -   22%)
                  IntNRQ       12.62     (11.8%)       15.84      (4.4%)   
25.6% (   8% -   47%)
                IntNRQ75       11.57     (11.7%)       15.18      (4.7%)   
31.2% (  13% -   54%)
                IntNRQ90        9.71     (11.6%)       13.84      (5.6%)   
42.6% (  22% -   67%)
{noformat}

This looks like a good speedup to me.

The ranges that are followed by a number are some additional ranges that I 
added in order to see the impact based on the percentage of matched documents. 
For instance {{IntNRQ75}} has ranges that match 75% of documents. Here are the 
tasks that I appended to {{wikimedium.10M.nostopwords.tasks}} if you would like 
to reproduce:

{noformat}
IntNRQ10: nrq//timesecnum 6207 15563
IntNRQ10: nrq//timesecnum 53 8742
IntNRQ10: nrq//timesecnum 77160 84811
IntNRQ25: nrq//timesecnum 8899 36335
IntNRQ25: nrq//timesecnum 68109 86101
IntNRQ25: nrq//timesecnum 44771 64123
IntNRQ50: nrq//timesecnum 101 51223
IntNRQ50: nrq//timesecnum 25811 69953
IntNRQ50: nrq//timesecnum 46298 82710
IntNRQ75: nrq//timesecnum 13867 79886
IntNRQ75: nrq//timesecnum 8830 75922
IntNRQ75: nrq//timesecnum 17003 82111
IntNRQ90: nrq//timesecnum 1810 80159
IntNRQ90: nrq//timesecnum 6160 84811
IntNRQ90: nrq//timesecnum 8210 86330
{noformat}

It is interesting that small ranges seem to also benefit from a small speedup 
(I reran the benchmark to confirm and it looks reproducible). I am unsure why, 
but suspect that the fact that we use a different code path for large ranges 
makes things easier for the JVM to optimize on small ranges (?).

> Speed up point ranges that match most documents
> -----------------------------------------------
>
>                 Key: LUCENE-7641
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7641
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7461.patch, LUCENE-7641.patch
>
>
> If a point range matches most documents and  every document has exactly one 
> value, then we could make things faster by computing the set of documents 
> that do NOT match the range instead.
> It was not possible until recently since figuring out whether a range query 
> matches most documents was not possible, but we can now use the new 
> {{PointValues.estimatePointcount}} API to do that: we could just check 
> whether the cost of the inverse visitor is lower than the cost of the regular 
> range visitor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (LUCENE-7641) Speed up point ranges that match most documents

Reply via email to