[jira] [Commented] (LUCENE-8135) Implement Block-Max WAND

Adrien Grand (JIRA) Tue, 23 Jan 2018 08:57:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336062#comment-16336062
 ]


Adrien Grand commented on LUCENE-8135:
--------------------------------------

Here is a patch that applies on top of the one that is posted on LUCENE-4198:
 - Scorer gets the same methods as ImpactsEnum: advanceShallow and getMaxScore, 
with the same contract
 - Disjunctions and conjunctions skip over entire blocks when the sum of max 
scores is not competitive.
 - Disjunctions now use the block max score instead of the global max score, 
which helps skip over more documents.
 - This is not documented in the paper, but when a minimum score is set, the 
score is computed on the fly in order to move to the next doc faster. I did 
this based on the observation that computing a score was often less costly than 
advancing another clause. This is especially useful due to the fact that on the 
contrary to term queries, the maximum score of a block is often not collected.
 - Another difference with the paper is the fact that we have information about 
blocks at multiple levels. This is one of the ideas described in a follow-up 
paper (see section 4 of http://engineering.nyu.edu/~suel/papers/bmm.pdf) which 
is based on the observation that you sometimes want to advance by more than 
${block-size}.

Results on wikibigall need to be taken carefully as some tasks are either 
faster or slower depending on which query is run. Typically queries whose terms 
match lots of documents but the intersection matches few documents are a bit 
slower because it takes time to get a good lower bound for the score. Baseline 
is master, and patch is the combination of the patch on LUCENE-4198 and this 
patch.
 
{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev        
        Pct diff
               OrHighLow      439.34      (2.2%)      402.38      (1.8%)   
-8.4% ( -12% -   -4%)
                  Fuzzy1      144.25      (4.3%)      137.08      (3.3%)   
-5.0% ( -12% -    2%)
               MedPhrase       66.49      (1.1%)       63.83      (1.8%)   
-4.0% (  -6% -   -1%)
                  Fuzzy2      153.94      (7.0%)      147.95      (7.3%)   
-3.9% ( -16% -   11%)
   HighTermDayOfYearSort       54.35     (10.5%)       52.97     (12.0%)   
-2.5% ( -22% -   22%)
       HighTermMonthSort       91.01      (9.0%)       89.31      (9.2%)   
-1.9% ( -18% -   17%)
                 Prefix3      112.97      (7.6%)      111.66      (6.6%)   
-1.2% ( -14% -   14%)
             MedSpanNear       46.12      (4.1%)       45.82      (4.5%)   
-0.6% (  -8% -    8%)
                 Respell      249.65      (4.4%)      248.13      (4.1%)   
-0.6% (  -8% -    8%)
                Wildcard       38.32      (6.9%)       38.09      (7.2%)   
-0.6% ( -13% -   14%)
                  IntNRQ       34.26      (4.8%)       34.18      (6.0%)   
-0.3% ( -10% -   11%)
             LowSpanNear       12.46      (4.8%)       12.44      (4.9%)   
-0.2% (  -9% -   10%)
               LowPhrase       44.89      (1.4%)       44.85      (2.1%)   
-0.1% (  -3% -    3%)
         MedSloppyPhrase       13.77      (6.6%)       13.88      (5.8%)    
0.8% ( -10% -   14%)
            HighSpanNear        3.58      (7.0%)        3.61      (7.2%)    
0.9% ( -12% -   16%)
         LowSloppyPhrase        4.59      (7.7%)        4.65      (6.8%)    
1.4% ( -12% -   17%)
        HighSloppyPhrase        8.87      (8.1%)        9.02      (6.9%)    
1.7% ( -12% -   18%)
              AndHighLow     1380.92      (3.2%)     1418.90      (3.2%)    
2.8% (  -3% -    9%)
              HighPhrase       17.20      (3.3%)       18.70      (4.7%)    
8.8% (   0% -   17%)
              OrHighHigh       21.38      (3.3%)       28.96      (4.6%)   
35.5% (  26% -   44%)
              AndHighMed      123.15      (1.5%)      177.55      (5.3%)   
44.2% (  36% -   51%)
               OrHighMed       42.31      (2.0%)       68.99      (4.0%)   
63.0% (  55% -   70%)
             AndHighHigh       25.65      (1.5%)       44.33      (6.9%)   
72.8% (  63% -   82%)
                 LowTerm      567.65      (2.2%)     2305.80     (16.4%)  
306.2% ( 281% -  332%)
                 MedTerm      284.07      (2.1%)     1837.41     (26.1%)  
546.8% ( 508% -  587%)
               AndCommon       56.30      (1.3%)      437.69     (33.5%)  
677.5% ( 634% -  721%)
                HighTerm       86.50      (2.1%)      709.65     (38.8%)  
720.4% ( 665% -  777%)
                OrCommon       13.95      (3.3%)      223.83     (66.0%) 
1504.7% (1389% - 1627%)
{noformat}

For instance I had other runs that made OrHighMed slower. However the slowdown 
is usually contained and remains in the ~10%.

There are two tasks that I added: AndCommon and OrCommon. Current tasks have 
been computed rather randomly (I think!) based on the frequencies of terms, but 
user queries are a bit more likely to query terms that frequently occur 
together, so I wanted to see the impact on such queries:

{noformat}
OrCommon: united states
OrCommon: have been
OrCommon: see also
OrCommon: united kingdom
OrCommon: new york
AndCommon: +united +states
AndCommon: +have +been
AndCommon: +see +also
AndCommon: +united +kingdom
AndCommon: +new +york
{noformat}

On the contrary to some other tasks like OrHighMed, OrHighLow and AndHighLow, 
those tasks are consistently MUCH faster with the patch.

> Implement Block-Max WAND
> ------------------------
>
>                 Key: LUCENE-8135
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8135
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> This issue is about building on top of LUCENE-4198 in order to leverage block 
> maximum scores instead of global maximum scores. This is documented in 
> "Faster Top-k Document Retrieval Using Block-Max Indexes" 
> ([http://engineering.nyu.edu/~suel/papers/bmw.pdf)|http://engineering.nyu.edu/~suel/papers/bmw.pdf]
>  and called BMW (Block-Max WAND).
>  
> Using block max scores adds overhead to scorers, but also provides better 
> upper bounds of the scores and is expected to remain efficient in presence of 
> outliers (LUCENE-8087).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8135) Implement Block-Max WAND

Reply via email to