[ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4198:
---------------------------------
    Attachment: LUCENE-4198.patch

I have taken another approach. The issue with {{setMinCompetitiveScore}} is that it usually cannot be efficiently leveraged to speed up e.g. conjunctions. So I went with implementing ideas from the block-max WAND (BMW) paper (http://engineering.nyu.edu/~suel/papers/bmw.pdf): the patch introduces a new {{ImpactsEnum}}, which extends {{PostingsEnum}} and exposes two APIs instead of {{setMinCompetitiveScore}}:
 - {{int advanceShallow(int target)}} to get scoring information for documents that start at {{target}}. The benefit compared to {{advance}} is that it only advances the skip-list reader, which is much cheaper: no postings decoding happens.
 - {{float getMaxScore(int upTo)}}, which gives information about scores for doc ids between the last target passed to {{advanceShallow}} and {{upTo}}, both inclusive (see the sketch right after this list).
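
Here is a rough sketch (not code from the patch) of how a scorer could consume these two APIs to skip whole blocks whose best possible score is not competitive. It assumes, for illustration only, that {{advanceShallow}} returns the last doc ID covered by the skip block containing {{target}}; {{minCompetitiveScore}} and {{skipToCompetitiveBlock}} are made-up names.

{code:java}
// Illustration only: block-max-WAND-style skipping on top of the proposed API.
// Assumption (not spelled out above): advanceShallow(target) returns the last
// doc ID of the skip block that contains target.
static int skipToCompetitiveBlock(ImpactsEnum impacts, float minCompetitiveScore, int target)
    throws IOException {
  while (true) {
    // Only moves the skip-list reader forward; no postings block is decoded.
    int blockUpTo = impacts.advanceShallow(target);
    // Upper bound on the scores of all docs in [target, blockUpTo].
    float maxScore = impacts.getMaxScore(blockUpTo);
    if (maxScore >= minCompetitiveScore) {
      return target; // this block may contain a competitive hit: decode it
    }
    if (blockUpTo == DocIdSetIterator.NO_MORE_DOCS) {
      return DocIdSetIterator.NO_MORE_DOCS; // postings exhausted
    }
    target = blockUpTo + 1; // the whole block is non-competitive, jump over it
  }
}
{code}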

Currently only TermScorer leverages this, but the benefit is that we could add these APIs to Scorer as well in a follow-up issue so that WANDScorer and ConjunctionScorer could leverage them too. I already built a prototype to make sure that there is an actual speedup for some queries, but I'm leaving it to a follow-up issue since indexing impacts is already challenging on its own. One thing the prototype did make me change is that the new patch also stores all impacts on the first skip level, which is written every 128 documents. This seemed important for conjunctions: unlike term queries, which match every document in a block, a conjunction does not necessarily reach the maximum score of a given block, so good per-block score bounds matter more for conjunctions than for term queries. The disk overhead still looks acceptable to me: the wikimedium10 index is only 1.4% larger overall, and the postings alone (the .doc file) are only 3.1% larger.
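
As a side note, here is a toy illustration (my own sketch, not the patch's actual writer code) of what storing impacts on the first skip level boils down to: for every 128-document block, the writer keeps the competitive (freq, norm) pairs so that {{getMaxScore}} can later derive an upper bound on the block's score for whatever {{Similarity}} is in use. Class and method names below are made up.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustration only: accumulate per-block impacts ((freq, norm) pairs) while
// writing postings, and flush them alongside the first skip level every 128 docs.
class BlockImpactsAccumulator {
  // norm -> highest freq seen for that norm within the current 128-doc block
  private final Map<Long, Integer> maxFreqPerNorm = new HashMap<>();

  void add(int freq, long norm) {
    maxFreqPerNorm.merge(norm, freq, Math::max);
  }

  // Called when the block ends: the surviving (freq, norm) pairs are what would
  // be written to the skip data, from which a per-block max score can be computed.
  Map<Long, Integer> getAndClear() {
    Map<Long, Integer> impacts = new HashMap<>(maxFreqPerNorm);
    maxFreqPerNorm.clear();
    return impacts;
  }
}
{code}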

Here are the benchmark results:

{noformat}
                    TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
              AndHighLow     1128.91      (3.5%)      875.48      (2.3%)  -22.4% ( -27% -  -17%)
              AndHighMed      409.67      (2.0%)      331.98      (1.7%)  -19.0% ( -22% -  -15%)
               OrHighMed      264.99      (3.5%)      229.15      (3.0%)  -13.5% ( -19% -   -7%)
               OrHighLow      111.47      (4.5%)       98.00      (3.2%)  -12.1% ( -18% -   -4%)
              OrHighHigh       34.88      (4.2%)       31.69      (4.0%)   -9.1% ( -16% -   -1%)
            OrNotHighLow     1373.74      (5.2%)     1291.72      (4.1%)   -6.0% ( -14% -    3%)
               LowPhrase       78.14      (1.6%)       75.28      (1.2%)   -3.7% (  -6% -    0%)
               MedPhrase       47.49      (1.6%)       45.92      (1.2%)   -3.3% (  -5% -    0%)
         LowSloppyPhrase      208.43      (2.8%)      202.37      (2.7%)   -2.9% (  -8% -    2%)
                  Fuzzy1      300.99      (7.7%)      292.78      (8.0%)   -2.7% ( -17% -   13%)
             LowSpanNear       62.73      (1.4%)       61.09      (1.3%)   -2.6% (  -5% -    0%)
                  Fuzzy2      188.37      (7.9%)      184.16      (6.7%)   -2.2% ( -15% -   13%)
             MedSpanNear       57.41      (1.8%)       56.17      (1.5%)   -2.2% (  -5% -    1%)
         MedSloppyPhrase       23.21      (2.3%)       22.73      (2.3%)   -2.1% (  -6% -    2%)
              HighPhrase       48.75      (3.2%)       47.80      (3.6%)   -1.9% (  -8% -    4%)
            HighSpanNear       40.04      (2.9%)       39.35      (2.7%)   -1.7% (  -7% -    4%)
       HighTermMonthSort      228.21      (8.4%)      224.66      (7.9%)   -1.6% ( -16% -   16%)
        HighSloppyPhrase       25.96      (2.8%)       25.61      (3.0%)   -1.4% (  -6% -    4%)
                 Respell      284.85      (3.7%)      282.42      (4.0%)   -0.9% (  -8% -    7%)
                  IntNRQ       18.87      (5.3%)       18.86      (6.8%)   -0.1% ( -11% -   12%)
                Wildcard       85.50      (5.0%)       86.79      (4.0%)    1.5% (  -7% -   11%)
                 Prefix3      137.41      (6.5%)      141.61      (4.9%)    3.1% (  -7% -   15%)
   HighTermDayOfYearSort      116.58      (6.3%)      121.38      (7.2%)    4.1% (  -8% -   18%)
             AndHighHigh       37.64      (1.5%)      118.12      (6.7%)  213.8% ( 202% -  225%)
                 LowTerm      909.13      (2.2%)     3379.38     (11.2%)  271.7% ( 252% -  291%)
            OrNotHighMed      196.21      (1.7%)     1509.92     (28.9%)  669.6% ( 627% -  712%)
                 MedTerm      305.82      (1.7%)     2897.01     (42.5%)  847.3% ( 789% -  907%)
                HighTerm      108.94      (1.7%)     1191.54     (61.3%)  993.8% ( 915% - 1075%)
            OrHighNotMed       81.83      (0.5%)     1082.94     (63.2%) 1223.5% (1153% - 1294%)
           OrHighNotHigh       63.16      (0.7%)      995.35     (72.0%) 1475.8% (1393% - 1559%)
            OrHighNotLow       34.53      (0.9%)      557.35     (94.8%) 1514.1% (1406% - 1623%)
           OrNotHighHigh       51.95      (0.9%)      943.81     (73.7%) 1716.6% (1626% - 1808%)
{noformat}

Results are a bit worse than previously for two reasons:
 - The new API doesn't allow term queries to skip more than one block (128 docs) at a time. But I think the prospect of also speeding up boolean queries makes it a good trade-off, especially given that this is already a quite significant speedup for term queries.
 - The new implementation is more complex since it also (optionally) gives access to positions, offsets and payloads. I did not specialize the case where only frequencies are requested (hopefully this won't be necessary). Information about positions could be useful in the future to speed up phrase queries.

I wouldn't worry too much about the slowdown for some boolean queries. First, the most affected queries are already quite fast. Second, I hope that follow-ups will allow us to get some of that performance back.

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.


