[jira] [Updated] (LUCENE-4198) Allow codecs to index term impacts

Adrien Grand (JIRA) Fri, 05 Jan 2018 06:26:24 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-4198:
---------------------------------
    Attachment: LUCENE-4198.patch

New patch. This time it has tests, does basic testing in CheckIndex and does 
not clone too much.

Results are very good on queries that score on a single term, almost too good. 
I'm currently thinking about how we could change the API to have something that 
is easier to propagate with boolean queries, even if it means term queries 
can't be as fast.

{noformat}
                    Task  QPS baseline      StdDev   QPS patch      StdDev                  Pct diff
              AndHighLow       2050.37      (4.2%)     1745.54      (2.0%)   -14.9% ( -20% -   -9%)
               OrHighLow        922.62      (3.7%)      844.54      (2.4%)    -8.5% ( -14% -   -2%)
              AndHighMed        277.85      (1.8%)      258.11      (2.6%)    -7.1% ( -11% -   -2%)
            OrNotHighLow       1105.41      (3.6%)     1044.69      (2.0%)    -5.5% ( -10% -    0%)
             AndHighHigh        128.97      (1.1%)      121.89      (2.7%)    -5.5% (  -9% -   -1%)
                  Fuzzy2        166.62      (6.2%)      158.38      (6.3%)    -4.9% ( -16% -    8%)
               OrHighMed        177.56      (2.3%)      170.05      (1.9%)    -4.2% (  -8% -    0%)
                  Fuzzy1        199.16      (4.4%)      193.05      (5.5%)    -3.1% ( -12% -    7%)
         MedSloppyPhrase         53.92      (2.2%)       52.40      (2.3%)    -2.8% (  -7% -    1%)
               LowPhrase        201.13      (1.7%)      195.87      (1.0%)    -2.6% (  -5% -    0%)
             LowSpanNear        363.85      (3.0%)      355.07      (2.5%)    -2.4% (  -7% -    3%)
              HighPhrase         62.68      (1.6%)       61.32      (1.2%)    -2.2% (  -4% -    0%)
       HighTermMonthSort        218.42      (9.8%)      214.35      (8.3%)    -1.9% ( -18% -   18%)
             MedSpanNear         46.65      (1.4%)       45.89      (1.5%)    -1.6% (  -4% -    1%)
               MedPhrase        178.02      (1.5%)      175.24      (1.2%)    -1.6% (  -4% -    1%)
            HighSpanNear         10.21      (3.4%)       10.11      (3.4%)    -1.0% (  -7% -    6%)
        HighSloppyPhrase         32.32      (7.3%)       32.01      (7.1%)    -1.0% ( -14% -   14%)
         LowSloppyPhrase         18.01      (2.7%)       17.85      (2.7%)    -0.9% (  -6% -    4%)
                 Respell        320.99      (2.1%)      321.02      (2.4%)     0.0% (  -4% -    4%)
                  IntNRQ         29.29     (11.6%)       29.42     (12.5%)     0.4% ( -21% -   27%)
                Wildcard        189.97      (4.6%)      191.87      (3.9%)     1.0% (  -7% -    9%)
                 Prefix3        166.43      (6.2%)      169.95      (5.4%)     2.1% (  -8% -   14%)
              OrHighHigh         48.00      (3.7%)       49.09      (3.9%)     2.3% (  -5% -   10%)
   HighTermDayOfYearSort        146.88      (7.4%)      150.76      (8.0%)     2.6% ( -11% -   19%)
                 LowTerm        830.79      (2.6%)     2246.40      (9.9%)   170.4% ( 153% -  187%)
            OrNotHighMed        180.11      (1.5%)     1454.55     (15.7%)   707.6% ( 680% -  735%)
                 MedTerm        216.16      (1.7%)     3834.73     (37.0%)  1674.0% (1608% - 1742%)
                HighTerm        109.49      (2.0%)     1944.44     (45.3%)  1675.9% (1597% - 1757%)
            OrHighNotMed         57.55      (1.1%)     1292.66     (57.7%)  2146.2% (2064% - 2229%)
            OrHighNotLow         84.00      (1.1%)     1996.82     (75.4%)  2277.2% (2176% - 2379%)
           OrNotHighHigh         58.22      (1.3%)     1479.53     (53.5%)  2441.4% (2356% - 2528%)
           OrHighNotHigh         66.91      (1.2%)     2042.54     (55.1%)  2952.6% (2862% - 3045%)
{noformat}
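
For readers checking the table: the "Pct diff" column appears to be the relative change of the patch QPS over the baseline QPS. That reading is an assumption, not something the message states, but it reproduces the LowTerm (170.4%) and AndHighLow (-14.9%) rows. A minimal sketch, with a hypothetical class name:

```java
import java.util.Locale;

public final class PctDiff {
    /** Relative change of patch QPS over baseline QPS, in percent. */
    public static double pctDiff(double baselineQps, double patchQps) {
        return (patchQps - baselineQps) / baselineQps * 100.0;
    }

    public static void main(String[] args) {
        // QPS values copied from the LowTerm and AndHighLow rows of the table.
        System.out.printf(Locale.ROOT, "LowTerm    %+.1f%%%n", pctDiff(830.79, 2246.40));
        System.out.printf(Locale.ROOT, "AndHighLow %+.1f%%%n", pctDiff(2050.37, 1745.54));
    }
}
```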

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.
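
To illustrate why the description asks for the Similarity and the statistics up front, here is a rough sketch of write-time impact computation. It is not Lucene's codec API and not the design of the attached patch: every name (`ImpactSketch`, `Scorer`, `blockMaxScores`, `postingsToScore`) is hypothetical, and the BM25-style formula is only a stand-in for whatever Similarity is configured. The point is that a maximum score per block of postings can only be stored if the scoring function and its statistics (document frequency, document count, average length) are known while the postings are being written.

```java
public final class ImpactSketch {
    /** Hypothetical stand-in for a Similarity: scores a term frequency and document length. */
    public interface Scorer {
        double score(int freq, int docLength);
    }

    /** BM25-style scorer; it needs term and collection statistics before any posting is scored. */
    public static Scorer bm25(long docFreq, long docCount, double avgLength) {
        final double k1 = 1.2, b = 0.75;
        final double idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        return (freq, len) ->
            idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len / avgLength));
    }

    /** Write-time step: the maximum score within each fixed-size block of postings. */
    public static double[] blockMaxScores(int[] freqs, int[] docLengths, int blockSize, Scorer scorer) {
        int numBlocks = (freqs.length + blockSize - 1) / blockSize;
        double[] max = new double[numBlocks]; // scores are positive, so 0 is a safe floor
        for (int i = 0; i < freqs.length; i++) {
            int block = i / blockSize;
            max[block] = Math.max(max[block], scorer.score(freqs[i], docLengths[i]));
        }
        return max;
    }

    /** Search-time step: count postings left to score once non-competitive blocks are skipped. */
    public static int postingsToScore(double[] blockMax, int blockSize, int numPostings,
                                      double minCompetitiveScore) {
        int count = 0;
        for (int block = 0; block < blockMax.length; block++) {
            if (blockMax[block] >= minCompetitiveScore) {
                count += Math.min(blockSize, numPostings - block * blockSize);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] freqs = {1, 1, 1, 5};
        int[] lengths = {10, 10, 10, 10};
        Scorer scorer = bm25(4, 100, 10.0);
        double[] max = blockMaxScores(freqs, lengths, 2, scorer);
        // With the threshold set to the best block's score, only that block is visited.
        System.out.println(postingsToScore(max, 2, freqs.length, max[1]));
    }
}
```

Storing a single maximum for the entire term, as the description mentions, corresponds to one block covering all postings in this sketch.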



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
