[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Fri, 02 Apr 2021 06:39:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313877#comment-17313877
 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

I haven't really been following along with what's going on in JDK17, but being 
able to more explicitly generate vectorized instructions will be nice!

I optimized my branch a bit further. At this point I think it has all the 
optimizations of ForDeltaUtil, and the only extra work is applying the 
exceptions. I even pulled in the optimization to apply the prefix to two values 
at a time (packed in a long). That code is over 
[here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3].
 (It's not polished at all... a bit hacky)
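
For anyone who hasn't opened the branch, the two decode steps mentioned above (applying the exceptions, then applying the prefix sum to two values at a time) can be sketched roughly as follows. This is only an illustration, not the code on the branch: the class and method names are hypothetical, bit-unpacking is skipped entirely, and it assumes a non-empty block of even length whose doc IDs fit in 31 bits so the low lane never carries into the high lane.

```java
import java.util.Arrays;

// Hypothetical sketch; names and layout are assumptions, not Lucene's actual classes.
final class PforDocIdSketch {

    /** Restores outliers by ORing the stored high bits back onto the low bits. */
    static void applyExceptions(long[] deltas, int bitsPerValue,
                                int[] exceptionIndex, long[] exceptionHighBits) {
        for (int i = 0; i < exceptionIndex.length; i++) {
            deltas[exceptionIndex[i]] |= exceptionHighBits[i] << bitsPerValue;
        }
    }

    /**
     * Turns deltas into absolute doc IDs, advancing two 32-bit lanes per 64-bit add.
     * The high lane holds the first half of the block, the low lane the second half.
     */
    static long[] prefixSumTwoLanes(long[] deltas, long base) {
        int half = deltas.length / 2;
        long[] packed = new long[half];
        for (int i = 0; i < half; i++) {
            packed[i] = (deltas[i] << 32) | deltas[half + i];
        }
        for (int i = 1; i < half; i++) {
            packed[i] += packed[i - 1]; // one add updates both running sums
        }
        long[] docIds = new long[deltas.length];
        for (int i = 0; i < half; i++) {
            docIds[i] = base + (packed[i] >>> 32);
        }
        // The second half continues from the last doc ID of the first half.
        long firstHalfEnd = docIds[half - 1];
        for (int i = 0; i < half; i++) {
            docIds[half + i] = firstHalfEnd + (packed[i] & 0xFFFFFFFFL);
        }
        return docIds;
    }

    public static void main(String[] args) {
        // Low 3 bits of the deltas {3, 1, 100, 2, 5, 1, 70, 4}; slots 2 and 6 are exceptions.
        long[] deltas = {3, 1, 4, 2, 5, 1, 6, 4};
        applyExceptions(deltas, 3, new int[] {2, 6}, new long[] {12, 8});
        System.out.println(Arrays.toString(prefixSumTwoLanes(deltas, 10)));
    }
}
```

The exception pass is the only part that plain FOR decoding would not need, which matches the "only extra work" remark above.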

The benchmark results are looking a lot better now, though there may still be 
some regressions. I've seen a little variability in these results, so I'm not 
sure how often they report false regressions on individual tasks. 
Here's what I've got at this point:
{code:java}
                     Task QPS baseline      StdDev QPS pfordocids  StdDev Pct diff              p-value
             LowSpanNear       98.61      (2.2%)       95.57      (1.7%)   -3.1% (  -6% -    0%) 0.000
           OrNotHighHigh      545.01      (3.8%)      531.11      (5.3%)   -2.6% ( -11% -    6%) 0.078
                Wildcard       40.83      (4.1%)       40.05      (3.9%)   -1.9% (  -9% -    6%) 0.132
               OrHighMed      102.39      (2.5%)      100.50      (2.5%)   -1.8% (  -6% -    3%) 0.021
             AndHighHigh       50.93      (3.3%)       50.03      (3.1%)   -1.8% (  -7% -    4%) 0.079
              TermDTSort       98.42     (11.6%)       96.72     (14.4%)   -1.7% ( -24% -   27%) 0.676
              AndHighMed       68.10      (2.9%)       66.94      (2.9%)   -1.7% (  -7% -    4%) 0.063
                HighTerm     1169.43      (4.4%)     1151.70      (5.1%)   -1.5% ( -10% -    8%) 0.314
   BrowseMonthSSDVFacets       12.50      (5.6%)       12.31      (7.7%)   -1.5% ( -14% -   12%) 0.480
    HighTermTitleBDVSort      157.42     (14.7%)      155.08     (15.6%)   -1.5% ( -27% -   33%) 0.757
            OrHighNotLow      545.83      (5.7%)      537.85      (7.0%)   -1.5% ( -13% -   12%) 0.472
             MedSpanNear       28.75      (2.4%)       28.34      (1.9%)   -1.4% (  -5% -    2%) 0.038
           OrHighNotHigh      533.41      (4.6%)      526.33      (5.3%)   -1.3% ( -10% -    8%) 0.394
                  Fuzzy1       59.47      (6.0%)       58.72      (6.8%)   -1.3% ( -13% -   12%) 0.533
            HighSpanNear       21.27      (2.6%)       21.03      (2.2%)   -1.1% (  -5% -    3%) 0.153
       HighTermMonthSort      128.50     (12.2%)      127.11     (11.0%)   -1.1% ( -21% -   25%) 0.769
            OrNotHighLow      640.89      (4.0%)      634.43      (3.5%)   -1.0% (  -8% -    6%) 0.395
              OrHighHigh       21.11      (2.0%)       20.91      (1.8%)   -1.0% (  -4% -    2%) 0.113
               MedPhrase      103.90      (3.0%)      103.05      (2.9%)   -0.8% (  -6% -    5%) 0.381
              HighPhrase      172.59      (2.5%)      171.22      (2.5%)   -0.8% (  -5% -    4%) 0.320
            OrHighNotMed      535.67      (4.8%)      531.54      (4.7%)   -0.8% (  -9% -    9%) 0.607
                 LowTerm     1094.41      (2.9%)     1087.97      (3.1%)   -0.6% (  -6% -    5%) 0.535
         MedSloppyPhrase       12.91      (2.4%)       12.85      (2.5%)   -0.5% (  -5% -    4%) 0.542
                  IntNRQ      101.21      (0.5%)      100.81      (0.7%)   -0.4% (  -1% -    0%) 0.040
                PKLookup      144.62      (3.0%)      144.11      (3.1%)   -0.4% (  -6% -    5%) 0.715
        HighSloppyPhrase        3.75      (2.9%)        3.74      (3.0%)   -0.3% (  -6% -    5%) 0.726
    HighIntervalsOrdered       16.00      (2.1%)       15.95      (1.8%)   -0.3% (  -4% -    3%) 0.597
   HighTermDayOfYearSort      109.37     (11.0%)      109.03     (15.2%)   -0.3% ( -23% -   29%) 0.941
         LowSloppyPhrase       41.05      (1.9%)       40.93      (2.1%)   -0.3% (  -4% -    3%) 0.635
                 MedTerm     1137.13      (4.1%)     1134.84      (4.2%)   -0.2% (  -8% -    8%) 0.877
BrowseDayOfYearTaxoFacets        4.24      (3.4%)        4.23      (3.2%)   -0.2% (  -6% -    6%) 0.885
                 Prefix3      263.31      (9.1%)      263.08      (9.2%)   -0.1% ( -16% -   20%) 0.976
    BrowseDateTaxoFacets        4.23      (3.4%)        4.23      (3.2%)   -0.1% (  -6% -    6%) 0.941
                  Fuzzy2       44.14     (13.1%)       44.17     (14.2%)    0.1% ( -24% -   31%) 0.990
   BrowseMonthTaxoFacets        5.00      (2.6%)        5.01      (2.2%)    0.1% (  -4% -    4%) 0.878
            OrNotHighMed      508.22      (3.6%)      508.82      (3.9%)    0.1% (  -7% -    7%) 0.921
                 Respell       38.75      (2.3%)       38.82      (2.2%)    0.2% (  -4% -    4%) 0.791
BrowseDayOfYearSSDVFacets       11.38      (5.7%)       11.42      (5.8%)    0.3% ( -10% -   12%) 0.855
               OrHighLow      220.12      (4.0%)      220.98      (3.7%)    0.4% (  -6% -    8%) 0.745
              AndHighLow      620.94      (3.0%)      624.41      (3.3%)    0.6% (  -5% -    7%) 0.573
               LowPhrase      132.77      (2.1%)      133.52      (2.4%)    0.6% (  -3% -    5%) 0.433
{code}

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
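
To make the size side of that trade-off concrete, here is a rough sketch of how one block's payload might compare under FOR and PFOR. Everything in it is an assumption made for illustration: the class and method names are hypothetical, each exception is charged 16 bits (a one-byte index plus one byte of high bits), per-block headers are ignored, and deltas are assumed non-negative in a non-empty block.

```java
// Hypothetical size model; not Lucene's actual encoder.
final class ForVsPforSizeSketch {

    /** Bits needed to represent a non-negative value (at least 1). */
    static int bitsRequired(long value) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    }

    /** FOR: every delta is stored at the width of the widest delta. */
    static int forBits(long[] deltas) {
        long or = 0;
        for (long d : deltas) {
            or |= d;
        }
        return deltas.length * bitsRequired(or);
    }

    /**
     * PFOR: try narrower widths and pay 16 bits per delta that overflows.
     * The width is lowered by at most 8 bits so the overflow fits in one byte.
     */
    static int pforBits(long[] deltas, int maxExceptions) {
        int widest = forBits(deltas) / deltas.length;
        int best = deltas.length * widest;
        for (int b = Math.max(0, widest - 8); b < widest; b++) {
            int exceptions = 0;
            for (long d : deltas) {
                if ((d >>> b) != 0) {
                    exceptions++;
                }
            }
            if (exceptions <= maxExceptions) {
                best = Math.min(best, deltas.length * b + exceptions * 16);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] deltas = new long[128];
        java.util.Arrays.fill(deltas, 3);
        deltas[40] = 1000; // a single outlier forces FOR up to 10 bits per value
        System.out.println("FOR bits:  " + forBits(deltas));
        System.out.println("PFOR bits: " + pforBits(deltas, 7));
    }
}
```

Under this model a block with one large outlier shrinks considerably, while a block of uniform deltas costs the same either way. How often real postings look like the first case is what an index-size measurement would have to show.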



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

