[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Wed, 24 Mar 2021 15:57:25 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308216#comment-17308216
 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

I ran a luceneutil benchmark comparing my PFOR approach to encoding doc ID 
deltas (available 
[here|https://github.com/gsmiller/lucene/tree/LUCENE-9850/pfordocids]) against 
the main branch, using the "wikimediumall" source. The results are below. This 
is the first luceneutil benchmark I've run, so I'm still getting familiar with 
the tool and with interpreting its output. If I'm reading these results 
correctly, using PFOR instead of FOR carries a material performance penalty: 
most tasks lose roughly 1-5% QPS, and many of those differences have p-values 
below 0.05. I'd be curious what other, more experienced folks see in these 
results. I'll see if I can get some figures on the index size difference as 
well, but I'm not sure there's a good path forward here with these QPS results.
{code:java}
                     Task  QPS baseline    StdDev  QPS pfor doc ids    StdDev                Pct diff p-value
               TermDTSort         38.02   (11.8%)             36.08    (8.9%)   -5.1% ( -23% -   17%)   0.123
             OrNotHighLow        488.43    (5.8%)            466.01    (6.2%)   -4.6% ( -15% -    7%)   0.016
                 HighTerm       1276.94    (5.0%)           1222.31    (5.3%)   -4.3% ( -13% -    6%)   0.009
    HighTermDayOfYearSort         51.64   (11.6%)             49.66    (8.0%)   -3.8% ( -20% -   17%)   0.223
        HighTermMonthSort         59.36   (10.6%)             57.09   (11.4%)   -3.8% ( -23% -   20%)   0.272
     HighTermTitleBDVSort         36.61   (16.4%)             35.27   (19.2%)   -3.7% ( -33% -   38%)   0.517
              AndHighHigh         11.06    (3.7%)             10.67    (3.0%)   -3.5% (  -9% -    3%)   0.001
            OrHighNotHigh        568.03   (10.4%)            548.46    (7.7%)   -3.4% ( -19% -   16%)   0.233
                OrHighLow        261.36    (3.9%)            252.58    (3.7%)   -3.4% ( -10% -    4%)   0.005
               AndHighMed         82.45    (3.1%)             79.71    (3.1%)   -3.3% (  -9% -    2%)   0.001
                MedPhrase         40.33    (5.4%)             39.02    (4.7%)   -3.2% ( -12% -    7%)   0.043
                 Wildcard         25.19    (2.8%)             24.46    (2.7%)   -2.9% (  -8% -    2%)   0.001
              LowSpanNear          5.52    (2.0%)              5.36    (2.3%)   -2.9% (  -6% -    1%)   0.000
               AndHighLow        203.23    (2.9%)            197.52    (2.6%)   -2.8% (  -8% -    2%)   0.001
                OrHighMed         19.99    (2.0%)             19.43    (2.1%)   -2.8% (  -6% -    1%)   0.000
                  MedTerm        829.73    (6.4%)            807.65    (5.1%)   -2.7% ( -13% -    9%)   0.144
             OrHighNotLow        482.63    (4.8%)            469.91    (5.5%)   -2.6% ( -12% -    8%)   0.105
               OrHighHigh          9.20    (2.0%)              8.97    (2.3%)   -2.5% (  -6% -    1%)   0.000
                LowPhrase         16.16    (3.3%)             15.76    (2.7%)   -2.5% (  -8% -    3%)   0.009
              MedSpanNear          3.14    (2.1%)              3.07    (2.3%)   -2.3% (  -6% -    2%)   0.001
                  Prefix3        121.86    (8.5%)            119.12    (6.5%)   -2.2% ( -15% -   13%)   0.349
             OrNotHighMed        477.93    (6.0%)            467.27    (6.7%)   -2.2% ( -14% -   11%)   0.268
             HighSpanNear          9.24    (2.2%)              9.05    (2.1%)   -2.0% (  -6% -    2%)   0.004
          MedSloppyPhrase         16.95    (2.9%)             16.67    (3.0%)   -1.7% (  -7% -    4%)   0.069
                   IntNRQ         49.47    (2.6%)             48.88    (1.6%)   -1.2% (  -5% -    3%)   0.087
          LowSloppyPhrase         30.67    (2.7%)             30.33    (2.8%)   -1.1% (  -6% -    4%)   0.198
                  LowTerm        984.89    (4.7%)            973.96    (3.1%)   -1.1% (  -8% -    7%)   0.380
            OrNotHighHigh        476.25    (8.3%)            471.56    (7.9%)   -1.0% ( -15% -   16%)   0.701
     HighIntervalsOrdered          4.20    (2.2%)              4.18    (2.4%)   -0.7% (  -5% -    3%)   0.347
             OrHighNotMed        445.69    (5.1%)            443.29    (5.6%)   -0.5% ( -10% -   10%)   0.750
    BrowseMonthTaxoFacets          1.41    (1.2%)              1.41    (1.4%)   -0.3% (  -2% -    2%)   0.427
                 PKLookup        127.78    (3.2%)            127.46    (2.9%)   -0.3% (  -6% -    6%)   0.794
BrowseDayOfYearTaxoFacets          1.22    (2.1%)              1.22    (2.1%)   -0.2% (  -4% -    4%)   0.735
     BrowseDateTaxoFacets          1.23    (2.0%)              1.23    (2.0%)   -0.2% (  -4% -    3%)   0.782
BrowseDayOfYearSSDVFacets          2.90    (0.9%)              2.90    (1.0%)   -0.1% (  -1% -    1%)   0.685
    BrowseMonthSSDVFacets          3.15    (1.0%)              3.15    (1.1%)    0.0% (  -2% -    2%)   0.903
         HighSloppyPhrase          7.89    (5.5%)              7.91    (4.4%)    0.2% (  -9% -   10%)   0.876
                   Fuzzy2         34.20    (9.0%)             34.31    (8.4%)    0.3% ( -15% -   19%)   0.909
                   Fuzzy1         44.78    (6.2%)             44.95    (6.0%)    0.4% ( -11% -   13%)   0.851
                  Respell         21.07    (2.5%)             21.16    (2.4%)    0.5% (  -4% -    5%)   0.552
               HighPhrase        274.21    (5.3%)            279.29    (5.3%)    1.9% (  -8% -   13%)   0.269
{code}
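
For anyone else reading a luceneutil table for the first time, here is a 
small sketch of how the "Pct diff" figure appears to relate to the two QPS 
columns: it is consistent with the relative change of the candidate's mean QPS 
against the baseline's. The class and method names are made up for 
illustration and are not part of luceneutil, and the sketch does not try to 
reproduce the range in parentheses or the p-value.

```java
// Hypothetical helper for reading the table above; not luceneutil code.
class PctDiffSketch {
    // Relative change of the candidate's mean QPS versus the baseline's, in percent.
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // Mean QPS values copied from the TermDTSort and HighPhrase rows.
        System.out.printf("TermDTSort: %.1f%%%n", pctDiff(38.02, 36.08));
        System.out.printf("HighPhrase: %.1f%%%n", pctDiff(274.21, 279.29));
    }
}
```

Rounded to one decimal place, these come out at about -5.1% and 1.9%, which 
matches the corresponding rows in the table.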

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
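
To make the trade-off in the description concrete, here is a minimal sketch 
of the size side of it. This is a simplified cost model, not Lucene's actual 
encoder: the class name, the block size of 128, the cap of 7 exceptions, and 
the 2 bytes charged per exception are all assumptions chosen for illustration. 
FOR packs every delta in a block at the width of the largest delta, while a 
patched scheme (PFOR) packs at a narrower width and stores the few values that 
do not fit as exceptions. Either way, the deltas are then expanded into doc 
IDs with a running sum; with PFOR the exceptions have to be patched back in 
before that sum can run.

```java
import java.util.Arrays;

// Illustrative cost model only; this is not Lucene's encoder.
class PforSketch {
    static final int BLOCK_SIZE = 128;        // assumed number of deltas per block
    static final int MAX_EXCEPTIONS = 7;      // hypothetical cap on patched values
    static final int BYTES_PER_EXCEPTION = 2; // hypothetical: 1 byte index + 1 byte high bits

    // Number of bits needed to represent v (at least 1).
    static int bitsRequired(long v) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
    }

    // Width of the largest delta in the block.
    static int widestDelta(long[] deltas) {
        int widest = 1;
        for (long d : deltas) {
            widest = Math.max(widest, bitsRequired(d));
        }
        return widest;
    }

    // FOR: every delta is packed at the width of the largest one.
    static int forBytes(long[] deltas) {
        return (deltas.length * widestDelta(deltas) + 7) / 8;
    }

    // PFOR: try narrower widths and pay a fixed cost per value that does not fit.
    static int pforBytes(long[] deltas) {
        int widest = widestDelta(deltas);
        int best = forBytes(deltas);
        // An exception's high bits must fit in one byte, so bits >= widest - 8.
        for (int bits = Math.max(1, widest - 8); bits < widest; bits++) {
            int exceptions = 0;
            for (long d : deltas) {
                if (bitsRequired(d) > bits) {
                    exceptions++;
                }
            }
            if (exceptions <= MAX_EXCEPTIONS) {
                int cost = (deltas.length * bits + 7) / 8 + exceptions * BYTES_PER_EXCEPTION;
                best = Math.min(best, cost);
            }
        }
        return best;
    }

    // Expanding deltas into absolute doc IDs is a running (prefix) sum.
    static long[] decodeDocIds(long base, long[] deltas) {
        long[] docIds = new long[deltas.length];
        long doc = base;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            docIds[i] = doc;
        }
        return docIds;
    }

    public static void main(String[] args) {
        long[] deltas = new long[BLOCK_SIZE];
        Arrays.fill(deltas, 3);   // mostly small gaps between doc IDs
        deltas[10] = 1000;        // one large gap forces FOR up to 10 bits per value
        System.out.println("FOR bytes:  " + forBytes(deltas));
        System.out.println("PFOR bytes: " + pforBytes(deltas));
    }
}
```

Under these assumptions, the single outlier makes the FOR block 160 bytes, 
while the patched block is 34 bytes. The open question in this issue is 
whether savings of that kind are worth the extra decode work.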



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

