[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Thu, 25 Mar 2021 20:05:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309120#comment-17309120
 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

OK, the results are not as bad with some more optimizations in place (thanks 
[~jpountz]!), but this is still a regression. Here's what I'm seeing (still 
with "-source wikimediumall", as before):
{code:java}
                     Task  QPS baseline      StdDev  QPS pfor 7 exceptions      StdDev                Pct diff p-value
               AndHighMed       40.34      (2.6%)       38.83      (2.1%)   -3.7% (  -8% -    0%) 0.000
                  Prefix3       13.95      (1.6%)       13.55      (1.8%)   -2.9% (  -6% -    0%) 0.000
                OrHighMed       42.52      (2.5%)       41.33      (3.7%)   -2.8% (  -8% -    3%) 0.004
                OrHighLow      249.72      (3.8%)      242.87      (4.8%)   -2.7% ( -10% -    6%) 0.046
               AndHighLow      320.47      (3.7%)      311.87      (4.0%)   -2.7% ( -10% -    5%) 0.028
                LowPhrase       15.24      (2.3%)       14.88      (1.8%)   -2.3% (  -6% -    1%) 0.000
             OrNotHighLow      459.84      (4.1%)      449.82      (4.2%)   -2.2% ( -10% -    6%) 0.094
                  MedTerm      975.99      (4.2%)      954.87      (4.1%)   -2.2% ( -10% -    6%) 0.101
            OrNotHighHigh      380.66      (4.2%)      372.45      (5.6%)   -2.2% ( -11% -    7%) 0.167
             OrHighNotLow      494.46      (4.6%)      484.75      (5.8%)   -2.0% ( -11% -    8%) 0.234
                 Wildcard       57.07      (1.9%)       56.04      (1.4%)   -1.8% (  -5% -    1%) 0.001
            OrHighNotHigh      422.27      (5.4%)      414.76      (3.8%)   -1.8% ( -10% -    7%) 0.227
               OrHighHigh       13.69      (1.8%)       13.47      (3.5%)   -1.6% (  -6% -    3%) 0.065
          LowSloppyPhrase       15.05      (3.5%)       14.82      (4.0%)   -1.5% (  -8% -    6%) 0.199
                   Fuzzy2       22.48      (5.7%)       22.15      (4.7%)   -1.5% ( -11% -    9%) 0.376
             OrNotHighMed      454.42      (4.8%)      447.81      (5.3%)   -1.5% ( -11% -    9%) 0.362
               TermDTSort       43.90     (11.6%)       43.27     (10.4%)   -1.4% ( -21% -   23%) 0.678
              LowSpanNear        4.39      (2.6%)        4.32      (1.9%)   -1.4% (  -5% -    3%) 0.050
         HighSloppyPhrase        2.77      (3.2%)        2.73      (3.2%)   -1.2% (  -7% -    5%) 0.251
    HighTermDayOfYearSort        6.33     (13.6%)        6.26     (13.3%)   -1.0% ( -24% -   29%) 0.806
     HighIntervalsOrdered        1.08      (0.9%)        1.07      (1.2%)   -1.0% (  -3% -    1%) 0.003
              AndHighHigh       40.58      (3.2%)       40.22      (3.1%)   -0.9% (  -6% -    5%) 0.378
                 HighTerm      792.80      (5.1%)      789.86      (5.0%)   -0.4% (  -9% -   10%) 0.816
             OrHighNotMed      509.78      (6.4%)      508.18      (5.5%)   -0.3% ( -11% -   12%) 0.868
              MedSpanNear        4.96      (2.1%)        4.95      (1.5%)   -0.3% (  -3% -    3%) 0.666
                MedPhrase       81.04      (1.8%)       80.85      (3.0%)   -0.2% (  -4% -    4%) 0.763
          MedSloppyPhrase        9.10      (3.8%)        9.08      (3.6%)   -0.2% (  -7% -    7%) 0.851
                   IntNRQ       19.09      (0.6%)       19.05      (0.8%)   -0.2% (  -1% -    1%) 0.367
     HighTermTitleBDVSort       34.87     (11.3%)       34.86     (13.4%)   -0.0% ( -22% -   27%) 0.995
    BrowseMonthSSDVFacets        3.14      (1.0%)        3.14      (1.1%)    0.0% (  -2% -    2%) 0.976
        HighTermMonthSort       18.38     (13.3%)       18.42     (18.3%)    0.2% ( -27% -   36%) 0.969
BrowseDayOfYearSSDVFacets        2.89      (0.9%)        2.90      (1.1%)    0.2% (  -1% -    2%) 0.492
                  LowTerm      969.07      (4.7%)      971.45      (4.1%)    0.2% (  -8% -    9%) 0.860
             HighSpanNear        3.27      (2.1%)        3.28      (1.8%)    0.3% (  -3% -    4%) 0.608
                  Respell       33.76      (1.2%)       33.94      (1.3%)    0.5% (  -1% -    3%) 0.185
                 PKLookup      123.25      (2.7%)      124.06      (3.2%)    0.7% (  -5% -    6%) 0.485
               HighPhrase      218.15      (3.2%)      219.85      (3.0%)    0.8% (  -5% -    7%) 0.428
    BrowseMonthTaxoFacets        1.39      (1.7%)        1.41      (1.7%)    0.9% (  -2% -    4%) 0.110
     BrowseDateTaxoFacets        1.20      (2.2%)        1.22      (2.2%)    1.1% (  -3% -    5%) 0.114
BrowseDayOfYearTaxoFacets        1.20      (2.4%)        1.21      (2.4%)    1.2% (  -3% -    6%) 0.109
                   Fuzzy1       46.25      (8.0%)       47.43      (9.4%)    2.6% ( -13% -   21%) 0.354
{code}
The modifications to PForUtil that this was run with are 
[here|https://github.com/apache/lucene/compare/main...gsmiller:LUCENE-9850/pfordocids#diff-9f4cb4a664b2a8f0594b221368085548a58ecb1cc1290f18160b613d400fcc29].
 I'll think about whether there are further opportunities to optimize this. 
There's a lot of branching in there, but I'm not sure how much of it is 
avoidable. I'll take a fresh look at it tomorrow.
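
For readers unfamiliar with where the extra work in a PFOR decode comes from, here is a minimal sketch. It is not the code in the linked branch: the class name, method name and parameter layout are hypothetical, and it assumes the low bits of each delta have already been unpacked and that each exception is stored as a position plus the high bits that did not fit. The patch loop (and the checks around it in a real decoder) is work that plain FOR does not need before the deltas can be summed.

```java
/** Illustrative only; names and layout are assumptions, not Lucene's PForUtil. */
final class PforDecodeSketch {

  /**
   * Patches exceptions into already-unpacked low bits, then converts the
   * deltas in place into absolute doc IDs by a running sum starting at base.
   */
  public static void patchAndPrefixSum(
      long[] deltas, int bitsPerValue, int[] exceptionIndexes, int[] exceptionHighBits, long base) {
    // Extra step compared to FOR: restore the high bits of each outlier.
    for (int i = 0; i < exceptionIndexes.length; i++) {
      deltas[exceptionIndexes[i]] |= ((long) exceptionHighBits[i]) << bitsPerValue;
    }
    // Delta expansion: each doc ID is the previous doc ID plus the delta.
    long docId = base;
    for (int i = 0; i < deltas.length; i++) {
      docId += deltas[i];
      deltas[i] = docId;
    }
  }
}
```

With FOR, only the second loop is needed, which is what allows the delta expansion to be fused with the unpacking step.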

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible by 
> switching to PFOR against the performance reduction we might see from no 
> longer being able to apply the deltas in as optimal a way.
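
The size side of that trade-off can be estimated with a rough sketch like the one below. It is an illustration under stated assumptions, not Lucene's implementation: the class and method names are hypothetical, deltas are assumed non-negative, the block is assumed non-empty and at most 256 values long, each block is assumed to carry a one-byte header, and each exception is assumed to cost two bytes (position plus high bits).

```java
import java.util.Arrays;

/** Illustrative size estimate only; names are hypothetical, not Lucene APIs. */
final class PforSizeSketch {

  /** Number of bits needed to represent a non-negative value (0 for the value 0). */
  public static int bitsRequired(long value) {
    return 64 - Long.numberOfLeadingZeros(value);
  }

  /** FOR: one header byte plus every delta packed at the width of the largest delta. */
  public static long forBits(long[] deltas) {
    int maxBits = 0;
    for (long d : deltas) {
      maxBits = Math.max(maxBits, bitsRequired(d));
    }
    return 8L + (long) maxBits * deltas.length;
  }

  /**
   * PFOR: pack at a narrower width and store up to maxExceptions outliers separately.
   * The high bits of an exception are assumed to fit in one byte, so the packed
   * width can be at most 8 bits narrower than the widest delta.
   */
  public static long pforBits(long[] deltas, int maxExceptions) {
    long[] sorted = deltas.clone();
    Arrays.sort(sorted);
    int maxBits = bitsRequired(sorted[sorted.length - 1]);
    // Width that covers everything except the maxExceptions largest values.
    int cut = Math.max(0, sorted.length - 1 - maxExceptions);
    int patchedBits = Math.max(bitsRequired(sorted[cut]), maxBits - 8);
    int exceptions = 0;
    for (long d : deltas) {
      if (bitsRequired(d) > patchedBits) {
        exceptions++;
      }
    }
    return 8L + (long) patchedBits * deltas.length + 16L * exceptions;
  }
}
```

For a block of 128 deltas where every delta is 3 except a single delta of 1000, this estimate gives 1288 bits for FOR and 280 bits for PFOR with up to 7 exceptions. Real postings are rarely that skewed, so the actual saving has to be measured on an index.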



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
