[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Wed, 31 Mar 2021 17:30:07 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312792#comment-17312792 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

I gave this benchmark another run now that PFOR has been updated from 3 
allowable exceptions to 7. As expected, the index size reduction is further 
improved, but the QPS regressions appear to get worse. Here's what I see:

Note: Still using "-source wikimediumall" (wikimedium.10M.nostopwords.tasks).

The doc ID payload portion of the index is reduced 11.9% (~3.3GB -> ~2.9GB). 
The overall index is reduced 3.3% (~11.6GB -> ~11.2GB).
{code:java}
BASELINE

DOC ID BPV
 0 ****                                                [6.86 pct] (1529467 of 22287406)
 1 *                                                   [0.00 pct] (91 of 22287406)
 2 *                                                   [0.60 pct] (133848 of 22287406)
 3 **                                                  [2.09 pct] (466022 of 22287406)
 4 **                                                  [3.06 pct] (683006 of 22287406)
 5 ***                                                 [4.44 pct] (990644 of 22287406)
 6 ***                                                 [5.86 pct] (1305537 of 22287406)
 7 *****                                               [8.38 pct] (1867660 of 22287406)
 8 *****                                               [9.92 pct] (2211136 of 22287406)
 9 ******                                              [10.79 pct] (2405504 of 22287406)
10 *****                                               [9.77 pct] (2178356 of 22287406)
11 *****                                               [8.61 pct] (1919968 of 22287406)
12 ****                                                [7.63 pct] (1701251 of 22287406)
13 ****                                                [6.40 pct] (1426872 of 22287406)
14 ***                                                 [4.94 pct] (1101624 of 22287406)
15 **                                                  [3.62 pct] (806380 of 22287406)
16 **                                                  [2.62 pct] (583235 of 22287406)
17 *                                                   [1.83 pct] (407402 of 22287406)
18 *                                                   [1.28 pct] (285690 of 22287406)
19 *                                                   [0.78 pct] (172866 of 22287406)
20 *                                                   [0.27 pct] (59108 of 22287406)
21 *                                                   [0.12 pct] (26582 of 22287406)
22 *                                                   [0.08 pct] (17481 of 22287406)
23 *                                                   [0.03 pct] (7676 of 22287406)
24                                                     [0.00 pct] (0 of 22287406)
25                                                     [0.00 pct] (0 of 22287406)
26                                                     [0.00 pct] (0 of 22287406)
27                                                     [0.00 pct] (0 of 22287406)
28                                                     [0.00 pct] (0 of 22287406)
29                                                     [0.00 pct] (0 of 22287406)
30                                                     [0.00 pct] (0 of 22287406)
31                                                     [0.00 pct] (0 of 22287406)
Total bytes used: 3295496560

NEW CANDIDATE (PFOR doc IDs with 7 exceptions)

DOC ID BPV
 0 ****                                                [7.07 pct] (1576532 of 22287406)
 1 *                                                   [1.44 pct] (321744 of 22287406)
 2 **                                                  [3.74 pct] (834608 of 22287406)
 3 ***                                                 [4.58 pct] (1019776 of 22287406)
 4 ***                                                 [5.70 pct] (1271157 of 22287406)
 5 ****                                                [6.56 pct] (1463046 of 22287406)
 6 *****                                               [9.28 pct] (2068438 of 22287406)
 7 *****                                               [9.71 pct] (2163462 of 22287406)
 8 *****                                               [9.41 pct] (2097645 of 22287406)
 9 *****                                               [8.58 pct] (1911927 of 22287406)
10 *****                                               [8.08 pct] (1801505 of 22287406)
11 ****                                                [6.92 pct] (1542164 of 22287406)
12 ***                                                 [5.52 pct] (1231201 of 22287406)
13 ***                                                 [4.30 pct] (957713 of 22287406)
14 **                                                  [3.37 pct] (750159 of 22287406)
15 **                                                  [2.38 pct] (531051 of 22287406)
16 *                                                   [1.65 pct] (367735 of 22287406)
17 *                                                   [1.15 pct] (255594 of 22287406)
18 *                                                   [0.52 pct] (116752 of 22287406)
19 *                                                   [0.02 pct] (5197 of 22287406)
20                                                     [0.00 pct] (0 of 22287406)
21                                                     [0.00 pct] (0 of 22287406)
22                                                     [0.00 pct] (0 of 22287406)
23                                                     [0.00 pct] (0 of 22287406)
24                                                     [0.00 pct] (0 of 22287406)
25                                                     [0.00 pct] (0 of 22287406)
26                                                     [0.00 pct] (0 of 22287406)
27                                                     [0.00 pct] (0 of 22287406)
28                                                     [0.00 pct] (0 of 22287406)
29                                                     [0.00 pct] (0 of 22287406)
30                                                     [0.00 pct] (0 of 22287406)
31                                                     [0.00 pct] (0 of 22287406)
Total bytes used: 2904198119
{code}
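As a quick sanity check on the percentages quoted above, the doc ID payload reduction can be recomputed from the two "Total bytes used" figures. This is only an illustrative sketch; the class and method names are made up for this example.

```java
final class SizeReductionSketch {
    /** Percentage by which {@code candidateBytes} is smaller than {@code baselineBytes}. */
    static double reductionPct(long baselineBytes, long candidateBytes) {
        return 100.0 * (baselineBytes - candidateBytes) / baselineBytes;
    }

    public static void main(String[] args) {
        long baseline = 3295496560L;   // "Total bytes used" for BASELINE
        long candidate = 2904198119L;  // "Total bytes used" for the PFOR candidate
        // Prints a value of roughly 11.87, i.e. the ~11.9% quoted above
        System.out.printf("%.2f%%%n", reductionPct(baseline, candidate));
    }
}
```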
QPS regressions as follows:
{code:java}
                     Task  QPS baseline      StdDev  QPS pfordocids      StdDev  Pct diff                 p-value
                  Prefix3        163.80     (13.3%)          145.05      (8.8%)    -11.4% ( -29% -   12%)   0.001
               AndHighMed         55.87      (4.5%)           51.35      (2.6%)     -8.1% ( -14% -    0%)   0.000
              LowSpanNear          8.15      (1.8%)            7.69      (1.8%)     -5.6% (  -8% -   -2%)   0.000
             OrNotHighMed        511.04      (7.0%)          484.78      (5.0%)     -5.1% ( -16% -    7%)   0.008
               AndHighLow        295.02      (3.5%)          279.93      (3.1%)     -5.1% ( -11% -    1%)   0.000
             OrNotHighLow        516.68      (6.4%)          491.41      (4.7%)     -4.9% ( -15% -    6%)   0.006
             HighSpanNear         12.33      (2.0%)           11.74      (1.6%)     -4.7% (  -8% -   -1%)   0.000
            OrNotHighHigh        398.33      (6.7%)          381.31      (6.8%)     -4.3% ( -16% -    9%)   0.046
              MedSpanNear          7.42      (2.0%)            7.14      (2.2%)     -3.8% (  -7% -    0%)   0.000
                 Wildcard        148.87     (11.6%)          143.67     (10.4%)     -3.5% ( -22% -   20%)   0.315
        HighTermMonthSort         35.65     (15.2%)           34.48     (12.2%)     -3.3% ( -26% -   28%)   0.454
              AndHighHigh         17.88      (2.7%)           17.32      (2.9%)     -3.1% (  -8% -    2%)   0.000
                MedPhrase         11.16      (4.2%)           10.83      (3.2%)     -3.0% (  -9% -    4%)   0.013
               TermDTSort         41.80     (14.5%)           40.67     (12.0%)     -2.7% ( -25% -   27%)   0.522
                LowPhrase         38.27      (5.0%)           37.26      (4.5%)     -2.6% ( -11% -    7%)   0.082
            OrHighNotHigh        553.81      (8.1%)          541.01      (7.5%)     -2.3% ( -16% -   14%)   0.347
          MedSloppyPhrase          7.30      (2.1%)            7.14      (3.1%)     -2.2% (  -7% -    3%)   0.008
                OrHighMed         40.15      (3.4%)           39.27      (2.8%)     -2.2% (  -8% -    4%)   0.027
               OrHighHigh          7.29      (2.6%)            7.13      (2.8%)     -2.2% (  -7% -    3%)   0.011
                OrHighLow        166.87      (5.1%)          163.23      (4.3%)     -2.2% ( -11% -    7%)   0.145
     HighTermTitleBDVSort         18.65     (10.8%)           18.25     (12.6%)     -2.1% ( -22% -   23%)   0.569
    HighTermDayOfYearSort         30.11     (12.3%)           29.57     (11.8%)     -1.8% ( -22% -   25%)   0.641
         HighSloppyPhrase          5.12      (2.4%)            5.03      (3.6%)     -1.7% (  -7% -    4%)   0.079
               HighPhrase        113.33      (6.4%)          111.61      (6.0%)     -1.5% ( -13% -   11%)   0.437
                   Fuzzy2         36.81      (7.1%)           36.33      (7.9%)     -1.3% ( -15% -   14%)   0.584
     HighIntervalsOrdered          8.65      (1.5%)            8.54      (1.7%)     -1.2% (  -4% -    2%)   0.016
          LowSloppyPhrase         68.84      (1.7%)           68.10      (2.4%)     -1.1% (  -5% -    3%)   0.101
             OrHighNotLow        517.95      (8.7%)          514.80      (7.0%)     -0.6% ( -15% -   16%)   0.807
                  MedTerm        907.88      (6.4%)          902.40      (7.2%)     -0.6% ( -13% -   13%)   0.779
                  Respell         31.17      (2.8%)           31.10      (2.7%)     -0.2% (  -5% -    5%)   0.785
    BrowseMonthSSDVFacets          3.15      (1.6%)            3.14      (1.1%)     -0.2% (  -2% -    2%)   0.662
    BrowseMonthTaxoFacets          1.40      (1.7%)            1.40      (1.2%)      0.2% (  -2% -    3%)   0.694
                 HighTerm        713.93      (4.2%)          715.28      (4.6%)      0.2% (  -8% -    9%)   0.893
             OrHighNotMed        445.75      (7.5%)          446.79      (9.4%)      0.2% ( -15% -   18%)   0.931
BrowseDayOfYearTaxoFacets          1.21      (2.7%)            1.21      (2.6%)      0.3% (  -4% -    5%)   0.761
     BrowseDateTaxoFacets          1.21      (2.5%)            1.22      (2.4%)      0.3% (  -4% -    5%)   0.710
BrowseDayOfYearSSDVFacets          2.89      (1.5%)            2.90      (1.1%)      0.4% (  -2% -    2%)   0.352
                 PKLookup        128.08      (5.8%)          128.60      (5.7%)      0.4% ( -10% -   12%)   0.822
                   IntNRQ         15.95     (18.3%)           16.05     (18.9%)      0.6% ( -30% -   46%)   0.914
                  LowTerm        906.79      (5.0%)          925.27      (4.9%)      2.0% (  -7% -   12%)   0.193
                   Fuzzy1         38.59      (6.4%)           39.40      (6.5%)      2.1% ( -10% -   16%)   0.301
{code}
 

I'd love to find a way to cut down on this QPS regression since there's a 
decent index size reduction to be had here. I'll have to see if I can figure 
out any way to further optimize this.
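
For readers less familiar with the trade-off, here is a rough sketch of why allowing more exceptions shrinks the BPV histogram. It is a simplified illustration and not Lucene's actual PForUtil code: the class name, method names, and the assumption that each exception stores at most 8 overflow bits are choices made for this example only.

```java
import java.util.Arrays;

final class PforBitWidthSketch {
    static final int BLOCK_SIZE = 128; // assumed block size for this illustration

    /** Number of bits needed to represent v (0 needs 0 bits). */
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    /** FOR: every value in the block is packed at the width of the largest value. */
    static int forBitsPerValue(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    /** PFOR: the maxExceptions largest values may overflow and be patched separately. */
    static int pforBitsPerValue(long[] deltas, int maxExceptions) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int maxBits = bitsRequired(sorted[sorted.length - 1]);
        // Largest value that still has to fit in the packed width without patching.
        long cutoff = sorted[Math.max(0, sorted.length - 1 - maxExceptions)];
        // Assumption: an exception keeps its overflow in one byte, so at most 8 bits are trimmed.
        return Math.max(bitsRequired(cutoff), maxBits - 8);
    }

    public static void main(String[] args) {
        long[] deltas = new long[BLOCK_SIZE];
        Arrays.fill(deltas, 5L);                 // most deltas need 3 bits
        long[] outliers = {600, 700, 800, 900, 1000};
        for (int i = 0; i < outliers.length; i++) {
            deltas[i * 20] = outliers[i];        // five outliers that need 10 bits
        }
        System.out.println("FOR bpv:              " + forBitsPerValue(deltas));
        System.out.println("PFOR bpv (3 allowed): " + pforBitsPerValue(deltas, 3));
        System.out.println("PFOR bpv (7 allowed): " + pforBitsPerValue(deltas, 7));
    }
}
```

In this made-up block, five outliers are too many to patch with 3 exceptions, so the block stays at the wide FOR width, while 7 exceptions let it drop to the width of the common deltas. The cost is the extra patching work at decode time.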

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
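
To make "expanding the deltas" concrete, the sketch below shows the step in question: turning a block of doc ID deltas into absolute doc IDs with a running sum, and the extra patching pass a PFOR-style block would need before that sum can run. This is a hedged illustration, not Lucene's decoder; the class, method names, and exception layout (an index array plus a high-bits array) are assumptions made for this example.

```java
final class DeltaExpandSketch {
    /** Expands doc ID deltas in place into absolute doc IDs (a running prefix sum). */
    static void prefixSum(long[] deltas, long base) {
        long docId = base;
        for (int i = 0; i < deltas.length; i++) {
            docId += deltas[i];
            deltas[i] = docId;
        }
    }

    /**
     * PFOR-style decode: the packed values hold only the low bits, so each exception's
     * high bits must be restored before the prefix sum can be computed.
     */
    static void patchThenPrefixSum(long[] lowBits, int bitsPerValue,
                                   int[] exceptionIndex, long[] exceptionHighBits,
                                   long base) {
        for (int i = 0; i < exceptionIndex.length; i++) {
            lowBits[exceptionIndex[i]] |= exceptionHighBits[i] << bitsPerValue;
        }
        prefixSum(lowBits, base);
    }
}
```

With plain FOR the decoder can go straight from unpacking to the running sum. The patching pass in the second method is the additional work the description above is concerned about.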


