[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315066#comment-17315066
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
Thanks [~mikemccand]! For starters, yes—all the runs I referenced here are
using "wikimediumall".
{quote}I would expect {{XTerm}} to show speedups since this is largely
dominated by decoding many postings blocks. But it is odd to see the
{{XTermYSort}} tasks negatively impacted: those tasks are just sorting by a
{{DocValues}} field instead of default text relevance (BM25).
{quote}
I might expect the opposite, actually. Any time the PFOR approach has to apply
exceptions, I would expect some performance hit, since it has extra work to do
on top of what the FOR approach does today. So if the Term tasks are largely
dominated by postings decoding, I would expect regressions to show up there
more than elsewhere. Maybe I'm misunderstanding your comment though?
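To make that "extra work" concrete, here's a toy sketch of the patching step PFOR adds on top of a plain FOR-style decode. This is hypothetical illustration code, not Lucene's actual {{ForUtil}}/{{PForUtil}}: PFOR packs most values at a reduced bit width and stores only the low bits of the few outliers ("exceptions") in the main block, so after the bulk unpack it must loop over the exceptions and OR their high bits back in, which is the step FOR never performs.

```java
import java.util.Arrays;

public class PforSketch {

    // PFOR decode sketch: `packed` holds the already-unpacked low
    // `bitsPerValue` bits of each value. Exceptions are values whose high
    // bits didn't fit; they get patched back in after the bulk decode.
    // (A plain FOR decode would just return `packed` as-is, since every
    // value is stored at the full bit width of the largest value.)
    static int[] pforDecode(int[] packed, int bitsPerValue,
                            int[] exceptionPositions, int[] exceptionHighBits) {
        int[] values = Arrays.copyOf(packed, packed.length);
        for (int i = 0; i < exceptionPositions.length; i++) {
            // The per-exception work FOR never does.
            values[exceptionPositions[i]] |= exceptionHighBits[i] << bitsPerValue;
        }
        return values;
    }

    public static void main(String[] args) {
        // Values {3, 1, 70, 2} with bitsPerValue = 3: 70 = 0b1000110, so its
        // low 3 bits (6) live in the block and its high bits (0b1000 = 8)
        // are stored as an exception at position 2.
        int[] decoded = pforDecode(new int[] {3, 1, 6, 2}, 3,
                                   new int[] {2}, new int[] {8});
        System.out.println(Arrays.toString(decoded)); // [3, 1, 70, 2]
    }
}
```

The patch loop is cheap per block, but on Term-style tasks that decode enormous numbers of blocks it is the obvious place for a regression to hide.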
I re-ran wikimediumall with only "Term" tasks and got the following (looks like
noise to me):
{code:java}
                 Task    QPS baseline      StdDev    QPS pfordocids      StdDev                Pct diff   p-value
HighTermDayOfYearSort            5.74     (10.8%)              5.59      (9.9%)   -2.6% ( -20% -   20%)     0.431
           TermDTSort           44.54     (15.4%)             44.06     (14.3%)   -1.1% ( -26% -   33%)     0.816
 HighTermTitleBDVSort           30.90     (14.3%)             30.59     (13.5%)   -1.0% ( -25% -   31%)     0.820
              MedTerm          392.89      (6.8%)            389.45      (7.5%)   -0.9% ( -14% -   14%)     0.699
              LowTerm          412.80      (7.0%)            410.68      (7.9%)   -0.5% ( -14% -   15%)     0.827
             PKLookup          130.70      (2.8%)            131.44      (2.0%)    0.6% (  -4% -    5%)     0.470
    HighTermMonthSort           61.69     (12.2%)             62.13     (13.6%)    0.7% ( -22% -   30%)     0.860
             HighTerm          381.73     (10.0%)            385.03      (7.8%)    0.9% ( -15% -   20%)     0.761
{code}
I also pulled out all the tasks in my last wikimediumall run that had a
significant change in either direction (p-value <= 0.05) and reran them alone
to see if the results were repeatable. In general, they were not. The three
that *were* repeatable (significant) regressions were LowSpanNear (-2.2%),
AndHighMed (-2.1%) and AndHighHigh (-2.0%):
{code:java}
                 Task    QPS baseline      StdDev    QPS pfordocids      StdDev                Pct diff   p-value
 HighTermTitleBDVSort           42.09     (10.6%)             41.04     (10.4%)   -2.5% ( -21% -   20%)     0.451
          LowSpanNear            4.29      (1.9%)              4.20      (1.4%)   -2.2% (  -5% -    1%)     0.000
           AndHighMed           26.79      (3.0%)             26.23      (2.4%)   -2.1% (  -7% -    3%)     0.014
          AndHighHigh           13.83      (3.4%)             13.54      (2.8%)   -2.0% (  -7% -    4%)     0.037
             HighTerm          696.28      (6.7%)            688.02      (7.0%)   -1.2% ( -13% -   13%)     0.585
        OrNotHighHigh          372.40      (5.3%)            370.23      (5.3%)   -0.6% ( -10% -   10%)     0.726
HighTermDayOfYearSort           13.53      (9.8%)             13.46     (10.4%)   -0.5% ( -18% -   21%)     0.866
          MedSpanNear           19.76      (1.7%)             19.67      (1.4%)   -0.5% (  -3% -    2%)     0.320
    HighTermMonthSort           48.94     (11.2%)             48.77     (10.5%)   -0.3% ( -19% -   24%)     0.921
              LowTerm          714.77      (3.0%)            714.27      (5.6%)   -0.1% (  -8% -    8%)     0.960
             PKLookup          139.66      (3.0%)            139.69      (2.4%)    0.0% (  -5% -    5%)     0.984
               IntNRQ           33.49      (1.4%)             33.54      (1.2%)    0.2% (  -2% -    2%)     0.690
{code}
Finally, I tried running a micro-benchmark to see if I could isolate how much
regression there was due to applying the exceptions in PFOR compared to FOR. I
forked [~jpountz]'s code originally used for optimizing FOR. The results are in
the README over
[here|https://github.com/gsmiller/decode-128-ints-benchmark/tree/pfor-delta] if
interested. They generally make sense to me and show that performance does take
a hit when there are exceptions to patch in, but the overall luceneutil
benchmarks suggest those performance shifts mostly aren't showing up in the
bigger picture.
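A very rough sketch of what that micro-benchmark isolates, timing the decode of a 128-int block (Lucene's postings block size) with the exception-patching step on top. This is naive illustration code with hypothetical exception slots, not the linked benchmark itself, which uses a proper harness with warmup and measurement iterations:

```java
import java.util.Random;

public class DecodeCostSketch {
    static final int BLOCK = 128; // Lucene decodes postings in blocks of 128 ints

    // The PFOR-only step: OR each exception's high bits back into the block.
    static void patch(int[] block, int[] positions, int[] highBits, int bitsPerValue) {
        for (int i = 0; i < positions.length; i++) {
            block[positions[i]] |= highBits[i] << bitsPerValue;
        }
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        int[] block = r.ints(BLOCK, 0, 1 << 5).toArray();
        int[] positions = {7, 40, 99};   // hypothetical exception slots
        int[] highBits = {2, 1, 3};
        long sink = 0;
        // Naive timing loop; JVM warmup and dead-code elimination
        // are not accounted for here the way a real harness would.
        long start = System.nanoTime();
        for (int iter = 0; iter < 1_000_000; iter++) {
            int[] copy = block.clone();
            patch(copy, positions, highBits, 5);
            sink += copy[7]; // keep the result live
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("patched 1M blocks in ~" + elapsedMs + " ms (sink=" + sink + ")");
    }
}
```

Dropping the `patch` call gives the FOR-style baseline; the delta between the two runs is the per-block cost of applying exceptions.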
So overall, it seems like the index size reduction might be worth it here, but
I'm new to Lucene benchmarks and this is my first attempt at running a
micro-benchmark, so I'd defer to others with more experienced eyes here.
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
> Attachments: apply_exceptions.png, bulk_read_1.png, bulk_read_2.png,
> for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible by
> switching to PFOR, compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.