[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

Greg Miller (Jira) Tue, 23 Mar 2021 10:00:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307226#comment-17307226
 ]


Greg Miller commented on LUCENE-9850:
-------------------------------------

{quote}I wonder if it would help to make the encoding/decoding aware that not 
all numbers of bits per value are equal. For instance the benchmarks 
([https://github.com/jpountz/decode-128-ints-benchmark]) I ran when looking 
into vectorizing decoding suggested that throughputs were highly dependent on 
the number of bits per value. So maybe we could tune PFOR to never e.g. go from 
16 bits per value to 15 because the savings are small while the decoding is 
significantly slower.
{quote}
Yeah, interesting thought! I experimented with a similar idea: always rounding 
up to a power-of-2 bpv (i.e., 1, 2, 4, 8, 16), since the code for decoding those 
bpv's appears much simpler and more optimized. I wasn't aware of your benchmark 
results at the time I tried this, but it seems to generally align with your 
findings (with maybe a couple of exceptions). This increased our red-line 
queries/sec by 2.3% but came at the cost of a 9.6% larger index (yikes)! Most of 
the index size growth came from rounding up everything larger than 8 but 
less than 16. When I capped the rounding at 8, the index only grew by 1%, but 
the red-line queries/sec improvement was only 0.7%. I think you're right though: 
there's probably some interesting work in not rounding everything up, and 
instead being more precise about which bpv's we try to avoid.
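
For illustration, the rounding experiment described above might look roughly 
like the sketch below. This is not the actual change that was benchmarked; the 
class name BpvRounding, the method roundUpBpv, and the cap parameter are all 
hypothetical names chosen for this example.

```java
// Hypothetical sketch: round a bits-per-value (bpv) count up to the next
// power of two, but only when that power of two does not exceed a cap.
final class BpvRounding {

    /**
     * Returns the smallest power of two >= bitsPerValue if it is <= cap,
     * otherwise returns bitsPerValue unchanged. Values of 0 and 1 are
     * returned as-is.
     */
    static int roundUpBpv(int bitsPerValue, int cap) {
        if (bitsPerValue <= 1) {
            return bitsPerValue;
        }
        // highestOneBit(n - 1) << 1 is the smallest power of two >= n for n >= 2.
        int rounded = Integer.highestOneBit(bitsPerValue - 1) << 1;
        return rounded <= cap ? rounded : bitsPerValue;
    }
}
```

With cap = 16, a bpv of 9 through 15 is stored as 16, which is where most of 
the index growth came from. With cap = 8, those values are left alone and only 
3, 5, 6 and 7 are rounded up.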
{quote}Also maybe the PFOR decoding logic could still optimize the prefix sum 
in the case when there are no exceptions?
{quote}
This is a great suggestion! I'll add that logic.
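
A minimal sketch of the idea, assuming a simplified block layout in which the 
low bits have already been unpacked and the exceptions are given as positions 
plus high bits. PforDeltaSketch and its methods are hypothetical names for 
this example, not Lucene's API, and the real decoder may be structured quite 
differently.

```java
// Hypothetical sketch of PFOR doc ID delta decoding with a fast path for
// blocks that contain no exceptions.
final class PforDeltaSketch {

    /** Turns deltas into absolute values in place, starting from base. */
    static void prefixSum(long[] values, long base) {
        long running = base;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            values[i] = running;
        }
    }

    /** Patches each exception by OR-ing its high bits above the low bits. */
    static void applyExceptions(
            long[] values, int bitsPerValue, int[] positions, long[] highBits) {
        for (int i = 0; i < positions.length; i++) {
            values[positions[i]] |= highBits[i] << bitsPerValue;
        }
    }

    /** Decodes one block of deltas into doc IDs, in place in lowBits. */
    static void decodeDocIds(
            long[] lowBits, int bitsPerValue,
            int[] exceptionPositions, long[] exceptionHighBits, long base) {
        if (exceptionPositions.length == 0) {
            // Fast path: nothing to patch. In this sketch it is just a plain
            // prefix sum; it stands in for whatever fused unpack + prefix sum
            // routine the FOR path can use.
            prefixSum(lowBits, base);
            return;
        }
        // Slow path: deltas must be fully patched before they can be summed.
        applyExceptions(lowBits, bitsPerValue, exceptionPositions, exceptionHighBits);
        prefixSum(lowBits, base);
    }
}
```

The point of the branch is that an exception-free PFOR block is effectively a 
FOR block, so it can reuse the optimized delta expansion instead of paying for 
a separate patching pass.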

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible from 
> switching to PFOR, compared to the performance loss we might see from no 
> longer being able to apply the deltas as efficiently.



