[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307226#comment-17307226
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
{quote}I wonder if it would help to make the encoding/decoding aware that not
all numbers of bits per value are equal. For instance the benchmarks
([https://github.com/jpountz/decode-128-ints-benchmark]) I ran when looking
into vectorizing decoding suggested that throughputs were highly dependent on
the number of bits per value. So maybe we could tune PFOR to never e.g. go from
16 bits per value to 15 because the savings are small while the decoding is
significantly slower.
{quote}
Yeah, interesting thought! I experimented with a similar idea to always round
up to powers of 2 bpv (i.e., 1, 2, 4, 8, 16) since the code for decoding those
bpv's appears much simpler and more optimized. I wasn't aware of your benchmark
results at the time I tried this, but it seems to generally align with your
findings (with maybe a couple exceptions). This increased our red-line
queries/sec by +2.3% but came at the cost of +9.6% index size (yikes)! Most of
the index size growth was coming from rounding up everything larger than 8 but
less than 16. When I capped the rounding to 8, the index only grew by +1% but
red-line queries/sec improvements were only +0.7%. I think you're right though,
in that there's probably some interesting work in not rounding everything up,
but instead being more selective about which bpv's we try to avoid.
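
For illustration, here is a minimal sketch of the rounding experiment described above. It is not the code that was benchmarked: the class and method names are hypothetical, the block size of 128 values is an assumption, and so is the choice to leave any bpv whose rounded value would exceed the cap unchanged.

```java
/** Hypothetical helper, for illustration only; not part of Lucene. */
final class BpvRounding {
  /** Assumed number of values per encoded block. */
  static final int BLOCK_SIZE = 128;

  private BpvRounding() {}

  /**
   * Rounds bitsPerValue up to the next power of two, but only if the rounded
   * value does not exceed maxRounded. Otherwise bitsPerValue is returned as is.
   */
  static int roundUp(int bitsPerValue, int maxRounded) {
    if (bitsPerValue <= 1) {
      return bitsPerValue; // 0 and 1 need no rounding
    }
    // Smallest power of two that is >= bitsPerValue.
    int rounded = Integer.highestOneBit(bitsPerValue - 1) << 1;
    return rounded <= maxRounded ? rounded : bitsPerValue;
  }

  /** Bytes needed to bit-pack one block at the given bits per value. */
  static int blockBytes(int bitsPerValue) {
    return BLOCK_SIZE * bitsPerValue / 8;
  }
}
```

With a cap of 16, a block that needs 9 bpv is stored at 16 bpv, so under these assumptions it grows from 144 to 256 bytes. With a cap of 8, the same block stays at 9 bpv, which is consistent with most of the size growth coming from the 9 to 15 range.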
{quote}Also maybe the PFOR decoding logic could still optimize the prefix sum
in the case when there are no exceptions?
{quote}
This is a great suggestion! I'll add that logic.
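
A rough sketch of what that fast path could look like, under simplifying assumptions: the low bits are taken as already unpacked into a long[], exceptions are given as parallel arrays of positions and high bits, and all names are hypothetical rather than Lucene's actual API. In a real codec, the no-exception branch is where a fused decode-and-prefix-sum routine (as used for FOR) would be called; here it simply skips the patching pass.

```java
/** Hypothetical PFOR delta decoder, for illustration only; not Lucene's API. */
final class PforDeltaSketch {
  private PforDeltaSketch() {}

  /**
   * Turns a block of PFOR-encoded doc ID deltas into absolute doc IDs.
   *
   * @param lowBits low bitsPerValue bits of each delta (already unpacked)
   * @param bitsPerValue number of low bits stored for every value
   * @param exceptionIndex positions of values that did not fit in bitsPerValue
   * @param exceptionHighBits remaining high bits for each exception
   * @param base doc ID preceding this block
   * @param docIds output array, same length as lowBits
   */
  static void decodeAndPrefixSum(
      long[] lowBits,
      int bitsPerValue,
      int[] exceptionIndex,
      long[] exceptionHighBits,
      long base,
      long[] docIds) {
    if (exceptionIndex.length == 0) {
      // Fast path: the block is plain FOR, so the deltas are already complete.
      prefixSum(lowBits, base, docIds);
      return;
    }
    // Slow path: patch the exceptions first, then expand the deltas.
    long[] deltas = lowBits.clone();
    for (int i = 0; i < exceptionIndex.length; i++) {
      deltas[exceptionIndex[i]] |= exceptionHighBits[i] << bitsPerValue;
    }
    prefixSum(deltas, base, docIds);
  }

  /** Writes the running sum of deltas, starting from base, into out. */
  private static void prefixSum(long[] deltas, long base, long[] out) {
    long sum = base;
    for (int i = 0; i < deltas.length; i++) {
      sum += deltas[i];
      out[i] = sum;
    }
  }
}
```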
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.