[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304171#comment-17304171
 ] 

Greg Miller edited comment on LUCENE-9850 at 3/26/21, 7:22 PM:
---------------------------------------------------------------

It may also be worth noting here that the postings in the index I ran this 
analysis on don't appear to be particularly dense, and it gets harder to 
drop a required bit using PFOR as sparseness increases. Right now, with the 
three exceptions allowed in PFOR, the 4th largest doc ID delta in a block must 
need at least one fewer bit than the largest (roughly, be no more than half 
its size) to shrink the bits required. You can see this effect in the 
histograms: more of the blocks with smaller bits-per-value in FOR are 
"shifting down" a bit with PFOR than the blocks with larger bits-per-value. As 
the bits-per-value required gets larger, it appears "harder" for the outliers 
to be at least twice as large as the remaining values.
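
As a rough illustration of the rule above, here is a minimal Java sketch that 
compares the bits-per-value a block would need under FOR with what it would 
need if the three largest deltas were patched as exceptions. The class and 
method names are hypothetical, the block is much shorter than a real one, and 
the sketch deliberately ignores the cost of storing the exceptions and any 
other limits the actual PFOR encoder applies. It is not Lucene's code.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene's PFOR implementation.
public class PforBitsSketch {

    // Number of bits needed to represent a non-negative value (0 for value 0).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    // FOR: every value is packed with the width of the largest value.
    static int forBitsPerValue(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    // Simplified PFOR: the maxExceptions largest values are assumed to be
    // patched separately, so the width is set by the next largest value.
    static int pforBitsPerValue(long[] deltas, int maxExceptions) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int idx = sorted.length - 1 - maxExceptions;
        return idx < 0 ? 0 : bitsRequired(sorted[idx]);
    }

    public static void main(String[] args) {
        // Three clear outliers: the 4th largest delta (7) needs far fewer bits.
        long[] withOutliers = {3, 5, 7, 6, 40, 90, 100};
        // No clear outliers: the 4th largest delta (70) needs as many bits as 100.
        long[] noOutliers = {60, 70, 80, 100, 90, 65};

        System.out.println("FOR  with outliers: " + forBitsPerValue(withOutliers));
        System.out.println("PFOR with outliers: " + pforBitsPerValue(withOutliers, 3));
        System.out.println("FOR  no outliers:   " + forBitsPerValue(noOutliers));
        System.out.println("PFOR no outliers:   " + pforBitsPerValue(noOutliers, 3));
    }
}
```

In the first block the width drops from 7 bits to 3 because the 4th largest 
delta is well under half the largest. In the second block nothing is gained, 
since 70 and 100 both need 7 bits.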

I'd be really curious to see what kind of impact this might have on more dense 
postings. As [~jpountz] noted in our email conversation (referenced above), 
there's also a bigger relative gain to be had when shrinking bits-per-value on 
denser postings (dropping from 5 to 4 is a better relative gain than 15 to 14 
for example).
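
To put numbers on that comparison, the small Java sketch below computes the 
fraction of packed bits saved when a block's width drops by one bit. The class 
and method names are hypothetical, and the calculation ignores the extra bytes 
PFOR would spend on exceptions, so it is an upper bound on the saving rather 
than a measurement.

```java
// Hypothetical sketch: relative saving from dropping bits-per-value,
// ignoring the cost of storing exceptions.
public class RelativeGainSketch {

    // Fraction of the packed size saved when going from fromBits to toBits.
    static double relativeSaving(int fromBits, int toBits) {
        return (fromBits - toBits) / (double) fromBits;
    }

    public static void main(String[] args) {
        System.out.printf("5 -> 4:   %.1f%%%n", 100 * relativeSaving(5, 4));
        System.out.printf("15 -> 14: %.1f%%%n", 100 * relativeSaving(15, 14));
    }
}
```

Dropping from 5 to 4 bits saves 1/5 of the packed bits (20%), while dropping 
from 15 to 14 saves only 1/15 (about 6.7%).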



> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
>                 Key: LUCENE-9850
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9850
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.



