[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307210#comment-17307210 ]
Michael McCandless commented on LUCENE-9850:
--------------------------------------------
Here's the result of {{bpv-tool-only}} on the Lucene nightly benchmarks (EN
Wikipedia) index:
{noformat}
DOC ID BPV
 0 ****   [7.93 pct] (1466075 of 18484892)
 1 *      [0.00 pct] (165 of 18484892)
 2 *      [0.58 pct] (106653 of 18484892)
 3 **     [2.17 pct] (400444 of 18484892)
 4 **     [3.16 pct] (584748 of 18484892)
 5 ***    [4.51 pct] (833082 of 18484892)
 6 ***    [5.86 pct] (1082974 of 18484892)
 7 *****  [8.45 pct] (1561144 of 18484892)
 8 *****  [9.93 pct] (1835188 of 18484892)
 9 ****** [10.66 pct] (1970466 of 18484892)
10 *****  [9.68 pct] (1788853 of 18484892)
11 *****  [8.62 pct] (1594306 of 18484892)
12 ****   [7.62 pct] (1409009 of 18484892)
13 ****   [6.23 pct] (1151456 of 18484892)
14 ***    [4.72 pct] (872013 of 18484892)
15 **     [3.46 pct] (640401 of 18484892)
16 **     [2.52 pct] (466228 of 18484892)
17 *      [1.73 pct] (320292 of 18484892)
18 *      [1.19 pct] (220389 of 18484892)
19 *      [0.62 pct] (114238 of 18484892)
20 *      [0.21 pct] (38229 of 18484892)
21 *      [0.09 pct] (16846 of 18484892)
22 *      [0.05 pct] (9250 of 18484892)
23 *      [0.01 pct] (2443 of 18484892)
24        [0.00 pct] (0 of 18484892)
25        [0.00 pct] (0 of 18484892)
26        [0.00 pct] (0 of 18484892)
27        [0.00 pct] (0 of 18484892)
28        [0.00 pct] (0 of 18484892)
29        [0.00 pct] (0 of 18484892)
30        [0.00 pct] (0 of 18484892)
31        [0.00 pct] (0 of 18484892)
Total bytes used: 20912256
{noformat}
Curious how many 0-bit cases there are!
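
For anyone who wants to reproduce a report of this shape on their own data, here is a minimal Python sketch. It is not the tool used above. It assumes the histogram counts blocks of 128 doc ID deltas by the bits needed for the largest delta, and that a block whose deltas are all equal counts as 0 bits. The names {{bits_per_value}}, {{bpv_histogram}} and {{format_histogram}} are hypothetical, and the star scaling (one star per started 2 pct) is a guess.

```python
import math
from collections import Counter

BLOCK_SIZE = 128  # assumption: number of doc ID deltas packed per block


def bits_per_value(block):
    """Bits per value for one block under the assumed convention."""
    if all(v == block[0] for v in block):
        return 0  # assumption: a constant block needs no per-value bits
    return max(block).bit_length()


def bpv_histogram(deltas, block_size=BLOCK_SIZE):
    """Count full blocks by bits per value; a trailing partial block is skipped."""
    counts = Counter()
    for start in range(0, len(deltas) - block_size + 1, block_size):
        counts[bits_per_value(deltas[start:start + block_size])] += 1
    return counts


def format_histogram(counts, max_bpv=31):
    """Render one row per bit width, similar in shape to the report above."""
    total = sum(counts.values())
    rows = []
    for bpv in range(max_bpv + 1):
        n = counts.get(bpv, 0)
        pct = 100.0 * n / total if total else 0.0
        stars = "*" * math.ceil(pct / 2)  # guessed scaling, not the tool's
        rows.append(f"{bpv} {stars} [{pct:.2f} pct] ({n} of {total})")
    return rows


if __name__ == "__main__":
    # Two synthetic blocks: one constant, one with a single large delta.
    sample = [1] * 128 + [3] * 127 + [300]
    for row in format_histogram(bpv_histogram(sample), max_bpv=9):
        print(row)
```

With this convention the 0-bit row counts only blocks where every delta is identical, which is one possible reading of the "0-bit cases" above.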
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
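
The size side of that trade-off can be sketched with a rough cost model in Python. This is a simplification for illustration, not Lucene's on-disk format: it assumes one header byte per block, at most 7 exceptions per PFOR block, and 2 bytes per exception (position plus up to 8 high bits). The function names are hypothetical. {{expand_deltas}} shows the prefix sum that turns decoded deltas back into doc IDs.

```python
from itertools import accumulate


def for_cost_bytes(block):
    """FOR: every value is packed with the width of the largest value."""
    width = max(block).bit_length()
    return 1 + (len(block) * width + 7) // 8  # 1 assumed header byte


def pfor_cost_bytes(block, max_exceptions=7):
    """PFOR: pack with a narrower width and patch the few values that overflow.

    Assumes len(block) > max_exceptions.
    """
    ordered = sorted(block)
    # Largest value that must fit without a patch.
    unpatched_max = ordered[-(max_exceptions + 1)]
    # Keep the overflow of any exception within 8 bits (assumed layout).
    width = max(unpatched_max.bit_length(), max(block).bit_length() - 8)
    exceptions = sum(1 for v in block if v.bit_length() > width)
    packed = (len(block) * width + 7) // 8
    return 1 + packed + 2 * exceptions  # header + packed values + patches


def expand_deltas(base, deltas):
    """Prefix sum: rebuild absolute doc IDs from a base doc ID and deltas."""
    return list(accumulate(deltas, initial=base))[1:]


if __name__ == "__main__":
    block = [3] * 127 + [300]  # one outlier forces a wide FOR block
    print("FOR bytes: ", for_cost_bytes(block))
    print("PFOR bytes:", pfor_cost_bytes(block))
    print(expand_deltas(10, [1, 2, 3]))
```

In this model a single large delta in an otherwise small block is where PFOR saves space. The decode-time cost of patching exceptions before the prefix sum is not modeled here and would need to be benchmarked.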