[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308723#comment-17308723
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
Also, here's the direct impact on bits-per-value for the EN Wikipedia index
generated by the benchmark "-source wikimediumall" task. Looks like a ~10%
reduction in the doc ID blocks (25746066 bytes down to 23091702). This is a
little naive, though, since it doesn't account for the extra storage PFOR
needs for its exceptions, but it helps illustrate the difference that's
creating the 2% index size reduction.
{code:java}
BASELINE (FOR)
DOC ID BPV
0 **** [6.86 pct] (1529467 of 22287406)
1 * [0.00 pct] (91 of 22287406)
2 * [0.60 pct] (133848 of 22287406)
3 ** [2.09 pct] (466022 of 22287406)
4 ** [3.06 pct] (683006 of 22287406)
5 *** [4.44 pct] (990644 of 22287406)
6 *** [5.86 pct] (1305537 of 22287406)
7 ***** [8.38 pct] (1867660 of 22287406)
8 ***** [9.92 pct] (2211136 of 22287406)
9 ****** [10.79 pct] (2405504 of 22287406)
10 ***** [9.77 pct] (2178356 of 22287406)
11 ***** [8.61 pct] (1919968 of 22287406)
12 **** [7.63 pct] (1701251 of 22287406)
13 **** [6.40 pct] (1426872 of 22287406)
14 *** [4.94 pct] (1101624 of 22287406)
15 ** [3.62 pct] (806380 of 22287406)
16 ** [2.62 pct] (583235 of 22287406)
17 * [1.83 pct] (407402 of 22287406)
18 * [1.28 pct] (285690 of 22287406)
19 * [0.78 pct] (172866 of 22287406)
20 * [0.27 pct] (59108 of 22287406)
21 * [0.12 pct] (26582 of 22287406)
22 * [0.08 pct] (17481 of 22287406)
23 * [0.03 pct] (7676 of 22287406)
24 [0.00 pct] (0 of 22287406)
25 [0.00 pct] (0 of 22287406)
26 [0.00 pct] (0 of 22287406)
27 [0.00 pct] (0 of 22287406)
28 [0.00 pct] (0 of 22287406)
29 [0.00 pct] (0 of 22287406)
30 [0.00 pct] (0 of 22287406)
31 [0.00 pct] (0 of 22287406)
Total bytes used: 25746066
CANDIDATE (PFOR)
DOC ID BPV
0 **** [7.06 pct] (1573609 of 22287406)
1 * [0.62 pct] (139054 of 22287406)
2 ** [2.12 pct] (471777 of 22287406)
3 ** [3.70 pct] (824652 of 22287406)
4 *** [4.95 pct] (1102450 of 22287406)
5 *** [5.68 pct] (1266069 of 22287406)
6 **** [7.95 pct] (1772639 of 22287406)
7 ***** [9.86 pct] (2197883 of 22287406)
8 ***** [9.92 pct] (2211276 of 22287406)
9 ***** [9.25 pct] (2061395 of 22287406)
10 ***** [8.53 pct] (1902012 of 22287406)
11 **** [7.68 pct] (1710722 of 22287406)
12 **** [6.41 pct] (1427739 of 22287406)
13 *** [5.01 pct] (1117073 of 22287406)
14 ** [3.89 pct] (866890 of 22287406)
15 ** [2.81 pct] (627122 of 22287406)
16 * [2.00 pct] (444684 of 22287406)
17 * [1.38 pct] (308501 of 22287406)
18 * [0.91 pct] (203542 of 22287406)
19 * [0.24 pct] (52612 of 22287406)
20 * [0.03 pct] (5689 of 22287406)
21 * [0.00 pct] (16 of 22287406)
22 [0.00 pct] (0 of 22287406)
23 [0.00 pct] (0 of 22287406)
24 [0.00 pct] (0 of 22287406)
25 [0.00 pct] (0 of 22287406)
26 [0.00 pct] (0 of 22287406)
27 [0.00 pct] (0 of 22287406)
28 [0.00 pct] (0 of 22287406)
29 [0.00 pct] (0 of 22287406)
30 [0.00 pct] (0 of 22287406)
31 [0.00 pct] (0 of 22287406)
Total bytes used: 23091702
{code}
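To make the caveat about exception storage concrete, here is a minimal
sketch of how the per-block bit width might be chosen under FOR versus a
simplified PFOR, and what each block would then cost in bytes. It is not
Lucene's actual encoder: the class name PforSketch, the block size of 128,
the limit of 7 exceptions, and the 2-bytes-per-exception cost are all
assumptions made for illustration, and any per-block header is ignored.

```java
import java.util.Arrays;

public class PforSketch {
    // Assumed limit on outliers patched per block (illustrative only).
    static final int MAX_EXCEPTIONS = 7;
    // Assumed cost per exception: one byte for its position, one for its high bits.
    static final int BYTES_PER_EXCEPTION = 2;

    /** Bits needed to represent v; 0 when v == 0. */
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    /** FOR: the width is dictated by the largest delta in the block. */
    static int forBits(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    /** Simplified PFOR: ignore the MAX_EXCEPTIONS largest deltas when picking the width. */
    static int pforBits(long[] deltas) {
        if (deltas.length == 0) {
            return 0;
        }
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int idx = Math.max(0, sorted.length - 1 - MAX_EXCEPTIONS);
        return bitsRequired(sorted[idx]);
    }

    /** Number of deltas that do not fit in the given width. */
    static int exceptionCount(long[] deltas, int bits) {
        int count = 0;
        for (long d : deltas) {
            if (bitsRequired(d) > bits) {
                count++;
            }
        }
        return count;
    }

    /** Packed size of the block under FOR, rounded up to whole bytes. */
    static int forBytes(long[] deltas) {
        return (deltas.length * forBits(deltas) + 7) / 8;
    }

    /** Packed size under the simplified PFOR, including the exception overhead. */
    static int pforBytes(long[] deltas) {
        int bits = pforBits(deltas);
        int packed = (deltas.length * bits + 7) / 8;
        return packed + BYTES_PER_EXCEPTION * exceptionCount(deltas, bits);
    }

    public static void main(String[] args) {
        long[] deltas = new long[128];
        Arrays.fill(deltas, 5L);   // most gaps are small
        deltas[10] = 1000L;        // two outliers force a wide FOR block
        deltas[90] = 1000L;
        System.out.println("FOR:  " + forBits(deltas) + " bpv, " + forBytes(deltas) + " bytes");
        System.out.println("PFOR: " + pforBits(deltas) + " bpv, " + pforBytes(deltas) + " bytes");
    }
}
```

In this toy block the two outliers push FOR to 10 bits per value (160
bytes), while the simplified PFOR packs at 3 bits per value and pays 4
extra bytes for the exceptions (52 bytes total). A histogram of
bits-per-value alone would only show the 10 vs. 3, which is why the
exception bytes need to be counted separately.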
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
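For readers unfamiliar with the "deltas" mentioned in the description: doc
IDs in a postings list are sorted, so they can be stored as the gaps
between consecutive IDs and recovered with a running sum. The sketch below
shows only that basic round trip. The class and method names are
hypothetical, and it does not reproduce whatever optimized expansion the
FOR code path uses.

```java
public class DocIdDeltas {
    /** Converts sorted doc IDs into gaps; the first gap is relative to base. */
    static int[] encode(int[] docIds, int base) {
        int[] deltas = new int[docIds.length];
        int prev = base;
        for (int i = 0; i < docIds.length; i++) {
            deltas[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return deltas;
    }

    /** Expands gaps back into absolute doc IDs with a running (prefix) sum. */
    static int[] decode(int[] deltas, int base) {
        int[] docIds = new int[deltas.length];
        int current = base;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            docIds[i] = current;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 20};
        int[] deltas = encode(docIds, 0);
        System.out.println(java.util.Arrays.toString(deltas));
        System.out.println(java.util.Arrays.toString(decode(deltas, 0)));
    }
}
```

The running sum is the step that has to happen on every decoded block,
which is why how cheaply it can be applied matters for query performance.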