[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312792#comment-17312792 ]
Greg Miller commented on LUCENE-9850:
-------------------------------------
I gave this benchmark another run now that PFOR has been updated from 3
allowable exceptions to 7. As expected, the index size reduction is further
improved, but the QPS regressions appear to get worse. Here's what I see:
Note: Still using "-source wikimediumall" (wikimedium.10M.nostopwords.tasks).
The doc ID payload portion of the index is reduced 11.9% (~3.3GB -> ~2.9GB).
The overall index is reduced 3.3% (~11.6GB -> ~11.2GB).
{code:java}
BASELINE
DOC ID BPV
 0 ****   [6.86 pct]  (1529467 of 22287406)
 1 *      [0.00 pct]  (91 of 22287406)
 2 *      [0.60 pct]  (133848 of 22287406)
 3 **     [2.09 pct]  (466022 of 22287406)
 4 **     [3.06 pct]  (683006 of 22287406)
 5 ***    [4.44 pct]  (990644 of 22287406)
 6 ***    [5.86 pct]  (1305537 of 22287406)
 7 *****  [8.38 pct]  (1867660 of 22287406)
 8 *****  [9.92 pct]  (2211136 of 22287406)
 9 ****** [10.79 pct] (2405504 of 22287406)
10 *****  [9.77 pct]  (2178356 of 22287406)
11 *****  [8.61 pct]  (1919968 of 22287406)
12 ****   [7.63 pct]  (1701251 of 22287406)
13 ****   [6.40 pct]  (1426872 of 22287406)
14 ***    [4.94 pct]  (1101624 of 22287406)
15 **     [3.62 pct]  (806380 of 22287406)
16 **     [2.62 pct]  (583235 of 22287406)
17 *      [1.83 pct]  (407402 of 22287406)
18 *      [1.28 pct]  (285690 of 22287406)
19 *      [0.78 pct]  (172866 of 22287406)
20 *      [0.27 pct]  (59108 of 22287406)
21 *      [0.12 pct]  (26582 of 22287406)
22 *      [0.08 pct]  (17481 of 22287406)
23 *      [0.03 pct]  (7676 of 22287406)
24        [0.00 pct]  (0 of 22287406)
25        [0.00 pct]  (0 of 22287406)
26        [0.00 pct]  (0 of 22287406)
27        [0.00 pct]  (0 of 22287406)
28        [0.00 pct]  (0 of 22287406)
29        [0.00 pct]  (0 of 22287406)
30        [0.00 pct]  (0 of 22287406)
31        [0.00 pct]  (0 of 22287406)
Total bytes used: 3295496560

NEW CANDIDATE (PFOR doc IDs with 7 exceptions)
DOC ID BPV
 0 ****   [7.07 pct]  (1576532 of 22287406)
 1 *      [1.44 pct]  (321744 of 22287406)
 2 **     [3.74 pct]  (834608 of 22287406)
 3 ***    [4.58 pct]  (1019776 of 22287406)
 4 ***    [5.70 pct]  (1271157 of 22287406)
 5 ****   [6.56 pct]  (1463046 of 22287406)
 6 *****  [9.28 pct]  (2068438 of 22287406)
 7 *****  [9.71 pct]  (2163462 of 22287406)
 8 *****  [9.41 pct]  (2097645 of 22287406)
 9 *****  [8.58 pct]  (1911927 of 22287406)
10 *****  [8.08 pct]  (1801505 of 22287406)
11 ****   [6.92 pct]  (1542164 of 22287406)
12 ***    [5.52 pct]  (1231201 of 22287406)
13 ***    [4.30 pct]  (957713 of 22287406)
14 **     [3.37 pct]  (750159 of 22287406)
15 **     [2.38 pct]  (531051 of 22287406)
16 *      [1.65 pct]  (367735 of 22287406)
17 *      [1.15 pct]  (255594 of 22287406)
18 *      [0.52 pct]  (116752 of 22287406)
19 *      [0.02 pct]  (5197 of 22287406)
20        [0.00 pct]  (0 of 22287406)
21        [0.00 pct]  (0 of 22287406)
22        [0.00 pct]  (0 of 22287406)
23        [0.00 pct]  (0 of 22287406)
24        [0.00 pct]  (0 of 22287406)
25        [0.00 pct]  (0 of 22287406)
26        [0.00 pct]  (0 of 22287406)
27        [0.00 pct]  (0 of 22287406)
28        [0.00 pct]  (0 of 22287406)
29        [0.00 pct]  (0 of 22287406)
30        [0.00 pct]  (0 of 22287406)
31        [0.00 pct]  (0 of 22287406)
Total bytes used: 2904198119
{code}
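As a sanity check, the 11.9% doc ID payload reduction can be recomputed from the two "Total bytes used" lines in the histograms. This is only a minimal sketch; the class and method names are made up for illustration and are not part of the benchmark tooling.

```java
import java.util.Locale;

class SizeReductionSketch {
    /** Percentage by which candidateBytes is smaller than baselineBytes. */
    static double percentReduction(long baselineBytes, long candidateBytes) {
        return 100.0 * (baselineBytes - candidateBytes) / baselineBytes;
    }

    public static void main(String[] args) {
        // Totals copied from the BASELINE and NEW CANDIDATE histograms.
        long baseline = 3_295_496_560L;
        long candidate = 2_904_198_119L;
        System.out.println(
            String.format(Locale.ROOT, "%.1f%%", percentReduction(baseline, candidate)));
    }
}
```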
The QPS regressions are as follows:
{code:java}
Task                       QPS baseline   StdDev  QPS pfordocids   StdDev  Pct diff                 p-value
Prefix3                          163.80  (13.3%)          145.05   (8.8%)    -11.4%  ( -29% - 12%)  0.001
AndHighMed                        55.87   (4.5%)           51.35   (2.6%)     -8.1%  ( -14% - 0%)   0.000
LowSpanNear                        8.15   (1.8%)            7.69   (1.8%)     -5.6%  ( -8% - -2%)   0.000
OrNotHighMed                     511.04   (7.0%)          484.78   (5.0%)     -5.1%  ( -16% - 7%)   0.008
AndHighLow                       295.02   (3.5%)          279.93   (3.1%)     -5.1%  ( -11% - 1%)   0.000
OrNotHighLow                     516.68   (6.4%)          491.41   (4.7%)     -4.9%  ( -15% - 6%)   0.006
HighSpanNear                      12.33   (2.0%)           11.74   (1.6%)     -4.7%  ( -8% - -1%)   0.000
OrNotHighHigh                    398.33   (6.7%)          381.31   (6.8%)     -4.3%  ( -16% - 9%)   0.046
MedSpanNear                        7.42   (2.0%)            7.14   (2.2%)     -3.8%  ( -7% - 0%)    0.000
Wildcard                         148.87  (11.6%)          143.67  (10.4%)     -3.5%  ( -22% - 20%)  0.315
HighTermMonthSort                 35.65  (15.2%)           34.48  (12.2%)     -3.3%  ( -26% - 28%)  0.454
AndHighHigh                       17.88   (2.7%)           17.32   (2.9%)     -3.1%  ( -8% - 2%)    0.000
MedPhrase                         11.16   (4.2%)           10.83   (3.2%)     -3.0%  ( -9% - 4%)    0.013
TermDTSort                        41.80  (14.5%)           40.67  (12.0%)     -2.7%  ( -25% - 27%)  0.522
LowPhrase                         38.27   (5.0%)           37.26   (4.5%)     -2.6%  ( -11% - 7%)   0.082
OrHighNotHigh                    553.81   (8.1%)          541.01   (7.5%)     -2.3%  ( -16% - 14%)  0.347
MedSloppyPhrase                    7.30   (2.1%)            7.14   (3.1%)     -2.2%  ( -7% - 3%)    0.008
OrHighMed                         40.15   (3.4%)           39.27   (2.8%)     -2.2%  ( -8% - 4%)    0.027
OrHighHigh                         7.29   (2.6%)            7.13   (2.8%)     -2.2%  ( -7% - 3%)    0.011
OrHighLow                        166.87   (5.1%)          163.23   (4.3%)     -2.2%  ( -11% - 7%)   0.145
HighTermTitleBDVSort              18.65  (10.8%)           18.25  (12.6%)     -2.1%  ( -22% - 23%)  0.569
HighTermDayOfYearSort             30.11  (12.3%)           29.57  (11.8%)     -1.8%  ( -22% - 25%)  0.641
HighSloppyPhrase                   5.12   (2.4%)            5.03   (3.6%)     -1.7%  ( -7% - 4%)    0.079
HighPhrase                       113.33   (6.4%)          111.61   (6.0%)     -1.5%  ( -13% - 11%)  0.437
Fuzzy2                            36.81   (7.1%)           36.33   (7.9%)     -1.3%  ( -15% - 14%)  0.584
HighIntervalsOrdered               8.65   (1.5%)            8.54   (1.7%)     -1.2%  ( -4% - 2%)    0.016
LowSloppyPhrase                   68.84   (1.7%)           68.10   (2.4%)     -1.1%  ( -5% - 3%)    0.101
OrHighNotLow                     517.95   (8.7%)          514.80   (7.0%)     -0.6%  ( -15% - 16%)  0.807
MedTerm                          907.88   (6.4%)          902.40   (7.2%)     -0.6%  ( -13% - 13%)  0.779
Respell                           31.17   (2.8%)           31.10   (2.7%)     -0.2%  ( -5% - 5%)    0.785
BrowseMonthSSDVFacets              3.15   (1.6%)            3.14   (1.1%)     -0.2%  ( -2% - 2%)    0.662
BrowseMonthTaxoFacets              1.40   (1.7%)            1.40   (1.2%)      0.2%  ( -2% - 3%)    0.694
HighTerm                         713.93   (4.2%)          715.28   (4.6%)      0.2%  ( -8% - 9%)    0.893
OrHighNotMed                     445.75   (7.5%)          446.79   (9.4%)      0.2%  ( -15% - 18%)  0.931
BrowseDayOfYearTaxoFacets          1.21   (2.7%)            1.21   (2.6%)      0.3%  ( -4% - 5%)    0.761
BrowseDateTaxoFacets               1.21   (2.5%)            1.22   (2.4%)      0.3%  ( -4% - 5%)    0.710
BrowseDayOfYearSSDVFacets          2.89   (1.5%)            2.90   (1.1%)      0.4%  ( -2% - 2%)    0.352
PKLookup                         128.08   (5.8%)          128.60   (5.7%)      0.4%  ( -10% - 12%)  0.822
IntNRQ                            15.95  (18.3%)           16.05  (18.9%)      0.6%  ( -30% - 46%)  0.914
LowTerm                          906.79   (5.0%)          925.27   (4.9%)      2.0%  ( -7% - 12%)   0.193
Fuzzy1                            38.59   (6.4%)           39.40   (6.5%)      2.1%  ( -10% - 16%)  0.301
{code}
I'd love to find a way to cut down on this QPS regression since there's a
decent index size reduction to be had here. I'll have to see if I can figure
out any way to further optimize this.
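
For readers less familiar with the trade-off, the shift toward lower bits-per-value in the histograms can be illustrated with a small sketch. It compares the bit width FOR would pick for a block of deltas (driven by the largest value) with the width a PFOR-style encoder could pick if it may patch a limited number of outliers. This is a simplified illustration, not Lucene's actual encoder: the class and method names are hypothetical, and the rule that each exception's extra high bits must fit in one byte is an assumption made here for the sake of the example.

```java
import java.util.Arrays;

class PforBitWidthSketch {
    /** Number of bits needed to represent a non-negative value (0 for value 0). */
    static int bitsRequired(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }

    /** FOR: every value in the block is packed at the width of the largest value. */
    static int forBitsPerValue(long[] deltas) {
        long max = 0;
        for (long d : deltas) {
            max = Math.max(max, d);
        }
        return bitsRequired(max);
    }

    /**
     * PFOR-style width: the largest maxExceptions values may be patched separately,
     * so the width is chosen from the largest remaining value. Expects a non-empty block.
     */
    static int pforBitsPerValue(long[] deltas, int maxExceptions) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int lastNonException = Math.max(0, sorted.length - 1 - maxExceptions);
        int widthIgnoringExceptions = bitsRequired(sorted[lastNonException]);
        int maxWidth = bitsRequired(sorted[sorted.length - 1]);
        // Assumption for this sketch: an exception stores at most 8 extra high bits.
        return Math.max(widthIgnoringExceptions, maxWidth - 8);
    }

    public static void main(String[] args) {
        // Hypothetical block: 123 small deltas plus five large outliers.
        long[] deltas = new long[128];
        Arrays.fill(deltas, 5L);
        deltas[3] = 300;
        deltas[17] = 400;
        deltas[42] = 600;
        deltas[77] = 700;
        deltas[100] = 1000;

        System.out.println("FOR bits per value:             " + forBitsPerValue(deltas));
        System.out.println("PFOR (3 exceptions) bits/value: " + pforBitsPerValue(deltas, 3));
        System.out.println("PFOR (7 exceptions) bits/value: " + pforBitsPerValue(deltas, 7));
    }
}
```

In this made-up block, a limit of 3 exceptions cannot absorb all five outliers, so the width stays close to what FOR needs, while a limit of 7 lets the block drop to the width of the small deltas. The sketch says nothing about decode cost, which is where the QPS regression comes from.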
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.