[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308216#comment-17308216
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
I ran a luceneutil benchmark comparing my PFOR approach for encoding doc ID
deltas (available
[here|https://github.com/gsmiller/lucene/tree/LUCENE-9850/pfordocids]) against
the main branch, using the "wikimediumall" source. The results are below. This
is the first luceneutil benchmark I've run, so I'm still getting familiar with
the tool and with interpreting its output. If I'm reading these results
correctly, using PFOR instead of FOR carries a fairly material performance
penalty, with most tasks showing a QPS drop of roughly 1-5%, but I'd be curious
what other, more experienced folks see in these results. I'll see if I can get
some figures on the index size difference as well, but I'm not sure there's a
good path forward here given these QPS results.
{code:java}
Task                      QPS baseline   StdDev  QPS pfor doc ids   StdDev  Pct diff                  p-value
TermDTSort                       38.02  (11.8%)             36.08   (8.9%)     -5.1%  ( -23% - 17%)   0.123
OrNotHighLow                    488.43   (5.8%)            466.01   (6.2%)     -4.6%  ( -15% - 7%)    0.016
HighTerm                       1276.94   (5.0%)           1222.31   (5.3%)     -4.3%  ( -13% - 6%)    0.009
HighTermDayOfYearSort            51.64  (11.6%)             49.66   (8.0%)     -3.8%  ( -20% - 17%)   0.223
HighTermMonthSort                59.36  (10.6%)             57.09  (11.4%)     -3.8%  ( -23% - 20%)   0.272
HighTermTitleBDVSort             36.61  (16.4%)             35.27  (19.2%)     -3.7%  ( -33% - 38%)   0.517
AndHighHigh                      11.06   (3.7%)             10.67   (3.0%)     -3.5%  ( -9% - 3%)     0.001
OrHighNotHigh                   568.03  (10.4%)            548.46   (7.7%)     -3.4%  ( -19% - 16%)   0.233
OrHighLow                       261.36   (3.9%)            252.58   (3.7%)     -3.4%  ( -10% - 4%)    0.005
AndHighMed                       82.45   (3.1%)             79.71   (3.1%)     -3.3%  ( -9% - 2%)     0.001
MedPhrase                        40.33   (5.4%)             39.02   (4.7%)     -3.2%  ( -12% - 7%)    0.043
Wildcard                         25.19   (2.8%)             24.46   (2.7%)     -2.9%  ( -8% - 2%)     0.001
LowSpanNear                       5.52   (2.0%)              5.36   (2.3%)     -2.9%  ( -6% - 1%)     0.000
AndHighLow                      203.23   (2.9%)            197.52   (2.6%)     -2.8%  ( -8% - 2%)     0.001
OrHighMed                        19.99   (2.0%)             19.43   (2.1%)     -2.8%  ( -6% - 1%)     0.000
MedTerm                         829.73   (6.4%)            807.65   (5.1%)     -2.7%  ( -13% - 9%)    0.144
OrHighNotLow                    482.63   (4.8%)            469.91   (5.5%)     -2.6%  ( -12% - 8%)    0.105
OrHighHigh                        9.20   (2.0%)              8.97   (2.3%)     -2.5%  ( -6% - 1%)     0.000
LowPhrase                        16.16   (3.3%)             15.76   (2.7%)     -2.5%  ( -8% - 3%)     0.009
MedSpanNear                       3.14   (2.1%)              3.07   (2.3%)     -2.3%  ( -6% - 2%)     0.001
Prefix3                         121.86   (8.5%)            119.12   (6.5%)     -2.2%  ( -15% - 13%)   0.349
OrNotHighMed                    477.93   (6.0%)            467.27   (6.7%)     -2.2%  ( -14% - 11%)   0.268
HighSpanNear                      9.24   (2.2%)              9.05   (2.1%)     -2.0%  ( -6% - 2%)     0.004
MedSloppyPhrase                  16.95   (2.9%)             16.67   (3.0%)     -1.7%  ( -7% - 4%)     0.069
IntNRQ                           49.47   (2.6%)             48.88   (1.6%)     -1.2%  ( -5% - 3%)     0.087
LowSloppyPhrase                  30.67   (2.7%)             30.33   (2.8%)     -1.1%  ( -6% - 4%)     0.198
LowTerm                         984.89   (4.7%)            973.96   (3.1%)     -1.1%  ( -8% - 7%)     0.380
OrNotHighHigh                   476.25   (8.3%)            471.56   (7.9%)     -1.0%  ( -15% - 16%)   0.701
HighIntervalsOrdered              4.20   (2.2%)              4.18   (2.4%)     -0.7%  ( -5% - 3%)     0.347
OrHighNotMed                    445.69   (5.1%)            443.29   (5.6%)     -0.5%  ( -10% - 10%)   0.750
BrowseMonthTaxoFacets             1.41   (1.2%)              1.41   (1.4%)     -0.3%  ( -2% - 2%)     0.427
PKLookup                        127.78   (3.2%)            127.46   (2.9%)     -0.3%  ( -6% - 6%)     0.794
BrowseDayOfYearTaxoFacets         1.22   (2.1%)              1.22   (2.1%)     -0.2%  ( -4% - 4%)     0.735
BrowseDateTaxoFacets              1.23   (2.0%)              1.23   (2.0%)     -0.2%  ( -4% - 3%)     0.782
BrowseDayOfYearSSDVFacets         2.90   (0.9%)              2.90   (1.0%)     -0.1%  ( -1% - 1%)     0.685
BrowseMonthSSDVFacets             3.15   (1.0%)              3.15   (1.1%)      0.0%  ( -2% - 2%)     0.903
HighSloppyPhrase                  7.89   (5.5%)              7.91   (4.4%)      0.2%  ( -9% - 10%)    0.876
Fuzzy2                           34.20   (9.0%)             34.31   (8.4%)      0.3%  ( -15% - 19%)   0.909
Fuzzy1                           44.78   (6.2%)             44.95   (6.0%)      0.4%  ( -11% - 13%)   0.851
Respell                          21.07   (2.5%)             21.16   (2.4%)      0.5%  ( -4% - 5%)     0.552
HighPhrase                      274.21   (5.3%)            279.29   (5.3%)      1.9%  ( -8% - 13%)    0.269
{code}
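For anyone else reading the table, the Pct diff column appears to be the relative change in mean QPS of the candidate ("pfor doc ids") versus the baseline. The sketch below is only an illustration and not part of luceneutil: the class and method names are hypothetical, and the formula is an assumption that happens to match the rows above (for example, it gives about -5.1% for TermDTSort and about 1.9% for HighPhrase).

```java
import java.util.Locale;

// Hypothetical helper, not part of luceneutil.
final class PctDiffSketch {
    // Assumed formula: relative change of candidate QPS versus baseline QPS, in percent.
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // Mean QPS pairs copied from the benchmark table.
        System.out.println(String.format(Locale.ROOT, "TermDTSort %.1f%%", pctDiff(38.02, 36.08)));
        System.out.println(String.format(Locale.ROOT, "HighPhrase %.1f%%", pctDiff(274.21, 279.29)));
    }
}
```

A negative value means the candidate is slower than the baseline; the p-value column then indicates how likely a difference of that size is to be noise.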
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
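To make the size-versus-speed trade-off in the description a bit more concrete, here is a rough sketch. It uses a deliberately simplified cost model and is not Lucene's actual on-disk format: the class name, the exception limit, and the per-exception cost (8 bits for a position plus the dropped high bits) are all assumptions for illustration. FOR packs every delta in a block with the width of the largest one, while a PFOR-style scheme picks a narrower width and patches a few outliers separately. Either way, the decoder still has to turn deltas back into doc IDs with a running sum.

```java
import java.util.Arrays;

// Illustrative only: a simplified cost model, not Lucene's ForUtil/PForUtil.
final class ForVsPforSketch {
    // Bits needed to represent a non-negative value (0 needs 0 bits here).
    static int bitsRequired(int v) {
        return 32 - Integer.numberOfLeadingZeros(v);
    }

    // FOR: every delta in the block uses the width of the largest delta.
    static int forBits(int[] deltas) {
        int max = 0;
        for (int d : deltas) max = Math.max(max, d);
        return deltas.length * bitsRequired(max);
    }

    // PFOR-style: try patching the k largest deltas as exceptions and keep the cheapest.
    // Assumed exception cost: 8 bits for the position plus the high bits that were dropped.
    // Assumes a non-empty block.
    static int pforBits(int[] deltas, int maxExceptions) {
        int[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int maxBits = bitsRequired(sorted[n - 1]);
        int best = n * maxBits; // zero exceptions is the same as FOR
        for (int k = 1; k <= Math.min(maxExceptions, n - 1); k++) {
            int width = bitsRequired(sorted[n - 1 - k]);
            int cost = n * width + k * (8 + (maxBits - width));
            best = Math.min(best, cost);
        }
        return best;
    }

    // Expanding deltas back into doc IDs is a prefix sum starting from the previous doc ID.
    static int[] toDocIds(int previousDocId, int[] deltas) {
        int[] docIds = new int[deltas.length];
        int running = previousDocId;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            docIds[i] = running;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] deltas = {3, 1, 2, 4, 1, 2, 1000, 3}; // one outlier inflates the FOR width
        System.out.println("FOR bits:  " + forBits(deltas));
        System.out.println("PFOR bits: " + pforBits(deltas, 2));
        System.out.println("doc IDs:   " + Arrays.toString(toDocIds(0, deltas)));
    }
}
```

In this toy block a single large delta forces FOR to spend 10 bits on every value, whereas patching that one outlier roughly halves the estimated size. The extra patching step on decode is the kind of cost the benchmark is trying to measure.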