[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309120#comment-17309120
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
Ok, not as bad with some more optimizations in place (thanks [~jpountz]!), but
still a regression. Here's what I'm seeing (still with "-source wikimediumall"
as before):
{code:java}
Task                      QPS baseline   StdDev QPS pfor 7 exceptions   StdDev Pct diff                 p-value
AndHighMed                       40.34   (2.6%)                 38.83   (2.1%)    -3.7% (  -8% -    0%)   0.000
Prefix3                          13.95   (1.6%)                 13.55   (1.8%)    -2.9% (  -6% -    0%)   0.000
OrHighMed                        42.52   (2.5%)                 41.33   (3.7%)    -2.8% (  -8% -    3%)   0.004
OrHighLow                       249.72   (3.8%)                242.87   (4.8%)    -2.7% ( -10% -    6%)   0.046
AndHighLow                      320.47   (3.7%)                311.87   (4.0%)    -2.7% ( -10% -    5%)   0.028
LowPhrase                        15.24   (2.3%)                 14.88   (1.8%)    -2.3% (  -6% -    1%)   0.000
OrNotHighLow                    459.84   (4.1%)                449.82   (4.2%)    -2.2% ( -10% -    6%)   0.094
MedTerm                         975.99   (4.2%)                954.87   (4.1%)    -2.2% ( -10% -    6%)   0.101
OrNotHighHigh                   380.66   (4.2%)                372.45   (5.6%)    -2.2% ( -11% -    7%)   0.167
OrHighNotLow                    494.46   (4.6%)                484.75   (5.8%)    -2.0% ( -11% -    8%)   0.234
Wildcard                         57.07   (1.9%)                 56.04   (1.4%)    -1.8% (  -5% -    1%)   0.001
OrHighNotHigh                   422.27   (5.4%)                414.76   (3.8%)    -1.8% ( -10% -    7%)   0.227
OrHighHigh                       13.69   (1.8%)                 13.47   (3.5%)    -1.6% (  -6% -    3%)   0.065
LowSloppyPhrase                  15.05   (3.5%)                 14.82   (4.0%)    -1.5% (  -8% -    6%)   0.199
Fuzzy2                           22.48   (5.7%)                 22.15   (4.7%)    -1.5% ( -11% -    9%)   0.376
OrNotHighMed                    454.42   (4.8%)                447.81   (5.3%)    -1.5% ( -11% -    9%)   0.362
TermDTSort                       43.90  (11.6%)                 43.27  (10.4%)    -1.4% ( -21% -   23%)   0.678
LowSpanNear                       4.39   (2.6%)                  4.32   (1.9%)    -1.4% (  -5% -    3%)   0.050
HighSloppyPhrase                  2.77   (3.2%)                  2.73   (3.2%)    -1.2% (  -7% -    5%)   0.251
HighTermDayOfYearSort             6.33  (13.6%)                  6.26  (13.3%)    -1.0% ( -24% -   29%)   0.806
HighIntervalsOrdered              1.08   (0.9%)                  1.07   (1.2%)    -1.0% (  -3% -    1%)   0.003
AndHighHigh                      40.58   (3.2%)                 40.22   (3.1%)    -0.9% (  -6% -    5%)   0.378
HighTerm                        792.80   (5.1%)                789.86   (5.0%)    -0.4% (  -9% -   10%)   0.816
OrHighNotMed                    509.78   (6.4%)                508.18   (5.5%)    -0.3% ( -11% -   12%)   0.868
MedSpanNear                       4.96   (2.1%)                  4.95   (1.5%)    -0.3% (  -3% -    3%)   0.666
MedPhrase                        81.04   (1.8%)                 80.85   (3.0%)    -0.2% (  -4% -    4%)   0.763
MedSloppyPhrase                   9.10   (3.8%)                  9.08   (3.6%)    -0.2% (  -7% -    7%)   0.851
IntNRQ                           19.09   (0.6%)                 19.05   (0.8%)    -0.2% (  -1% -    1%)   0.367
HighTermTitleBDVSort             34.87  (11.3%)                 34.86  (13.4%)    -0.0% ( -22% -   27%)   0.995
BrowseMonthSSDVFacets             3.14   (1.0%)                  3.14   (1.1%)     0.0% (  -2% -    2%)   0.976
HighTermMonthSort                18.38  (13.3%)                 18.42  (18.3%)     0.2% ( -27% -   36%)   0.969
BrowseDayOfYearSSDVFacets         2.89   (0.9%)                  2.90   (1.1%)     0.2% (  -1% -    2%)   0.492
LowTerm                         969.07   (4.7%)                971.45   (4.1%)     0.2% (  -8% -    9%)   0.860
HighSpanNear                      3.27   (2.1%)                  3.28   (1.8%)     0.3% (  -3% -    4%)   0.608
Respell                          33.76   (1.2%)                 33.94   (1.3%)     0.5% (  -1% -    3%)   0.185
PKLookup                        123.25   (2.7%)                124.06   (3.2%)     0.7% (  -5% -    6%)   0.485
HighPhrase                      218.15   (3.2%)                219.85   (3.0%)     0.8% (  -5% -    7%)   0.428
BrowseMonthTaxoFacets             1.39   (1.7%)                  1.41   (1.7%)     0.9% (  -2% -    4%)   0.110
BrowseDateTaxoFacets              1.20   (2.2%)                  1.22   (2.2%)     1.1% (  -3% -    5%)   0.114
BrowseDayOfYearTaxoFacets         1.20   (2.4%)                  1.21   (2.4%)     1.2% (  -3% -    6%)   0.109
Fuzzy1                           46.25   (8.0%)                 47.43   (9.4%)     2.6% ( -13% -   21%)   0.354
{code}
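For anyone reading the table cold: the "Pct diff" figure appears to be the relative change in mean QPS from the baseline to the PFOR run. The sketch below is only an illustration of that arithmetic on two rows of the table. The class and method names are made up for this example, and it does not try to reproduce how the benchmark tool derives the range or the p-value.

```java
public class PctDiff {
    /** Relative change of the candidate mean QPS versus the baseline mean QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // Mean QPS values copied from the AndHighMed and Fuzzy1 rows of the table.
        System.out.println("AndHighMed: " + pctDiff(40.34, 38.83));
        System.out.println("Fuzzy1:     " + pctDiff(46.25, 47.43));
    }
}
```

Rounded to one decimal place, these two results agree with the -3.7% and 2.6% reported for those rows.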
The modifications on PForUtil this was run with are
[here|https://github.com/apache/lucene/compare/main...gsmiller:LUCENE-9850/pfordocids#diff-9f4cb4a664b2a8f0594b221368085548a58ecb1cc1290f18160b613d400fcc29].
I'll think about whether there are further opportunities to optimize this.
There's a lot of branching in there, but I'm not sure how much of it is
avoidable. I'll put some fresh eyes on it tomorrow.
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
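To make the size side of this trade-off concrete, here is a minimal sketch that estimates the encoded size of one block of deltas under FOR and under a patched (PFOR-style) layout. It is a simplified model, not Lucene's PForUtil: the class and method names are hypothetical, the one-byte header and the two bytes per exception (position plus high bits) are assumptions, and the input block is assumed to be non-empty. The limit of 7 exceptions mirrors the "pfor 7 exceptions" benchmark run.

```java
import java.util.Arrays;

public class ForVsPforSize {
    // Assumption: at most 7 exceptions per block, as in the benchmarked variant.
    static final int MAX_EXCEPTIONS = 7;

    /** Number of bits needed to represent a non-negative value (at least 1). */
    static int bitsRequired(long value) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    }

    /** FOR model: one header byte, then every delta packed at the width of the largest delta. */
    static int forBytes(long[] deltas) {
        long max = Arrays.stream(deltas).max().orElse(0L);
        return 1 + (bitsRequired(max) * deltas.length + 7) / 8;
    }

    /** PFOR model: pack at a narrower width and patch the few larger values separately. */
    static int pforBytes(long[] deltas) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int maxBits = bitsRequired(sorted[sorted.length - 1]);
        // Width that covers everything except the MAX_EXCEPTIONS largest values.
        int cutoff = Math.max(0, sorted.length - 1 - MAX_EXCEPTIONS);
        int patchedBits = bitsRequired(sorted[cutoff]);
        // Assumption: the high bits of an exception must fit in a single byte.
        patchedBits = Math.max(patchedBits, maxBits - 8);
        long maxUnpatched = (1L << patchedBits) - 1;
        int exceptions = 0;
        for (long d : deltas) {
            if (d > maxUnpatched) {
                exceptions++;
            }
        }
        // Header byte + packed low bits + (position, high bits) per exception.
        return 1 + (patchedBits * deltas.length + 7) / 8 + 2 * exceptions;
    }

    public static void main(String[] args) {
        // Made-up block: 125 small deltas and 3 outliers.
        long[] deltas = new long[128];
        Arrays.fill(deltas, 3L);
        deltas[10] = 1000L;
        deltas[50] = 1000L;
        deltas[90] = 1000L;
        System.out.println("FOR bytes:  " + forBytes(deltas));
        System.out.println("PFOR bytes: " + pforBytes(deltas));
    }
}
```

For this made-up block the model gives 161 bytes for FOR and 39 bytes for PFOR, because the three outliers no longer force every delta to 10 bits. When a block has no outliers the two estimates coincide, so any real-world saving depends on how skewed the doc ID deltas actually are.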