[
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313877#comment-17313877
]
Greg Miller commented on LUCENE-9850:
-------------------------------------
I haven't really been following along with what's going on in JDK17, but being
able to more explicitly generate vectorized instructions will be nice!
I optimized my branch a bit further. At this point I think it has all the
optimizations of ForDeltaUtil, and the only extra work is applying the
exceptions. I even pulled in the optimization to apply the prefix to two values
at a time (packed in a long). That code is over
[here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3].
(It's not polished at all... a bit hacky)
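For anyone following along, here is a minimal sketch of the two steps mentioned above: patching the exceptions back in, then doing the prefix sum on two values at a time packed in a long. It is not the code from the branch. The class and method names are hypothetical, the block is shortened to 8 values, and it assumes non-negative deltas whose running sums fit in 32 bits.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code from the linked branch.
final class PackedPrefixSumSketch {

    /** PFOR step: OR each exception's high bits back onto its low-bit value. */
    static void applyExceptions(long[] deltas, int bitsPerValue,
                                int[] positions, long[] highBits) {
        for (int i = 0; i < positions.length; i++) {
            deltas[positions[i]] |= highBits[i] << bitsPerValue;
        }
    }

    /**
     * Turns deltas into absolute doc IDs. Value i goes in the high 32 bits and
     * value i + half in the low 32 bits of one long, so a single add advances
     * two running sums. Assumes an even block length and no 32-bit overflow.
     */
    static long[] prefixSumPacked(long[] deltas, long base) {
        int half = deltas.length / 2;
        long[] packed = new long[half];
        for (int i = 0; i < half; i++) {
            packed[i] = (deltas[i] << 32) | deltas[half + i];
        }
        packed[0] += base << 32; // the base only applies to the first half
        for (int i = 1; i < half; i++) {
            packed[i] += packed[i - 1]; // two prefix sums per addition
        }
        long[] docIds = new long[deltas.length];
        for (int i = 0; i < half; i++) {
            docIds[i] = packed[i] >>> 32;
            docIds[half + i] = packed[i] & 0xFFFFFFFFL;
        }
        long carry = docIds[half - 1]; // last doc ID of the first half
        for (int i = half; i < docIds.length; i++) {
            docIds[i] += carry;
        }
        return docIds;
    }

    public static void main(String[] args) {
        long[] lowBits = {1, 2, 3, 1, 1, 1, 2, 1}; // stored with 2 bits per value
        applyExceptions(lowBits, 2, new int[] {3}, new long[] {2});
        System.out.println(Arrays.toString(prefixSumPacked(lowBits, 10)));
    }
}
```

The exception patching is the only work that a plain FOR decode would not need; the packed prefix sum itself is unchanged once the exceptions have been applied.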
The benchmark results are looking a lot better now, though there may still be
some regressions. I've seen a little variability in these results, so I'm not
sure how often they report false regressions on individual tasks.
Here's what I've got at this point:
{code:java}
Task                      QPS baseline  StdDev  QPS pfordocids  StdDev  Pct diff  (range)  p-value
LowSpanNear                  98.61  (2.2%)     95.57  (1.7%)   -3.1%  ( -6% - 0%)    0.000
OrNotHighHigh               545.01  (3.8%)    531.11  (5.3%)   -2.6%  ( -11% - 6%)   0.078
Wildcard                     40.83  (4.1%)     40.05  (3.9%)   -1.9%  ( -9% - 6%)    0.132
OrHighMed                   102.39  (2.5%)    100.50  (2.5%)   -1.8%  ( -6% - 3%)    0.021
AndHighHigh                  50.93  (3.3%)     50.03  (3.1%)   -1.8%  ( -7% - 4%)    0.079
TermDTSort                   98.42 (11.6%)     96.72 (14.4%)   -1.7%  ( -24% - 27%)  0.676
AndHighMed                   68.10  (2.9%)     66.94  (2.9%)   -1.7%  ( -7% - 4%)    0.063
HighTerm                   1169.43  (4.4%)   1151.70  (5.1%)   -1.5%  ( -10% - 8%)   0.314
BrowseMonthSSDVFacets        12.50  (5.6%)     12.31  (7.7%)   -1.5%  ( -14% - 12%)  0.480
HighTermTitleBDVSort        157.42 (14.7%)    155.08 (15.6%)   -1.5%  ( -27% - 33%)  0.757
OrHighNotLow                545.83  (5.7%)    537.85  (7.0%)   -1.5%  ( -13% - 12%)  0.472
MedSpanNear                  28.75  (2.4%)     28.34  (1.9%)   -1.4%  ( -5% - 2%)    0.038
OrHighNotHigh               533.41  (4.6%)    526.33  (5.3%)   -1.3%  ( -10% - 8%)   0.394
Fuzzy1                       59.47  (6.0%)     58.72  (6.8%)   -1.3%  ( -13% - 12%)  0.533
HighSpanNear                 21.27  (2.6%)     21.03  (2.2%)   -1.1%  ( -5% - 3%)    0.153
HighTermMonthSort           128.50 (12.2%)    127.11 (11.0%)   -1.1%  ( -21% - 25%)  0.769
OrNotHighLow                640.89  (4.0%)    634.43  (3.5%)   -1.0%  ( -8% - 6%)    0.395
OrHighHigh                   21.11  (2.0%)     20.91  (1.8%)   -1.0%  ( -4% - 2%)    0.113
MedPhrase                   103.90  (3.0%)    103.05  (2.9%)   -0.8%  ( -6% - 5%)    0.381
HighPhrase                  172.59  (2.5%)    171.22  (2.5%)   -0.8%  ( -5% - 4%)    0.320
OrHighNotMed                535.67  (4.8%)    531.54  (4.7%)   -0.8%  ( -9% - 9%)    0.607
LowTerm                    1094.41  (2.9%)   1087.97  (3.1%)   -0.6%  ( -6% - 5%)    0.535
MedSloppyPhrase              12.91  (2.4%)     12.85  (2.5%)   -0.5%  ( -5% - 4%)    0.542
IntNRQ                      101.21  (0.5%)    100.81  (0.7%)   -0.4%  ( -1% - 0%)    0.040
PKLookup                    144.62  (3.0%)    144.11  (3.1%)   -0.4%  ( -6% - 5%)    0.715
HighSloppyPhrase              3.75  (2.9%)      3.74  (3.0%)   -0.3%  ( -6% - 5%)    0.726
HighIntervalsOrdered         16.00  (2.1%)     15.95  (1.8%)   -0.3%  ( -4% - 3%)    0.597
HighTermDayOfYearSort       109.37 (11.0%)    109.03 (15.2%)   -0.3%  ( -23% - 29%)  0.941
LowSloppyPhrase              41.05  (1.9%)     40.93  (2.1%)   -0.3%  ( -4% - 3%)    0.635
MedTerm                    1137.13  (4.1%)   1134.84  (4.2%)   -0.2%  ( -8% - 8%)    0.877
BrowseDayOfYearTaxoFacets     4.24  (3.4%)      4.23  (3.2%)   -0.2%  ( -6% - 6%)    0.885
Prefix3                     263.31  (9.1%)    263.08  (9.2%)   -0.1%  ( -16% - 20%)  0.976
BrowseDateTaxoFacets          4.23  (3.4%)      4.23  (3.2%)   -0.1%  ( -6% - 6%)    0.941
Fuzzy2                       44.14 (13.1%)     44.17 (14.2%)    0.1%  ( -24% - 31%)  0.990
BrowseMonthTaxoFacets         5.00  (2.6%)      5.01  (2.2%)    0.1%  ( -4% - 4%)    0.878
OrNotHighMed                508.22  (3.6%)    508.82  (3.9%)    0.1%  ( -7% - 7%)    0.921
Respell                      38.75  (2.3%)     38.82  (2.2%)    0.2%  ( -4% - 4%)    0.791
BrowseDayOfYearSSDVFacets    11.38  (5.7%)     11.42  (5.8%)    0.3%  ( -10% - 12%)  0.855
OrHighLow                   220.12  (4.0%)    220.98  (3.7%)    0.4%  ( -6% - 8%)    0.745
AndHighLow                  620.94  (3.0%)    624.41  (3.3%)    0.6%  ( -5% - 7%)    0.573
LowPhrase                   132.77  (2.1%)    133.52  (2.4%)    0.6%  ( -3% - 5%)    0.433
{code}
> Explore PFOR for Doc ID delta encoding (instead of FOR)
> -------------------------------------------------------
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding.
> Right now PFOR is used for positions, frequencies and payloads, but FOR is
> used for doc ID deltas. From a recent
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
> on the dev mailing list, it sounds like this decision was made based on the
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with
> switching to PFOR compared to the performance reduction we might see by no
> longer being able to apply the deltas in as optimal a way.
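To make the size side of that trade-off concrete, here is a rough sketch that compares the payload bits for one block under FOR (every value stored at the width of the largest) and under PFOR (a narrower width, with the few largest values patched in as exceptions). The class name, the cap on exceptions passed in as a parameter, and the flat cost of 16 bits per exception are illustrative assumptions, and per-block header bytes are ignored, so treat the numbers as an estimate rather than Lucene's actual on-disk accounting.

```java
import java.util.Arrays;

// Hypothetical size estimate; not Lucene's actual encoder.
final class ForVsPforSizeSketch {

    /** Bits needed to represent a non-negative value (0 needs 0 bits). */
    static int bitsRequired(long v) {
        return v == 0 ? 0 : 64 - Long.numberOfLeadingZeros(v);
    }

    /** FOR: every value is stored at the width of the widest value. */
    static int forBits(long[] deltas) {
        long or = 0;
        for (long d : deltas) {
            or |= d;
        }
        return deltas.length * bitsRequired(or);
    }

    /**
     * PFOR: try treating the e largest values as exceptions and keep the
     * cheapest layout. Each exception is assumed to cost 16 extra bits.
     */
    static int pforBits(long[] deltas, int maxExceptions) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int best = forBits(deltas); // zero exceptions is plain FOR
        for (int e = 1; e <= maxExceptions && e < sorted.length; e++) {
            int width = bitsRequired(sorted[sorted.length - 1 - e]);
            int cost = deltas.length * width + e * 16;
            best = Math.min(best, cost);
        }
        return best;
    }

    public static void main(String[] args) {
        long[] deltas = {1, 2, 3, 1, 2, 3, 1, 1000}; // one outlier delta
        System.out.println("FOR bits:  " + forBits(deltas));
        System.out.println("PFOR bits: " + pforBits(deltas, 7));
    }
}
```

A single large delta in an otherwise dense block is the case where PFOR saves the most; blocks without outliers fall back to the plain FOR cost.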