zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1461257536
Thanks @jpountz for the review and comment!
>Did you manage to observe some speedups with this change?
So far I have only been able to run `wikimedium10m`, and the implementation
shows a slowdown of up to about 9% (listed below) for the full-text boolean
queries `OrXXXNotYYYY`, due to the changes in `ReqExclScorer` and
`Lucene90PostingsReader`. The facet tasks don't seem to exercise the changes,
so their differences should just be random fluctuation. I'm still searching
for an existing benchmarking task that can measure these targeted use cases:
> Is it actually common to have long runs of matches? For full-text indexes,
> maybe not so much, only stop words may have runs of adjacent matches. For
> string fields, this may happen if the field has a default value that is the
> value of most documents in the collection. Also it's possible for users to use
> index sorting in order to cluster similar documents together, which increases
> the likelihood to have long runs of adjacent matches.
Do you have any pointers on which benchmark task I could use? If there isn't
one available, I could try to add one next.
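For illustration only, here is a minimal toy sketch of the kind of workload being targeted: an exclusion clause whose matches form long runs of adjacent doc IDs. It does not use Lucene and is not the PR's code. `RunAwareIterator`, `ReqExclSketch`, and `peekNextNonMatchingDocID` are hypothetical names chosen for this example, and the array-backed iterator only shows the control flow, since its peek is a linear scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy iterator over a sorted int array; hypothetical, NOT Lucene's DocIdSetIterator. */
final class RunAwareIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs;
  private int idx = -1;

  RunAwareIterator(int[] sortedDocs) {
    this.docs = sortedDocs;
  }

  /** Current doc ID, -1 before the first advance, NO_MORE_DOCS once exhausted. */
  int docID() {
    if (idx < 0) return -1;
    return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
  }

  /** Moves to the first doc ID >= target and returns it. */
  int advance(int target) {
    if (idx < 0) idx = 0;
    while (idx < docs.length && docs[idx] < target) idx++;
    return docID();
  }

  /**
   * Hypothetical API: the first doc ID after the current one that is not part
   * of the current run of adjacent matches. Does not move the iterator.
   * Must only be called while positioned on a doc.
   */
  int peekNextNonMatchingDocID() {
    int i = idx;
    while (i + 1 < docs.length && docs[i + 1] == docs[i] + 1) i++;
    return docs[i] + 1;
  }
}

final class ReqExclSketch {
  /** Returns the required docs that are not excluded, skipping whole excluded runs. */
  static List<Integer> reqExcl(int[] required, int[] excluded) {
    RunAwareIterator excl = new RunAwareIterator(excluded);
    List<Integer> kept = new ArrayList<>();
    int runStart = -1; // docs in [runStart, runEnd) are known to be excluded
    int runEnd = -1;
    for (int doc : required) {
      if (doc >= runStart && doc < runEnd) {
        continue; // inside a known run: no need to touch the excluded iterator
      }
      int e = excl.docID() < doc ? excl.advance(doc) : excl.docID();
      if (e == doc) {
        runStart = doc;
        runEnd = excl.peekNextNonMatchingDocID();
        continue;
      }
      kept.add(doc);
    }
    return kept;
  }

  public static void main(String[] args) {
    int[] required = {1, 5, 6, 7, 8, 20, 21};
    int[] excluded = {5, 6, 7, 8, 9, 10, 21};
    System.out.println(reqExcl(required, excluded));
  }
}
```

With these inputs, docs 6, 7 and 8 are rejected from the remembered run `[5, 11)` without advancing the excluded iterator. A benchmark task for this change would need data shaped like this (default field values, or index sorting) for the shortcut to be exercised at all.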
```
Task                          QPS baseline   StdDev   QPS my_modified_version   StdDev   Pct diff             p-value
BrowseRandomLabelTaxoFacets      39.06  (49.1%)     33.42  (45.6%)  -14.4% ( -73% - 157%)  0.335
BrowseDateTaxoFacets             34.22  (29.5%)     30.49  (27.9%)  -10.9% ( -52% - 65%)   0.229
BrowseDayOfYearTaxoFacets        34.30  (29.4%)     30.57  (27.9%)  -10.9% ( -52% - 65%)   0.230
OrNotHighHigh                   426.33   (2.7%)    388.27   (1.6%)   -8.9% ( -12% - -4%)   0.000
OrHighNotHigh                   621.42   (2.7%)    573.32   (2.0%)   -7.7% ( -12% - -3%)   0.000
OrHighNotMed                    616.08   (3.8%)    573.20   (2.7%)   -7.0% ( -13% - 0%)    0.000
OrHighNotLow                    562.98   (4.0%)    525.80   (3.3%)   -6.6% ( -13% - 0%)    0.000
OrNotHighMed                    712.40   (2.6%)    672.88   (2.4%)   -5.5% ( -10% - 0%)    0.000
HighTermTitleBDVSort             19.04   (7.9%)     18.73   (8.3%)   -1.6% ( -16% - 15%)   0.534
HighIntervalsOrdered              1.89  (12.3%)      1.86  (15.0%)   -1.6% ( -25% - 29%)   0.719
BrowseMonthTaxoFacets            32.23  (33.9%)     31.77  (33.6%)   -1.4% ( -51% - 100%)  0.895
OrNotHighLow                   1719.50   (3.8%)   1696.85   (4.6%)   -1.3% ( -9% - 7%)     0.326
HighTermTitleSort               202.79   (2.9%)    200.20   (2.8%)   -1.3% ( -6% - 4%)     0.155
AndHighHigh                      51.74   (5.8%)     51.08   (5.4%)   -1.3% ( -11% - 10%)   0.475
Fuzzy1                           59.04   (2.7%)     58.36   (3.2%)   -1.2% ( -6% - 4%)     0.214
MedTerm                        1364.68   (4.4%)   1349.31   (3.4%)   -1.1% ( -8% - 6%)     0.362
Wildcard                        314.79   (2.8%)    311.35   (3.5%)   -1.1% ( -7% - 5%)     0.277
LowTerm                        2087.86   (3.2%)   2065.24   (3.8%)   -1.1% ( -7% - 6%)     0.334
MedIntervalsOrdered              22.66   (8.6%)     22.42  (10.4%)   -1.0% ( -18% - 19%)   0.730
PKLookup                        331.54   (2.9%)    328.12   (2.6%)   -1.0% ( -6% - 4%)     0.242
LowIntervalsOrdered             161.90   (9.5%)    160.23  (11.6%)   -1.0% ( -20% - 22%)   0.758
Fuzzy2                          100.43   (1.5%)     99.40   (3.0%)   -1.0% ( -5% - 3%)     0.169
Respell                          88.01   (2.0%)     87.27   (2.4%)   -0.8% ( -5% - 3%)     0.223
BrowseDateSSDVFacets              4.89  (21.4%)      4.85  (20.2%)   -0.8% ( -34% - 51%)   0.905
BrowseRandomLabelSSDVFacets      19.23   (7.1%)     19.09   (6.2%)   -0.7% ( -13% - 13%)   0.728
AndHighMed                      114.49   (5.6%)    113.77   (5.0%)   -0.6% ( -10% - 10%)   0.708
Prefix3                         376.91   (1.4%)    374.65   (2.5%)   -0.6% ( -4% - 3%)     0.348
HighTermMonthSort              4250.83   (4.2%)   4227.31   (3.6%)   -0.6% ( -7% - 7%)     0.653
OrHighMed                       209.50   (6.2%)    208.61   (3.4%)   -0.4% ( -9% - 9%)     0.787
LowPhrase                        89.33   (3.0%)     88.96   (2.2%)   -0.4% ( -5% - 4%)     0.623
BrowseDayOfYearSSDVFacets        24.82  (10.9%)     24.75  (11.2%)   -0.3% ( -20% - 24%)   0.940
AndHighMedDayTaxoFacets         158.35   (1.6%)    158.08   (1.8%)   -0.2% ( -3% - 3%)     0.756
HighTerm                       2076.24   (3.7%)   2074.83   (2.9%)   -0.1% ( -6% - 6%)     0.949
AndHighHighDayTaxoFacets         14.81   (2.4%)     14.81   (2.9%)   -0.0% ( -5% - 5%)     0.992
HighSpanNear                     11.02   (2.0%)     11.02   (2.4%)    0.0% ( -4% - 4%)     0.951
LowSpanNear                     178.01   (1.7%)    178.17   (1.8%)    0.1% ( -3% - 3%)     0.864
OrHighLow                       473.27   (6.2%)    473.94   (3.3%)    0.1% ( -8% - 10%)    0.929
TermDTSort                      230.93   (4.7%)    231.36   (3.3%)    0.2% ( -7% - 8%)     0.885
MedSloppyPhrase                  20.69   (3.0%)     20.76   (3.0%)    0.3% ( -5% - 6%)     0.721
MedSpanNear                      80.38   (2.2%)     80.66   (2.1%)    0.3% ( -3% - 4%)     0.618
MedPhrase                        53.03   (1.8%)     53.23   (1.8%)    0.4% ( -3% - 3%)     0.520
AndHighLow                     2127.04   (4.1%)   2136.55   (3.4%)    0.4% ( -6% - 8%)     0.706
HighTermDayOfYearSort           594.24   (6.9%)    597.03   (6.3%)    0.5% ( -11% - 14%)   0.822
OrHighHigh                       53.90   (5.7%)     54.21   (4.0%)    0.6% ( -8% - 10%)    0.709
MedTermDayTaxoFacets             41.47   (1.1%)     41.75   (2.8%)    0.7% ( -3% - 4%)     0.311
HighPhrase                      119.60   (2.0%)    120.53   (1.8%)    0.8% ( -2% - 4%)     0.195
LowSloppyPhrase                 166.00   (5.7%)    167.36   (5.5%)    0.8% ( -9% - 12%)    0.644
HighSloppyPhrase                 43.02   (5.5%)     43.40   (5.2%)    0.9% ( -9% - 12%)    0.610
BrowseMonthSSDVFacets            24.48   (8.6%)     24.85  (11.6%)    1.5% ( -17% - 23%)   0.638
OrHighMedDayTaxoFacets            7.71   (3.7%)      7.86   (6.0%)    1.9% ( -7% - 12%)    0.218
IntNRQ                          115.47  (14.3%)    118.04  (13.2%)    2.2% ( -22% - 34%)   0.610
```
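As a reading aid for the table, the `Pct diff` column appears to be the relative change in mean QPS from baseline to the modified version. The sketch below assumes exactly that formula; the class and method names are made up for this example, and it does not try to reproduce how the benchmark tool derives the range in parentheses or the p-value.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the Pct diff column from the two QPS means. */
final class PctDiffSketch {
  /** Relative change of candidate QPS versus baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return (candidateQps - baselineQps) / baselineQps * 100.0;
  }

  public static void main(String[] args) {
    // QPS means copied from the OrNotHighHigh and OrHighNotHigh rows.
    System.out.println(String.format(Locale.ROOT, "OrNotHighHigh %.1f%%", pctDiff(426.33, 388.27)));
    System.out.println(String.format(Locale.ROOT, "OrHighNotHigh %.1f%%", pctDiff(621.42, 573.32)));
  }
}
```

Under that assumption the two rows come out at about -8.9% and -7.7%, matching the table. Only the `OrXXXNotYYYY` rows with a p-value of 0.000 stand out from the noise; the facet rows have standard deviations of roughly 30-50%, which is far larger than their reported differences.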
> You explored implementing this new API in several different places:
> BitSetIterator, doc-value iterator, postings, etc. and it's already a bit
> exhausting to review and will get worse when we add more tests. I think it
> would be helpful if we focused on a single thing for the initial PR that
> focuses on proving that this API is a good addition, adds good testing, and
> then implement the new API on other implementations of DocIdSetIterator in
> follow-up PRs.
For sure. Once I'm able to benchmark this and observe a good speedup, and we
are happy with the API, I will break this PR up into smaller pieces.
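One way such a split could work, sketched under assumptions: if the new method has a conservative default, each iterator implementation can opt in separately in its own follow-up PR. The interface below is a hypothetical stand-in and not Lucene's `DocIdSetIterator`; the method name `peekNextNonMatchingDocID` and the `docID() + 1` default are assumptions made for this example. The bitset variant uses only `java.util.BitSet`.

```java
import java.util.BitSet;

/** Hypothetical stand-in for an iterator API with an optional run-aware method. */
interface RunAwareDocIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int docID();

  int nextDoc();

  /** Conservative default: only the current doc is known to match. */
  default int peekNextNonMatchingDocID() {
    return docID() + 1;
  }
}

/** Implementation that has not opted in yet and relies on the default. */
final class ArrayDocIterator implements RunAwareDocIterator {
  private final int[] docs;
  private int idx = -1;

  ArrayDocIterator(int[] sortedDocs) {
    this.docs = sortedDocs;
  }

  @Override
  public int docID() {
    if (idx < 0) return -1;
    return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
  }

  @Override
  public int nextDoc() {
    if (idx < docs.length) idx++;
    return docID();
  }
}

/** Bitset-backed implementation that can report the end of a run directly. */
final class BitSetDocIterator implements RunAwareDocIterator {
  private final BitSet bits;
  private int doc = -1;

  BitSetDocIterator(BitSet bits) {
    this.bits = bits;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() {
    if (doc == NO_MORE_DOCS) return doc;
    int next = bits.nextSetBit(doc + 1);
    doc = next < 0 ? NO_MORE_DOCS : next;
    return doc;
  }

  @Override
  public int peekNextNonMatchingDocID() {
    if (doc < 0 || doc == NO_MORE_DOCS) {
      throw new IllegalStateException("iterator is not positioned on a doc");
    }
    return bits.nextClearBit(doc); // first unset bit after the current run
  }
}
```

Callers written against the default keep working with every implementation, so the initial PR could cover one implementation plus its tests and leave the others for later.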
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]