zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1475711031
Hi @jpountz, I was able to create a sorted index with a new low-cardinality
field `quarter`, run some new benchmark tasks like the ones below, and see a
substantial improvement on the new tasks (at a cost of up to about 7% on
existing tasks whose terms do not form long runs):
```
AndHighNotQuarter: +last -quarter:q1 # freq=830278
AndHighNotQuarter: +united -quarter:q2 # freq=1185528
AndHighNotQuarter: +year -quarter:q3 # freq=1098425
AndHighNotQuarter: +its -quarter:q4 # freq=1160703
AndHighNotQuarter: +but -quarter:q1 # freq=1484398
AndMedNotQuarter: +mostly -quarter:q2 # freq=89401
AndMedNotQuarter: +interview -quarter:q3 # freq=94736
AndMedNotQuarter: +9 -quarter:q4 # freq=541405
AndMedNotQuarter: +hard -quarter:q1 # freq=92045
AndMedNotQuarter: +bay -quarter:q2 # freq=117167
```
```
Task                       QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff                 p-value
OrNotHighHigh                    574.77  (3.1%)                   535.14  (3.2%)     -6.9% ( -12% -    0%)   0.000
OrHighNotHigh                    423.02  (3.6%)                   396.20  (3.6%)     -6.3% ( -13% -    0%)   0.000
OrHighNotMed                     746.53  (3.7%)                   702.90  (4.5%)     -5.8% ( -13% -    2%)   0.000
OrHighNotLow                     750.04  (4.4%)                   712.78  (4.5%)     -5.0% ( -13% -    4%)   0.000
OrNotHighMed                     653.88  (3.0%)                   624.44  (3.9%)     -4.5% ( -11% -    2%)   0.000
OrNotHighLow                    1205.35  (5.4%)                  1185.91  (5.8%)     -1.6% ( -12% -   10%)   0.363
PKLookup                         269.07  (2.4%)                   266.45  (3.0%)     -1.0% (  -6% -    4%)   0.259
AndHighMed                       146.24  (6.4%)                   144.93  (6.1%)     -0.9% ( -12% -   12%)   0.649
Wildcard                         220.71  (3.3%)                   218.75  (2.7%)     -0.9% (  -6% -    5%)   0.352
HighTermTitleSort                148.58  (3.3%)                   147.26  (2.7%)     -0.9% (  -6% -    5%)   0.347
HighTermTitleBDVSort              35.47  (2.5%)                    35.18  (2.0%)     -0.8% (  -5% -    3%)   0.247
BrowseMonthTaxoFacets             35.20 (35.1%)                    34.93 (38.6%)     -0.8% ( -55% -  112%)   0.948
HighTerm                         714.46  (4.1%)                   710.25  (3.5%)     -0.6% (  -7% -    7%)   0.626
MedTerm                         1063.86  (4.3%)                  1058.95  (3.4%)     -0.5% (  -7% -    7%)   0.707
OrHighMed                        192.38  (3.0%)                   191.49  (2.6%)     -0.5% (  -5% -    5%)   0.602
Prefix3                          259.99  (5.9%)                   258.85  (6.3%)     -0.4% ( -11% -   12%)   0.821
OrHighHigh                        39.21  (2.9%)                    39.06  (2.2%)     -0.4% (  -5% -    4%)   0.639
LowPhrase                         65.71  (2.5%)                    65.53  (2.6%)     -0.3% (  -5% -    4%)   0.723
Respell                          105.97  (2.6%)                   105.72  (1.7%)     -0.2% (  -4% -    4%)   0.740
MedTermDayTaxoFacets              38.03  (2.7%)                    37.98  (2.2%)     -0.1% (  -4% -    4%)   0.869
HighSpanNear                      10.67  (2.5%)                    10.66  (2.3%)     -0.1% (  -4% -    4%)   0.865
MedPhrase                         97.20  (3.3%)                    97.11  (2.8%)     -0.1% (  -6% -    6%)   0.931
AndHighLow                      1597.01  (5.2%)                  1596.89  (4.1%)     -0.0% (  -8% -    9%)   0.996
AndHighHigh                       64.16  (4.5%)                    64.17  (4.0%)      0.0% (  -8% -    8%)   0.995
AndHighMedDayTaxoFacets           91.65  (1.6%)                    91.73  (2.4%)      0.1% (  -3% -    4%)   0.894
Fuzzy2                           101.73  (2.5%)                   101.84  (2.5%)      0.1% (  -4% -    5%)   0.889
AndHighHighDayTaxoFacets           6.90  (3.0%)                     6.91  (2.3%)      0.1% (  -4% -    5%)   0.890
HighSloppyPhrase                  39.46  (3.9%)                    39.54  (3.2%)      0.2% (  -6% -    7%)   0.870
OrHighMedDayTaxoFacets            22.70  (4.9%)                    22.74  (5.7%)      0.2% (  -9% -   11%)   0.899
MedSpanNear                      117.54  (2.8%)                   117.84  (1.9%)      0.3% (  -4% -    5%)   0.735
Fuzzy1                           189.31  (2.7%)                   189.83  (2.8%)      0.3% (  -5% -    5%)   0.753
HighPhrase                        92.14  (2.9%)                    92.41  (2.5%)      0.3% (  -4% -    5%)   0.731
LowSpanNear                      111.85  (2.1%)                   112.29  (2.6%)      0.4% (  -4% -    5%)   0.594
OrHighLow                        175.33  (4.6%)                   176.33  (4.0%)      0.6% (  -7% -    9%)   0.675
LowSloppyPhrase                   60.26  (3.4%)                    60.65  (3.0%)      0.7% (  -5% -    7%)   0.522
MedSloppyPhrase                  102.93  (3.3%)                   103.90  (3.2%)      0.9% (  -5% -    7%)   0.362
MedIntervalsOrdered               18.60  (5.5%)                    18.82  (6.5%)      1.2% ( -10% -   13%)   0.542
HighIntervalsOrdered               5.94  (8.4%)                     6.01  (8.9%)      1.2% ( -14% -   20%)   0.671
IntNRQ                           183.09  (7.7%)                   185.85  (5.7%)      1.5% ( -11% -   16%)   0.482
HighTermMonthSort               3709.41  (5.6%)                  3771.76  (6.9%)      1.7% ( -10% -   15%)   0.400
HighTermDayOfYearSort            492.50  (5.3%)                   501.64  (3.4%)      1.9% (  -6% -   11%)   0.185
LowIntervalsOrdered              252.81  (8.1%)                   258.30  (9.8%)      2.2% ( -14% -   21%)   0.446
LowTerm                         1057.85  (6.4%)                  1081.16  (8.1%)      2.2% ( -11% -   17%)   0.342
TermDTSort                       253.74  (5.9%)                   259.66  (4.5%)      2.3% (  -7% -   13%)   0.161
BrowseDateSSDVFacets               5.08 (21.2%)                     5.28 (24.8%)      3.9% ( -34% -   63%)   0.589
BrowseDayOfYearSSDVFacets         26.14 (25.7%)                    27.21 (34.3%)      4.1% ( -44% -   86%)   0.670
BrowseMonthSSDVFacets             26.25 (23.4%)                    27.98 (30.7%)      6.6% ( -38% -   79%)   0.446
BrowseDayOfYearTaxoFacets         39.48 (35.1%)                    42.26 (34.3%)      7.0% ( -46% -  117%)   0.522
BrowseDateTaxoFacets              39.49 (35.1%)                    42.32 (34.3%)      7.2% ( -46% -  117%)   0.513
AndHighNotQuarter                147.80  (1.7%)                   389.73  (9.5%)    163.7% ( 150% -  177%)   0.000
AndMedNotQuarter                 112.31  (1.1%)                   430.97 (13.3%)    283.7% ( 266% -  301%)   0.000
```
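To illustrate where a gain on the `AndHighNotQuarter` / `AndMedNotQuarter` tasks could come from, here is a minimal standalone sketch. It assumes the index sort places all docs of one `quarter` value in a single contiguous doc ID range, and it uses plain arrays rather than Lucene classes. The names `excludeDocByDoc`, `excludeByRun` and `advance` are hypothetical and are not the API proposed in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are Lucene APIs.
class RunSkippingSketch {

  /** Required docs minus those inside the excluded run [runStart, runEnd), checked one by one. */
  static List<Integer> excludeDocByDoc(int[] required, int runStart, int runEnd) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : required) {
      if (doc < runStart || doc >= runEnd) {
        hits.add(doc);
      }
    }
    return hits;
  }

  /** Same result, but the required clause jumps past the whole excluded run in one step. */
  static List<Integer> excludeByRun(int[] required, int runStart, int runEnd) {
    List<Integer> hits = new ArrayList<>();
    int i = 0;
    while (i < required.length) {
      int doc = required[i];
      if (doc >= runStart && doc < runEnd) {
        // One jump replaces a per-doc exclusion check for every doc in the run.
        i = advance(required, i, runEnd);
      } else {
        hits.add(doc);
        i++;
      }
    }
    return hits;
  }

  /** Index of the first entry >= target at or after index 'from' (binary search on sorted docs). */
  static int advance(int[] docs, int from, int target) {
    int lo = from;
    int hi = docs.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] required = {2, 5, 6, 7, 8, 40, 41, 90}; // sorted doc IDs of the required term
    System.out.println(excludeDocByDoc(required, 5, 50));
    System.out.println(excludeByRun(required, 5, 50));
  }
}
```

Both methods return the same hits; the second only does less work when the excluded clause really is one long run, which is the situation the sorted `quarter` field is meant to create.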
I feel the result so far looks promising. With regard to the question I
raised earlier:
> As Lucene does a lot of two phase iterations, and two phase iterator's
> approximation may provide a superset of the actual matches. If we were to use
> this API to find and ignore / skip over a bunch of doc ids from approximation,
> wouldn't the result be inaccurate?

Maybe one "solution" could be to mark this API as expert-only, and warn that
any concrete implementation must report an exact range rather than an
approximate one?
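The concern can be made concrete with a small standalone sketch. It models an excluded clause whose approximation covers a contiguous range but whose confirmation step rejects some docs in it. This is only a model of the two-phase idea, not Lucene's `TwoPhaseIterator`; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch: models the idea of two-phase iteration, not Lucene's classes.
class ApproximateRunSketch {

  /** Excludes a doc only if it is in the approximation AND the confirmation step accepts it. */
  static List<Integer> exact(int[] required, int runStart, int runEnd, IntPredicate confirm) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : required) {
      boolean inApproximation = doc >= runStart && doc < runEnd;
      boolean excluded = inApproximation && confirm.test(doc);
      if (!excluded) {
        hits.add(doc);
      }
    }
    return hits;
  }

  /** Treats the approximate run as exact and drops every doc inside it without confirming. */
  static List<Integer> trustingApproximateRun(int[] required, int runStart, int runEnd) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : required) {
      if (doc < runStart || doc >= runEnd) {
        hits.add(doc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] required = {3, 10, 11, 12, 20};
    // The approximation claims docs 10..12 match the excluded clause,
    // but doc 11 is a false positive that the confirmation step rejects.
    IntPredicate confirm = doc -> doc != 11;
    System.out.println(exact(required, 10, 13, confirm));
    System.out.println(trustingApproximateRun(required, 10, 13));
  }
}
```

In this model the second method loses doc 11, a doc the excluded clause does not actually match. That is why a run reported by such an API would need to be exact rather than derived from an approximation.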
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]