zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1477253989
> I have some suggestions for moving this PR forward:
>
> * Enhance CheckIndex to check that peekNextNonMatchingDocID is correct.
> * Enhance AssertingScorer to check that peekNextNonMatchingDocID is only called when the iterator is positioned. Also check return values.
> * Revert changes to bitsets and doc-value iterators, let's only focus on postings and negations to keep this initial PR simple? We'll add support for bitsets and doc-value iterators in follow-ups? Maybe we could consider conjunctions too for this initial PR, which are far more common than negations in my experience.
> * See if we can leverage skip data to skip over longer ranges of doc IDs with postings.
> * See if we can reduce the slowdown on `OrNotHighHigh` and other negations when the optimization does not kick in.
> * A `quarter` field is a bit extreme, see if we can also observe good speedups with something less extreme like the `month` field?

Thanks @jpountz for the suggestions! The plan makes sense to me. I have pushed a commit https://github.com/apache/lucene/pull/12194/commits/b2184995df34cb1e42ff987c37c7c81ecb55aca6 to revert the changes to bitsets, doc values and `ReqExclScorer` (since it leveraged approximation and also negatively impacted `OrNotXY` queries), and I have also benchmarked the implementation with the `month` field.
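For readers following along, here is a minimal sketch of the idea behind `peekNextNonMatchingDocID` for negations. It is an illustration under assumptions, not the code in this PR: `RunAwareIterator`, `SortedArrayIterator` and `ReqExclSketch` are hypothetical names that do not exist in Lucene, and the contract assumed here is that every doc ID from the current doc up to (but excluding) the returned value matches. The sketch also throws when the method is called on an unpositioned iterator, which is the kind of condition the AssertingScorer suggestion above would check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a doc ID iterator; this is NOT Lucene's DocIdSetIterator. */
interface RunAwareIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int docID();

  /** Moves to the first doc ID >= target and returns it. */
  int advance(int target);

  /** Assumed contract: every doc ID in [docID(), returned value) matches. */
  default int peekNextNonMatchingDocID() {
    return docID() + 1; // conservative fallback: a run of length one
  }
}

/** Iterator over a sorted int array that reports runs of consecutive doc IDs. */
final class SortedArrayIterator implements RunAwareIterator {
  private final int[] docs;
  private int idx = -1; // -1 means "not positioned yet"

  SortedArrayIterator(int... docs) {
    this.docs = docs;
  }

  @Override
  public int docID() {
    if (idx < 0) return -1;
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  @Override
  public int advance(int target) {
    if (idx < 0) idx = 0;
    while (idx < docs.length && docs[idx] < target) idx++;
    return docID();
  }

  @Override
  public int peekNextNonMatchingDocID() {
    if (idx < 0 || idx >= docs.length) {
      throw new IllegalStateException("iterator is not positioned on a doc");
    }
    int i = idx;
    // Walk to the end of the run of consecutive doc IDs starting at the current doc.
    while (i + 1 < docs.length && docs[i + 1] == docs[i] + 1) i++;
    return docs[i] + 1;
  }
}

final class ReqExclSketch {
  /** Collects docs matched by req but not by excl, jumping over excluded runs. */
  static List<Integer> collect(RunAwareIterator req, RunAwareIterator excl) {
    List<Integer> hits = new ArrayList<>();
    int doc = req.advance(0);
    while (doc != RunAwareIterator.NO_MORE_DOCS) {
      if (excl.docID() < doc) excl.advance(doc);
      if (excl.docID() == doc) {
        // The whole run starting at doc is excluded, so move req past it in one call.
        doc = req.advance(excl.peekNextNonMatchingDocID());
      } else {
        hits.add(doc);
        doc = req.advance(doc + 1);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    RunAwareIterator req = new SortedArrayIterator(1, 2, 3, 5, 8, 9, 12);
    RunAwareIterator excl = new SortedArrayIterator(2, 3, 4, 5, 6, 9);
    System.out.println(collect(req, excl)); // prints [1, 8, 12]
  }
}
```

In this toy version the run length is found by a linear scan, so it saves nothing by itself. The point is only the calling pattern: when the excluded clause sits on a long run of matches (as with a `month` term), the required clause can be advanced past the whole run at once instead of being checked doc by doc.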
The latest benchmark results look something like these:

Tasks (had to use `monthPostings` here since `month` was already taken as a doc value field):

```
AndHighNotMonth: +last -monthPostings:jan # freq=830278
AndHighNotMonth: +united -monthPostings:feb # freq=1185528
AndHighNotMonth: +year -monthPostings:mar # freq=1098425
AndHighNotMonth: +its -monthPostings:apr # freq=1160703
AndHighNotMonth: +but -monthPostings:may # freq=1484398
AndMedNotMonth: +mostly -monthPostings:jun # freq=89401
AndMedNotMonth: +interview -monthPostings:jul # freq=94736
AndMedNotMonth: +9 -monthPostings:aug # freq=541405
AndMedNotMonth: +hard -monthPostings:sep # freq=92045
AndMedNotMonth: +bay -monthPostings:oct # freq=117167
```

Results:

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrHighMedDayTaxoFacets      18.68  (5.2%)  18.11  (5.5%)  -3.1% ( -13% - 8%)  0.069
LowIntervalsOrdered         216.28  (8.6%)  211.46  (9.9%)  -2.2% ( -19% - 17%)  0.447
MedIntervalsOrdered         84.99  (11.5%)  83.27  (12.6%)  -2.0% ( -23% - 25%)  0.597
HighIntervalsOrdered        27.27  (8.5%)  26.73  (9.0%)  -2.0% ( -17% - 16%)  0.468
HighTermDayOfYearSort       464.57  (3.5%)  455.78  (6.3%)  -1.9% ( -11% - 8%)  0.237
BrowseDateTaxoFacets        18.29  (3.0%)  18.11  (3.1%)  -1.0% ( -6% - 5%)  0.304
HighTermTitleSort           203.48  (2.1%)  201.63  (2.1%)  -0.9% ( -5% - 3%)  0.174
BrowseDayOfYearSSDVFacets   14.53  (7.4%)  14.40  (6.0%)  -0.9% ( -13% - 13%)  0.674
LowTerm                     1262.59  (5.6%)  1253.06  (6.1%)  -0.8% ( -11% - 11%)  0.684
MedTermDayTaxoFacets        29.75  (2.3%)  29.56  (1.4%)  -0.7% ( -4% - 3%)  0.283
AndHighMed                  366.10  (4.4%)  363.76  (5.3%)  -0.6% ( -9% - 9%)  0.676
HighPhrase                  228.16  (2.5%)  227.25  (2.8%)  -0.4% ( -5% - 4%)  0.633
PKLookup                    314.34  (3.1%)  313.17  (3.7%)  -0.4% ( -6% - 6%)  0.730
AndHighMedDayTaxoFacets     57.70  (2.9%)  57.52  (2.4%)  -0.3% ( -5% - 5%)  0.714
TermDTSort                  211.54  (3.5%)  211.05  (5.1%)  -0.2% ( -8% - 8%)  0.869
AndHighHighDayTaxoFacets    26.74  (1.7%)  26.71  (1.5%)  -0.1% ( -3% - 3%)  0.877
LowPhrase                   20.75  (2.6%)  20.74  (2.7%)  -0.0% ( -5% - 5%)  0.958
MedPhrase                   117.30  (2.4%)  117.36  (1.9%)  0.0% ( -4% - 4%)  0.946
OrHighMed                   202.92  (5.4%)  203.12  (4.6%)  0.1% ( -9% - 10%)  0.950
AndHighHigh                 74.03  (3.9%)  74.10  (4.6%)  0.1% ( -8% - 8%)  0.940
LowSpanNear                 45.31  (1.3%)  45.45  (0.8%)  0.3% ( -1% - 2%)  0.371
Wildcard                    190.90  (3.1%)  191.52  (3.3%)  0.3% ( -5% - 6%)  0.748
BrowseDayOfYearTaxoFacets   14.14  (2.4%)  14.20  (3.6%)  0.4% ( -5% - 6%)  0.687
BrowseDateSSDVFacets        4.98  (17.3%)  5.00  (17.5%)  0.5% ( -29% - 42%)  0.927
Respell                     89.97  (2.0%)  90.42  (1.9%)  0.5% ( -3% - 4%)  0.424
IntNRQ                      85.40  (19.5%)  85.93  (18.0%)  0.6% ( -30% - 47%)  0.916
OrHighHigh                  45.64  (5.8%)  45.93  (3.7%)  0.6% ( -8% - 10%)  0.679
HighTermTitleBDVSort        37.81  (1.0%)  38.09  (2.0%)  0.7% ( -2% - 3%)  0.131
MedSloppyPhrase             19.19  (3.2%)  19.34  (3.9%)  0.8% ( -6% - 8%)  0.485
HighSpanNear                10.28  (1.6%)  10.37  (1.2%)  0.8% ( -1% - 3%)  0.069
OrHighLow                   914.57  (4.8%)  922.44  (4.1%)  0.9% ( -7% - 10%)  0.543
HighSloppyPhrase            8.44  (3.5%)  8.52  (4.1%)  0.9% ( -6% - 8%)  0.452
OrNotHighLow                1686.77  (3.6%)  1702.27  (3.7%)  0.9% ( -6% - 8%)  0.425
LowSloppyPhrase             240.50  (2.0%)  242.74  (2.9%)  0.9% ( -3% - 5%)  0.239
Fuzzy2                      24.44  (1.7%)  24.68  (1.1%)  1.0% ( -1% - 3%)  0.030
HighTermMonthSort           3940.53  (4.8%)  3979.49  (4.7%)  1.0% ( -8% - 10%)  0.507
OrHighNotHigh               510.95  (4.1%)  516.14  (3.5%)  1.0% ( -6% - 8%)  0.398
Fuzzy1                      127.91  (1.6%)  129.29  (1.3%)  1.1% ( -1% - 4%)  0.022
OrNotHighMed                781.43  (3.5%)  790.15  (3.7%)  1.1% ( -5% - 8%)  0.324
OrNotHighHigh               525.59  (3.3%)  532.45  (3.1%)  1.3% ( -4% - 8%)  0.200
MedTerm                     1202.24  (3.7%)  1220.51  (4.8%)  1.5% ( -6% - 10%)  0.263
OrHighNotMed                569.55  (4.2%)  578.34  (3.7%)  1.5% ( -6% - 9%)  0.216
MedSpanNear                 136.46  (2.2%)  138.57  (1.9%)  1.5% ( -2% - 5%)  0.019
OrHighNotLow                676.93  (4.2%)  687.46  (4.0%)  1.6% ( -6% - 10%)  0.233
HighTerm                    946.83  (4.1%)  961.82  (4.9%)  1.6% ( -7% - 10%)  0.266
BrowseMonthSSDVFacets       20.58  (12.2%)  20.94  (21.8%)  1.7% ( -28% - 40%)  0.756
AndHighLow                  1538.93  (5.6%)  1569.90  (5.7%)  2.0% ( -8% - 14%)  0.263
BrowseMonthTaxoFacets       16.67  (3.9%)  17.04  (4.1%)  2.2% ( -5% - 10%)  0.079
Prefix3                     1668.64  (3.7%)  1709.38  (4.6%)  2.4% ( -5% - 11%)  0.064
AndMedNotMonth              879.42  (4.1%)  1142.62  (5.9%)  29.9% ( 19% - 41%)  0.000
AndHighNotMonth             373.72  (4.9%)  600.07  (7.2%)  60.6% ( 46% - 76%)  0.000
```

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseMonthSSDVFacets       22.81  (23.8%)  21.14  (15.0%)  -7.3% ( -37% - 41%)  0.244
BrowseMonthTaxoFacets       17.09  (3.4%)  16.89  (4.2%)  -1.2% ( -8% - 6%)  0.332
HighTermDayOfYearSort       532.86  (5.1%)  528.05  (6.2%)  -0.9% ( -11% - 10%)  0.614
BrowseDayOfYearSSDVFacets   14.73  (3.9%)  14.64  (7.2%)  -0.6% ( -11% - 10%)  0.732
TermDTSort                  226.03  (4.8%)  225.35  (4.8%)  -0.3% ( -9% - 9%)  0.842
PKLookup                    312.28  (3.6%)  311.52  (2.5%)  -0.2% ( -6% - 6%)  0.805
OrHighNotMed                557.57  (5.3%)  556.42  (3.9%)  -0.2% ( -8% - 9%)  0.888
OrHighNotHigh               443.79  (4.5%)  443.12  (3.8%)  -0.2% ( -8% - 8%)  0.910
IntNRQ                      119.18  (2.9%)  119.05  (3.4%)  -0.1% ( -6% - 6%)  0.910
AndHighMedDayTaxoFacets     154.34  (3.0%)  154.17  (4.1%)  -0.1% ( -6% - 7%)  0.921
HighTerm                    642.18  (4.7%)  641.61  (3.0%)  -0.1% ( -7% - 8%)  0.944
AndHighHighDayTaxoFacets    9.79  (3.5%)  9.79  (3.6%)  -0.0% ( -6% - 7%)  0.970
MedTerm                     927.37  (3.9%)  927.88  (3.5%)  0.1% ( -7% - 7%)  0.963
Fuzzy2                      135.91  (2.6%)  136.03  (2.7%)  0.1% ( -5% - 5%)  0.911
MedPhrase                   191.19  (2.5%)  191.39  (2.7%)  0.1% ( -4% - 5%)  0.900
Respell                     97.25  (2.1%)  97.38  (1.4%)  0.1% ( -3% - 3%)  0.820
OrHighNotLow                683.76  (4.4%)  684.68  (4.2%)  0.1% ( -8% - 9%)  0.921
Fuzzy1                      132.14  (2.6%)  132.37  (2.7%)  0.2% ( -4% - 5%)  0.835
OrNotHighMed                634.06  (3.8%)  635.26  (4.9%)  0.2% ( -8% - 9%)  0.892
Wildcard                    247.85  (3.8%)  248.34  (2.1%)  0.2% ( -5% - 6%)  0.837
BrowseDayOfYearTaxoFacets   13.96  (1.8%)  13.99  (1.5%)  0.2% ( -3% - 3%)  0.688
HighPhrase                  203.79  (2.1%)  204.33  (2.5%)  0.3% ( -4% - 4%)  0.713
LowPhrase                   153.51  (1.9%)  154.15  (2.6%)  0.4% ( -4% - 5%)  0.563
LowIntervalsOrdered         146.20  (3.2%)  146.81  (3.4%)  0.4% ( -5% - 7%)  0.688
HighTermTitleBDVSort        32.43  (1.8%)  32.57  (3.1%)  0.4% ( -4% - 5%)  0.587
MedTermDayTaxoFacets        73.26  (1.3%)  73.62  (1.7%)  0.5% ( -2% - 3%)  0.307
MedSpanNear                 99.50  (1.8%)  100.01  (1.8%)  0.5% ( -2% - 4%)  0.365
LowTerm                     1091.52  (5.3%)  1097.52  (4.9%)  0.6% ( -9% - 11%)  0.733
LowSpanNear                 278.50  (2.0%)  280.09  (2.0%)  0.6% ( -3% - 4%)  0.369
HighIntervalsOrdered        36.70  (5.9%)  36.92  (6.5%)  0.6% ( -11% - 13%)  0.758
OrHighMedDayTaxoFacets      6.96  (3.4%)  7.00  (2.9%)  0.6% ( -5% - 7%)  0.518
LowSloppyPhrase             29.02  (3.8%)  29.23  (3.8%)  0.7% ( -6% - 8%)  0.543
OrHighLow                   601.05  (4.1%)  605.60  (5.1%)  0.8% ( -8% - 10%)  0.605
OrNotHighHigh               530.94  (4.3%)  535.17  (3.4%)  0.8% ( -6% - 8%)  0.515
Prefix3                     302.43  (4.2%)  304.98  (5.8%)  0.8% ( -8% - 11%)  0.599
HighSloppyPhrase            56.77  (3.1%)  57.25  (3.6%)  0.8% ( -5% - 7%)  0.429
HighSpanNear                20.22  (1.2%)  20.39  (1.7%)  0.9% ( -2% - 3%)  0.070
MedIntervalsOrdered         103.06  (4.5%)  104.16  (5.0%)  1.1% ( -7% - 10%)  0.473
OrHighMed                   276.47  (2.9%)  279.58  (4.4%)  1.1% ( -5% - 8%)  0.338
OrHighHigh                  51.21  (3.4%)  51.80  (4.2%)  1.1% ( -6% - 9%)  0.341
AndHighMed                  230.11  (3.5%)  232.91  (4.9%)  1.2% ( -6% - 9%)  0.365
AndHighLow                  2054.63  (4.8%)  2080.21  (4.8%)  1.2% ( -7% - 11%)  0.413
MedSloppyPhrase             89.55  (4.0%)  90.71  (5.1%)  1.3% ( -7% - 10%)  0.373
HighTermTitleSort           193.20  (3.1%)  195.79  (4.2%)  1.3% ( -5% - 8%)  0.256
AndHighHigh                 67.40  (3.4%)  68.31  (4.4%)  1.3% ( -6% - 9%)  0.276
BrowseDateTaxoFacets        18.16  (3.0%)  18.42  (3.1%)  1.5% ( -4% - 7%)  0.125
HighTermMonthSort           3913.52  (4.4%)  3974.57  (6.4%)  1.6% ( -8% - 12%)  0.369
OrNotHighLow                1855.59  (5.1%)  1899.00  (5.3%)  2.3% ( -7% - 13%)  0.155
BrowseDateSSDVFacets        4.61  (19.0%)  5.32  (25.4%)  15.5% ( -24% - 73%)  0.029
AndMedNotMonth              865.67  (2.8%)  1088.95  (4.8%)  25.8% ( 17% - 34%)  0.000
AndHighNotMonth             75.79  (0.6%)  275.41  (14.5%)  263.4% ( 246% - 280%)  0.000
```

I'll work on the rest in the next few days.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
