[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

via GitHub Sun, 19 Mar 2023 23:56:51 -0700


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1475711031


   Hi @jpountz, I was able to create a sorted index with a new low-cardinality 
field `quarter`, run some new benchmark tasks (listed below), and see a 
substantial improvement for the new tasks (with up to around -7% impact on 
tasks whose terms do not have long runs of matches):
   
   ```
   AndHighNotQuarter: +last -quarter:q1 #  freq=830278
   AndHighNotQuarter: +united -quarter:q2 #  freq=1185528
   AndHighNotQuarter: +year -quarter:q3 #  freq=1098425
   AndHighNotQuarter: +its -quarter:q4 #  freq=1160703
   AndHighNotQuarter: +but -quarter:q1 #  freq=1484398
   AndMedNotQuarter: +mostly -quarter:q2 #  freq=89401
   AndMedNotQuarter: +interview -quarter:q3 #  freq=94736
   AndMedNotQuarter: +9 -quarter:q4 #  freq=541405
   AndMedNotQuarter: +hard -quarter:q1 #  freq=92045
   AndMedNotQuarter: +bay -quarter:q2 #  freq=117167
   ```
   
   ```
                               Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                      OrNotHighHigh      574.77      (3.1%)      535.14      (3.2%)   -6.9% ( -12% -    0%) 0.000
                      OrHighNotHigh      423.02      (3.6%)      396.20      (3.6%)   -6.3% ( -13% -    0%) 0.000
                       OrHighNotMed      746.53      (3.7%)      702.90      (4.5%)   -5.8% ( -13% -    2%) 0.000
                       OrHighNotLow      750.04      (4.4%)      712.78      (4.5%)   -5.0% ( -13% -    4%) 0.000
                       OrNotHighMed      653.88      (3.0%)      624.44      (3.9%)   -4.5% ( -11% -    2%) 0.000
                       OrNotHighLow     1205.35      (5.4%)     1185.91      (5.8%)   -1.6% ( -12% -   10%) 0.363
                           PKLookup      269.07      (2.4%)      266.45      (3.0%)   -1.0% (  -6% -    4%) 0.259
                         AndHighMed      146.24      (6.4%)      144.93      (6.1%)   -0.9% ( -12% -   12%) 0.649
                           Wildcard      220.71      (3.3%)      218.75      (2.7%)   -0.9% (  -6% -    5%) 0.352
                  HighTermTitleSort      148.58      (3.3%)      147.26      (2.7%)   -0.9% (  -6% -    5%) 0.347
               HighTermTitleBDVSort       35.47      (2.5%)       35.18      (2.0%)   -0.8% (  -5% -    3%) 0.247
              BrowseMonthTaxoFacets       35.20     (35.1%)       34.93     (38.6%)   -0.8% ( -55% -  112%) 0.948
                           HighTerm      714.46      (4.1%)      710.25      (3.5%)   -0.6% (  -7% -    7%) 0.626
                            MedTerm     1063.86      (4.3%)     1058.95      (3.4%)   -0.5% (  -7% -    7%) 0.707
                          OrHighMed      192.38      (3.0%)      191.49      (2.6%)   -0.5% (  -5% -    5%) 0.602
                            Prefix3      259.99      (5.9%)      258.85      (6.3%)   -0.4% ( -11% -   12%) 0.821
                         OrHighHigh       39.21      (2.9%)       39.06      (2.2%)   -0.4% (  -5% -    4%) 0.639
                          LowPhrase       65.71      (2.5%)       65.53      (2.6%)   -0.3% (  -5% -    4%) 0.723
                            Respell      105.97      (2.6%)      105.72      (1.7%)   -0.2% (  -4% -    4%) 0.740
               MedTermDayTaxoFacets       38.03      (2.7%)       37.98      (2.2%)   -0.1% (  -4% -    4%) 0.869
                       HighSpanNear       10.67      (2.5%)       10.66      (2.3%)   -0.1% (  -4% -    4%) 0.865
                          MedPhrase       97.20      (3.3%)       97.11      (2.8%)   -0.1% (  -6% -    6%) 0.931
                         AndHighLow     1597.01      (5.2%)     1596.89      (4.1%)   -0.0% (  -8% -    9%) 0.996
                        AndHighHigh       64.16      (4.5%)       64.17      (4.0%)    0.0% (  -8% -    8%) 0.995
            AndHighMedDayTaxoFacets       91.65      (1.6%)       91.73      (2.4%)    0.1% (  -3% -    4%) 0.894
                             Fuzzy2      101.73      (2.5%)      101.84      (2.5%)    0.1% (  -4% -    5%) 0.889
           AndHighHighDayTaxoFacets        6.90      (3.0%)        6.91      (2.3%)    0.1% (  -4% -    5%) 0.890
                   HighSloppyPhrase       39.46      (3.9%)       39.54      (3.2%)    0.2% (  -6% -    7%) 0.870
             OrHighMedDayTaxoFacets       22.70      (4.9%)       22.74      (5.7%)    0.2% (  -9% -   11%) 0.899
                        MedSpanNear      117.54      (2.8%)      117.84      (1.9%)    0.3% (  -4% -    5%) 0.735
                             Fuzzy1      189.31      (2.7%)      189.83      (2.8%)    0.3% (  -5% -    5%) 0.753
                         HighPhrase       92.14      (2.9%)       92.41      (2.5%)    0.3% (  -4% -    5%) 0.731
                        LowSpanNear      111.85      (2.1%)      112.29      (2.6%)    0.4% (  -4% -    5%) 0.594
                          OrHighLow      175.33      (4.6%)      176.33      (4.0%)    0.6% (  -7% -    9%) 0.675
                    LowSloppyPhrase       60.26      (3.4%)       60.65      (3.0%)    0.7% (  -5% -    7%) 0.522
                    MedSloppyPhrase      102.93      (3.3%)      103.90      (3.2%)    0.9% (  -5% -    7%) 0.362
                MedIntervalsOrdered       18.60      (5.5%)       18.82      (6.5%)    1.2% ( -10% -   13%) 0.542
               HighIntervalsOrdered        5.94      (8.4%)        6.01      (8.9%)    1.2% ( -14% -   20%) 0.671
                             IntNRQ      183.09      (7.7%)      185.85      (5.7%)    1.5% ( -11% -   16%) 0.482
                  HighTermMonthSort     3709.41      (5.6%)     3771.76      (6.9%)    1.7% ( -10% -   15%) 0.400
              HighTermDayOfYearSort      492.50      (5.3%)      501.64      (3.4%)    1.9% (  -6% -   11%) 0.185
                LowIntervalsOrdered      252.81      (8.1%)      258.30      (9.8%)    2.2% ( -14% -   21%) 0.446
                            LowTerm     1057.85      (6.4%)     1081.16      (8.1%)    2.2% ( -11% -   17%) 0.342
                         TermDTSort      253.74      (5.9%)      259.66      (4.5%)    2.3% (  -7% -   13%) 0.161
               BrowseDateSSDVFacets        5.08     (21.2%)        5.28     (24.8%)    3.9% ( -34% -   63%) 0.589
          BrowseDayOfYearSSDVFacets       26.14     (25.7%)       27.21     (34.3%)    4.1% ( -44% -   86%) 0.670
              BrowseMonthSSDVFacets       26.25     (23.4%)       27.98     (30.7%)    6.6% ( -38% -   79%) 0.446
          BrowseDayOfYearTaxoFacets       39.48     (35.1%)       42.26     (34.3%)    7.0% ( -46% -  117%) 0.522
               BrowseDateTaxoFacets       39.49     (35.1%)       42.32     (34.3%)    7.2% ( -46% -  117%) 0.513
                  AndHighNotQuarter      147.80      (1.7%)      389.73      (9.5%)  163.7% ( 150% -  177%) 0.000
                   AndMedNotQuarter      112.31      (1.1%)      430.97     (13.3%)  283.7% ( 266% -  301%) 0.000
   ```
   
   I feel the result so far looks promising. With regard to the question I 
raised earlier:
   
   > Lucene does a lot of two-phase iteration, and a two-phase iterator's 
approximation may provide a superset of the actual matches. If we were to use 
this API to find and skip over a run of doc IDs from the approximation, 
wouldn't the result be inaccurate? 
   
   Maybe one "solution" would be to mark this API as expert-only and warn that 
any concrete implementation must provide an exact range rather than an 
approximated one?
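
To make the concern concrete, here is a minimal, self-contained sketch. It does not use Lucene classes: `ArrayDocIterator` is a toy stand-in for a DISI, and `peekNextNonMatchingDocID()` is only an assumed name for the proposed API, not necessarily what the PR ends up with. It shows an AND-NOT loop that skips a whole run of excluded docs in one step, and how the result goes wrong when the excluded iterator is an approximation (a superset) rather than the exact match set.

```java
import java.util.ArrayList;
import java.util.List;

public class RunSkippingSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Toy iterator over a sorted array of doc IDs; a stand-in, not Lucene's DocIdSetIterator. */
  static final class ArrayDocIterator {
    private final int[] docs;
    private int idx = -1;

    ArrayDocIterator(int... docs) {
      this.docs = docs;
    }

    int docID() {
      if (idx < 0) return -1;
      return idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }

    int nextDoc() {
      idx++;
      return docID();
    }

    /** Moves to the first doc >= target (stays put if already there). */
    int advance(int target) {
      while (docID() < target) idx++;
      return docID();
    }

    /**
     * Hypothetical API: the first doc ID after the current doc that this
     * iterator does not contain. Only valid while positioned on a doc.
     */
    int peekNextNonMatchingDocID() {
      int i = idx;
      while (i + 1 < docs.length && docs[i + 1] == docs[i] + 1) i++;
      return docs[i] + 1;
    }
  }

  /** Collects docs matching `req` but not `excl`, skipping whole excluded runs at once. */
  static List<Integer> andNot(ArrayDocIterator req, ArrayDocIterator excl) {
    List<Integer> hits = new ArrayList<>();
    int doc = req.nextDoc();
    while (doc != NO_MORE_DOCS) {
      if (excl.docID() < doc) excl.advance(doc);
      if (excl.docID() == doc) {
        // Excluded: jump the required iterator past the whole excluded run.
        doc = req.advance(excl.peekNextNonMatchingDocID());
      } else {
        hits.add(doc);
        doc = req.nextDoc();
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // Excluded clause given as its exact matches {2, 3, 5}: doc 4 is kept.
    System.out.println(
        andNot(new ArrayDocIterator(1, 3, 4, 6), new ArrayDocIterator(2, 3, 5))); // [1, 4, 6]
    // Same clause given as an approximation {2, 3, 4, 5}, where doc 4 would
    // fail confirmation: the run 2..5 is skipped and doc 4 is wrongly dropped.
    System.out.println(
        andNot(new ArrayDocIterator(1, 3, 4, 6), new ArrayDocIterator(2, 3, 4, 5))); // [1, 6]
  }
}
```

The second call is the failure mode: the run reported by the approximation covers doc 4, which the confirmation step would have rejected, so the caller never gets a chance to check it. Requiring exact ranges from implementations avoids that by contract.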
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

