[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

via GitHub Wed, 08 Mar 2023 20:21:14 -0800


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1461257536


   Thanks @jpountz for the review and comment!
   
   >Did you manage to observe some speedups with this change? 
   
   So far I have only been able to run `wikimedium10m`, and the implementation 
shows a slowdown of up to roughly 9% (listed below) for the full-text boolean 
queries `OrXXXNotYYYY`, due to the changes in `ReqExclScorer` and 
`Lucene90PostingsReader` (the facet tasks don't seem to exercise the changes, so 
their differences should be just random fluctuation). I'm still searching for 
existing benchmarking tasks that can measure these targeted use cases:
   
   > Is it actually common to have long runs of matches? For full-text indexes, 
maybe not so much, only stop words may have runs of adjacent matches. For 
string fields, this may happen if the field has a default value that is the 
value of most documents in the collection. Also it's possible for users to use 
index sorting in order to cluster similar documents together, which increases 
the likelihood to have long runs of adjacent matches.
   
   Do you have any pointers on which benchmark task I could use? If 
there isn't one available, I could try adding one next. 
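
To make the targeted use case concrete, here is a minimal, self-contained sketch of how an exclusion loop could skip a whole run of adjacent excluded docs in one step. Everything in it is hypothetical and for illustration only: `RunAwareIterator`, `endOfRun()`, `SortedArrayIterator` and `reqExcl` are made-up names, not the API proposed in this PR and not Lucene's `DocIdSetIterator` or `ReqExclScorer`.

```java
import java.util.ArrayList;
import java.util.List;

final class RunAwareExclusion {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical iterator over increasing doc IDs (NOT Lucene's DocIdSetIterator). */
    interface RunAwareIterator {
        int docID();
        int advance(int target);
        /** Hypothetical: first doc ID after the current run of adjacent matches (exclusive end). */
        int endOfRun();
    }

    /** Toy implementation backed by a strictly increasing array of doc IDs. */
    static final class SortedArrayIterator implements RunAwareIterator {
        private final int[] docs;
        private int idx = -1;

        SortedArrayIterator(int... docs) { this.docs = docs; }

        public int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }

        public int advance(int target) {
            // Move forward at least once, then until docs[idx] >= target or exhausted.
            do { idx++; } while (idx < docs.length && docs[idx] < target);
            return docID();
        }

        public int endOfRun() {
            // Only valid while positioned on a doc; scans ahead without moving the iterator.
            int i = idx;
            while (i + 1 < docs.length && docs[i + 1] == docs[i] + 1) i++;
            return docs[i] + 1;
        }
    }

    /** Returns the required docs that are not excluded, skipping excluded runs in bulk. */
    static List<Integer> reqExcl(int[] required, RunAwareIterator excluded) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i < required.length) {
            int doc = required[i];
            if (excluded.docID() < doc) excluded.advance(doc);
            if (excluded.docID() == doc) {
                // Every doc in [doc, end) is excluded, so skip them without re-checking each one.
                int end = excluded.endOfRun();
                while (i < required.length && required[i] < end) i++;
            } else {
                hits.add(doc);
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] required = {1, 2, 3, 4, 5, 6, 10, 11};
        System.out.println(reqExcl(required, new SortedArrayIterator(2, 3, 4, 5, 20)));
        // prints [1, 6, 10, 11]
    }
}
```

In this toy version `endOfRun()` is a linear scan, so it saves nothing by itself; the idea only pays off when an implementation can answer that question cheaply, which is exactly what a targeted benchmark would need to exercise.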
   
   ```
                               Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
        BrowseRandomLabelTaxoFacets       39.06     (49.1%)       33.42     (45.6%)  -14.4% ( -73% -  157%) 0.335
               BrowseDateTaxoFacets       34.22     (29.5%)       30.49     (27.9%)  -10.9% ( -52% -   65%) 0.229
          BrowseDayOfYearTaxoFacets       34.30     (29.4%)       30.57     (27.9%)  -10.9% ( -52% -   65%) 0.230
                      OrNotHighHigh      426.33      (2.7%)      388.27      (1.6%)   -8.9% ( -12% -   -4%) 0.000
                      OrHighNotHigh      621.42      (2.7%)      573.32      (2.0%)   -7.7% ( -12% -   -3%) 0.000
                       OrHighNotMed      616.08      (3.8%)      573.20      (2.7%)   -7.0% ( -13% -    0%) 0.000
                       OrHighNotLow      562.98      (4.0%)      525.80      (3.3%)   -6.6% ( -13% -    0%) 0.000
                       OrNotHighMed      712.40      (2.6%)      672.88      (2.4%)   -5.5% ( -10% -    0%) 0.000
               HighTermTitleBDVSort       19.04      (7.9%)       18.73      (8.3%)   -1.6% ( -16% -   15%) 0.534
               HighIntervalsOrdered        1.89     (12.3%)        1.86     (15.0%)   -1.6% ( -25% -   29%) 0.719
              BrowseMonthTaxoFacets       32.23     (33.9%)       31.77     (33.6%)   -1.4% ( -51% -  100%) 0.895
                       OrNotHighLow     1719.50      (3.8%)     1696.85      (4.6%)   -1.3% (  -9% -    7%) 0.326
                  HighTermTitleSort      202.79      (2.9%)      200.20      (2.8%)   -1.3% (  -6% -    4%) 0.155
                        AndHighHigh       51.74      (5.8%)       51.08      (5.4%)   -1.3% ( -11% -   10%) 0.475
                             Fuzzy1       59.04      (2.7%)       58.36      (3.2%)   -1.2% (  -6% -    4%) 0.214
                            MedTerm     1364.68      (4.4%)     1349.31      (3.4%)   -1.1% (  -8% -    6%) 0.362
                           Wildcard      314.79      (2.8%)      311.35      (3.5%)   -1.1% (  -7% -    5%) 0.277
                            LowTerm     2087.86      (3.2%)     2065.24      (3.8%)   -1.1% (  -7% -    6%) 0.334
                MedIntervalsOrdered       22.66      (8.6%)       22.42     (10.4%)   -1.0% ( -18% -   19%) 0.730
                           PKLookup      331.54      (2.9%)      328.12      (2.6%)   -1.0% (  -6% -    4%) 0.242
                LowIntervalsOrdered      161.90      (9.5%)      160.23     (11.6%)   -1.0% ( -20% -   22%) 0.758
                             Fuzzy2      100.43      (1.5%)       99.40      (3.0%)   -1.0% (  -5% -    3%) 0.169
                            Respell       88.01      (2.0%)       87.27      (2.4%)   -0.8% (  -5% -    3%) 0.223
               BrowseDateSSDVFacets        4.89     (21.4%)        4.85     (20.2%)   -0.8% ( -34% -   51%) 0.905
        BrowseRandomLabelSSDVFacets       19.23      (7.1%)       19.09      (6.2%)   -0.7% ( -13% -   13%) 0.728
                         AndHighMed      114.49      (5.6%)      113.77      (5.0%)   -0.6% ( -10% -   10%) 0.708
                            Prefix3      376.91      (1.4%)      374.65      (2.5%)   -0.6% (  -4% -    3%) 0.348
                  HighTermMonthSort     4250.83      (4.2%)     4227.31      (3.6%)   -0.6% (  -7% -    7%) 0.653
                          OrHighMed      209.50      (6.2%)      208.61      (3.4%)   -0.4% (  -9% -    9%) 0.787
                          LowPhrase       89.33      (3.0%)       88.96      (2.2%)   -0.4% (  -5% -    4%) 0.623
          BrowseDayOfYearSSDVFacets       24.82     (10.9%)       24.75     (11.2%)   -0.3% ( -20% -   24%) 0.940
            AndHighMedDayTaxoFacets      158.35      (1.6%)      158.08      (1.8%)   -0.2% (  -3% -    3%) 0.756
                           HighTerm     2076.24      (3.7%)     2074.83      (2.9%)   -0.1% (  -6% -    6%) 0.949
           AndHighHighDayTaxoFacets       14.81      (2.4%)       14.81      (2.9%)   -0.0% (  -5% -    5%) 0.992
                       HighSpanNear       11.02      (2.0%)       11.02      (2.4%)    0.0% (  -4% -    4%) 0.951
                        LowSpanNear      178.01      (1.7%)      178.17      (1.8%)    0.1% (  -3% -    3%) 0.864
                          OrHighLow      473.27      (6.2%)      473.94      (3.3%)    0.1% (  -8% -   10%) 0.929
                         TermDTSort      230.93      (4.7%)      231.36      (3.3%)    0.2% (  -7% -    8%) 0.885
                    MedSloppyPhrase       20.69      (3.0%)       20.76      (3.0%)    0.3% (  -5% -    6%) 0.721
                        MedSpanNear       80.38      (2.2%)       80.66      (2.1%)    0.3% (  -3% -    4%) 0.618
                          MedPhrase       53.03      (1.8%)       53.23      (1.8%)    0.4% (  -3% -    3%) 0.520
                         AndHighLow     2127.04      (4.1%)     2136.55      (3.4%)    0.4% (  -6% -    8%) 0.706
              HighTermDayOfYearSort      594.24      (6.9%)      597.03      (6.3%)    0.5% ( -11% -   14%) 0.822
                         OrHighHigh       53.90      (5.7%)       54.21      (4.0%)    0.6% (  -8% -   10%) 0.709
               MedTermDayTaxoFacets       41.47      (1.1%)       41.75      (2.8%)    0.7% (  -3% -    4%) 0.311
                         HighPhrase      119.60      (2.0%)      120.53      (1.8%)    0.8% (  -2% -    4%) 0.195
                    LowSloppyPhrase      166.00      (5.7%)      167.36      (5.5%)    0.8% (  -9% -   12%) 0.644
                   HighSloppyPhrase       43.02      (5.5%)       43.40      (5.2%)    0.9% (  -9% -   12%) 0.610
              BrowseMonthSSDVFacets       24.48      (8.6%)       24.85     (11.6%)    1.5% ( -17% -   23%) 0.638
             OrHighMedDayTaxoFacets        7.71      (3.7%)        7.86      (6.0%)    1.9% (  -7% -   12%) 0.218
                             IntNRQ      115.47     (14.3%)      118.04     (13.2%)    2.2% ( -22% -   34%) 0.610
   
   ```
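
For reading the `Pct diff` column, a small sketch follows. It assumes the column is simply the relative change in mean QPS from baseline to the modified version, which is consistent with the rows above but is my assumption, not something taken from the benchmark tool's source. The class and method names are hypothetical.

```java
import java.util.Locale;

final class PctDiff {
    /** Relative change in QPS, in percent; negative means the candidate is slower. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // OrNotHighHigh row: 426.33 QPS baseline vs 388.27 QPS modified.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctDiff(426.33, 388.27)));
        // prints -8.9%
    }
}
```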
   
   >You explored implementing this new API in several different places: 
BitSetIterator, doc-value iterator, postings, etc. and it's already a bit 
exhausting to review and will get worse when we add more tests. I think it 
would be helpful if we focused on a single thing for the initial PR that 
focuses on proving that this API is a good addition, adds good testing, and 
then implement the new API on other implementations of DocIdSetIterator in 
follow-up PRs.
   
   For sure. Once I'm able to benchmark this and observe a good speedup, and we 
are happy with the API, I will break this PR up into smaller pieces.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

