[jira] [Commented] (LUCENE-10633) Dynamic pruning for queries sorted by SORTED(_SET) field

Adrien Grand (Jira) Wed, 27 Jul 2022 01:50:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571807#comment-17571807
 ]


Adrien Grand commented on LUCENE-10633:
---------------------------------------

The PR is ready for review now if someone is interested in having a look. I 
made an improvement for the very sparse case: after collecting 
{{numHits}} matches, the collector tells the query to only look at 
documents that have a value for the sort field.

One assumption that this change makes is that terms are encoded exactly the 
same way in the terms index and in the doc-values terms dictionary. I think 
it's a fine assumption, but wanted to make it explicit because this 
optimization will lead to runtime errors if this assumption isn't met. This is 
the same assumption that we are already making today when sorting numeric 
fields and using the points index to dynamically prune irrelevant hits.
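
To make the idea concrete, here is a minimal, self-contained sketch. It is a 
toy model, not the code in the PR, and it uses no Lucene classes: the class 
{{SortedFieldPruningSketch}}, its {{search}} method and the {{Result}} type are 
all hypothetical names. A {{TreeMap}} stands in for the terms index, a 
{{String[]}} stands in for doc values, and missing values are assumed to sort 
last. Both structures hold the very same strings, which is the toy equivalent 
of the encoding assumption above. Once {{numHits}} hits are collected, only 
documents whose term sorts strictly before the weakest collected hit are still 
visited, which also excludes documents without a value.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Toy model of terms-index pruning for a string sort; hypothetical, not Lucene code. */
final class SortedFieldPruningSketch {

  /** Top doc IDs in sort order, plus how many docs the collector actually visited. */
  static final class Result {
    final List<Integer> topDocs;
    final int docsVisited;

    Result(List<Integer> topDocs, int docsVisited) {
      this.topDocs = topDocs;
      this.docsVisited = docsVisited;
    }
  }

  /**
   * @param matches doc IDs matching the query, in increasing order
   * @param values sort value per doc ID, or null if the doc has no value (assumed to sort last)
   * @param numHits number of top hits to keep, must be at least 1
   * @param prune whether to restrict later docs to competitive ones
   */
  static Result search(int[] matches, String[] values, int numHits, boolean prune) {
    // Stand-in for the terms index: term -> docs that contain it, in term order.
    TreeMap<String, BitSet> termsIndex = new TreeMap<>();
    for (int doc = 0; doc < values.length; doc++) {
      if (values[doc] != null) {
        termsIndex.computeIfAbsent(values[doc], t -> new BitSet()).set(doc);
      }
    }

    // Order hits by value, ties broken by doc ID.
    Comparator<Integer> byValueThenDoc = (a, b) -> {
      int cmp = values[a].compareTo(values[b]);
      return cmp != 0 ? cmp : Integer.compare(a, b);
    };
    // Reversed so that peek() returns the weakest hit currently in the top numHits.
    PriorityQueue<Integer> queue = new PriorityQueue<>(byValueThenDoc.reversed());

    BitSet competitive = null; // null: no pruning yet, every match is visited
    int visited = 0;
    for (int doc : matches) {
      if (competitive != null && !competitive.get(doc)) {
        continue; // pruned: cannot beat the weakest collected hit
      }
      visited++;
      if (values[doc] == null) {
        continue; // no value for the sort field
      }
      if (queue.size() < numHits) {
        queue.add(doc);
      } else if (values[doc].compareTo(values[queue.peek()]) < 0) {
        queue.poll();
        queue.add(doc);
      } else {
        continue; // weakest hit unchanged, nothing to update
      }
      if (prune && queue.size() == numHits) {
        // Only docs whose term sorts strictly before the weakest hit can still compete.
        competitive = new BitSet();
        for (BitSet postings : termsIndex.headMap(values[queue.peek()], false).values()) {
          competitive.or(postings);
        }
      }
    }

    List<Integer> top = new ArrayList<>(queue);
    top.sort(byValueThenDoc);
    return new Result(top, visited);
  }

  public static void main(String[] args) {
    String[] values = {"m", "a", null, "z", "b", "m", "c", null, "a", "y"};
    int[] matches = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    Result full = search(matches, values, 2, false);
    Result pruned = search(matches, values, 2, true);
    System.out.println("top docs: " + pruned.topDocs + ", same as without pruning: "
        + pruned.topDocs.equals(full.topDocs));
    System.out.println("docs visited: " + full.docsVisited + " without pruning, "
        + pruned.docsVisited + " with pruning");
  }
}
```

The sketch rebuilds the competitive set every time the weakest hit changes and 
still checks one bit per match, which keeps it short. A real implementation 
would presumably update the set less often and skip over non-competitive 
documents rather than test them one by one.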

I ran luceneutil again to verify performance is still good:

{noformat}
                            Task  QPS baseline    StdDev  QPS my_modified_version    StdDev                Pct diff p-value
                HighSloppyPhrase         11.46    (4.3%)                    11.19    (5.3%)   -2.4% ( -11% -    7%) 0.120
                         Prefix3         53.30   (16.7%)                    52.06   (16.8%)   -2.3% ( -30% -   37%) 0.659
            BrowseDateSSDVFacets          5.23   (11.1%)                     5.13   (13.5%)   -1.9% ( -23% -   25%) 0.632
       BrowseDayOfYearSSDVFacets         20.33    (7.6%)                    19.96    (8.6%)   -1.9% ( -16% -   15%) 0.470
           BrowseMonthTaxoFacets         28.62   (12.0%)                    28.11    (7.8%)   -1.8% ( -19% -   20%) 0.582
                    OrHighNotLow       1357.76    (6.3%)                  1334.12    (4.8%)   -1.7% ( -12% -    9%) 0.325
                    OrHighNotMed       1568.25    (4.3%)                  1541.21    (4.8%)   -1.7% ( -10% -    7%) 0.232
                         MedTerm       2422.95    (5.2%)                  2381.38    (4.6%)   -1.7% ( -10% -    8%) 0.269
                        HighTerm       1736.81    (6.5%)                  1710.26    (5.6%)   -1.5% ( -12% -   11%) 0.426
                 MedSloppyPhrase         62.45    (3.4%)                    61.59    (4.1%)   -1.4% (  -8% -    6%) 0.249
                   OrNotHighHigh        931.81    (5.4%)                   919.74    (4.4%)   -1.3% ( -10% -    8%) 0.403
                      OrHighHigh         58.41    (5.3%)                    57.65    (4.1%)   -1.3% ( -10% -    8%) 0.388
                    OrNotHighMed       1179.51    (3.0%)                  1168.53    (3.2%)   -0.9% (  -6% -    5%) 0.338
     BrowseRandomLabelSSDVFacets         14.52    (1.9%)                    14.40    (1.9%)   -0.8% (  -4% -    3%) 0.186
                         LowTerm       1589.67    (3.6%)                  1579.95    (4.6%)   -0.6% (  -8% -    7%) 0.642
            MedTermDayTaxoFacets         52.00    (4.3%)                    51.70    (4.3%)   -0.6% (  -8% -    8%) 0.672
                   OrHighNotHigh       1008.27    (5.9%)                  1002.78    (5.1%)   -0.5% ( -10% -   11%) 0.756
             LowIntervalsOrdered         11.03    (4.8%)                    10.98    (4.4%)   -0.5% (  -9% -    9%) 0.724
          OrHighMedDayTaxoFacets         22.72    (3.5%)                    22.64    (3.1%)   -0.4% (  -6% -    6%) 0.718
                       OrHighLow        899.20    (3.3%)                   896.35    (3.0%)   -0.3% (  -6% -    6%) 0.750
             MedIntervalsOrdered         43.37    (3.6%)                    43.25    (3.7%)   -0.3% (  -7% -    7%) 0.799
            HighIntervalsOrdered         24.44    (5.3%)                    24.37    (5.5%)   -0.3% ( -10% -   11%) 0.864
                    OrNotHighLow       1448.52    (4.0%)                  1446.40    (3.5%)   -0.1% (  -7% -    7%) 0.901
                     LowSpanNear         85.70    (2.4%)                    85.59    (2.2%)   -0.1% (  -4% -    4%) 0.851
                      AndHighLow       1043.29    (5.2%)                  1042.26    (3.9%)   -0.1% (  -8% -    9%) 0.946
                        PKLookup        236.83    (1.4%)                   236.69    (2.2%)   -0.1% (  -3% -    3%) 0.919
            HighTermTitleBDVSort         25.03    (3.5%)                    25.02    (2.6%)   -0.0% (  -5% -    6%) 0.977
                        Wildcard        156.78    (1.9%)                   156.93    (1.8%)    0.1% (  -3% -    3%) 0.877
                     MedSpanNear        214.11    (4.2%)                   214.32    (2.9%)    0.1% (  -6% -    7%) 0.929
                          Fuzzy1        118.50    (1.2%)                   118.67    (0.9%)    0.1% (  -1% -    2%) 0.664
                         Respell         59.34    (1.0%)                    59.43    (0.8%)    0.1% (  -1% -    2%) 0.630
                          Fuzzy2        115.77    (1.1%)                   116.01    (1.1%)    0.2% (  -1% -    2%) 0.549
                 LowSloppyPhrase         89.17    (2.6%)                    89.38    (2.6%)    0.2% (  -4% -    5%) 0.771
                    HighSpanNear         31.18    (4.1%)                    31.28    (3.2%)    0.3% (  -6% -    8%) 0.769
                       OrHighMed        252.02    (3.5%)                   252.99    (2.5%)    0.4% (  -5% -    6%) 0.692
         AndHighMedDayTaxoFacets        151.31    (2.5%)                   152.01    (1.9%)    0.5% (  -3% -    5%) 0.511
                      HighPhrase        369.49    (3.8%)                   371.89    (3.3%)    0.6% (  -6% -    8%) 0.564
                       LowPhrase         61.86    (3.6%)                    62.30    (2.6%)    0.7% (  -5% -    7%) 0.475
                      AndHighMed        227.16    (4.6%)                   228.89    (5.3%)    0.8% (  -8% -   11%) 0.626
                      TermDTSort        826.30    (2.0%)                   833.64    (1.7%)    0.9% (  -2% -    4%) 0.139
        AndHighHighDayTaxoFacets         24.63    (3.5%)                    24.89    (4.2%)    1.0% (  -6% -    9%) 0.400
                       MedPhrase        123.13    (3.4%)                   124.49    (2.5%)    1.1% (  -4% -    7%) 0.243
                          IntNRQ        128.88    (4.9%)                   130.32    (3.6%)    1.1% (  -7% -   10%) 0.410
           HighTermDayOfYearSort       1443.98    (2.1%)                  1461.31    (2.0%)    1.2% (  -2% -    5%) 0.063
                     AndHighHigh         73.59    (4.1%)                    74.62    (5.5%)    1.4% (  -7% -   11%) 0.363
     BrowseRandomLabelTaxoFacets         35.66   (11.8%)                    36.30    (6.0%)    1.8% ( -14% -   22%) 0.544
       BrowseDayOfYearTaxoFacets         44.76   (13.7%)                    45.89    (4.5%)    2.5% ( -13% -   24%) 0.434
            BrowseDateTaxoFacets         43.26   (13.4%)                    44.53    (4.8%)    2.9% ( -13% -   24%) 0.354
           BrowseMonthSSDVFacets         21.46    (6.7%)                    22.22   (10.8%)    3.6% ( -13% -   22%) 0.212
               HighTermTitleSort        115.84    (9.7%)                   544.03   (15.6%)  369.6% ( 313% -  437%) 0.000
               HighTermMonthSort        153.92    (9.5%)                  3899.90  (100.4%) 2433.8% (2122% - 2810%) 0.000
{noformat}

If you wonder why the speedup for HighTermTitleSort is lower than in the 
previous run: the tasks file for wikimedium10m didn't previously include 
queries that sort on the title field, so I had added a single query on the 
HighTerm with the highest document frequency. Now wikimedium10m has sorting 
tasks for all HighTerms, including some with a significantly lower document 
frequency. This is also why the baseline is faster than in the previous run: 
there are fewer hits to sort. This optimization works better with queries 
that have lots of matches. For queries that match few documents and sort on 
a high-cardinality field, it may even make queries a bit slower. Here's the 
performance difference for one of the low-frequency terms ({{fantasy}}):

{noformat}
                            Task  QPS baseline    StdDev  QPS my_modified_version    StdDev                Pct diff p-value
                LowTermTitleSort        434.21    (7.0%)                   427.76    (8.0%)   -1.5% ( -15% -   14%) 0.534
                LowTermMonthSort        770.08    (5.9%)                  1300.48   (21.2%)   68.9% (  39% -  102%) 0.000
{noformat}

In general I think it's a good trade-off: the slower queries are getting 
faster, and queries that are already fast may get just a little bit slower.

> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
>                 Key: LUCENE-10633
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10633
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits 
> when sorting by a numeric field, by leveraging the points index to skip 
> documents that do not compare better than the top of the priority queue 
> maintained by the field comparator.
> However, queries sorted by a SORTED(_SET) field still look at all hits, which 
> is disappointing. Could we leverage the terms index to skip hits?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

