[ https://issues.apache.org/jira/browse/LUCENE-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571807#comment-17571807 ]
Adrien Grand commented on LUCENE-10633:
---------------------------------------
The PR is ready for review now if someone is interested in having a look. I
also improved the very sparse case: once {{numHits}} matches have been
collected, the collector tells the query to only look at documents that have
a value for the sort field.
One assumption that this change makes is that terms are encoded exactly the
same way in the terms index and in the doc-values terms dictionary. I think
it's a fine assumption, but wanted to make it explicit because this
optimization will lead to runtime errors if this assumption isn't met. This is
the same assumption that we are already making today when sorting numeric
fields and using the points index to dynamically prune irrelevant hits.
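To make the idea concrete, here is a minimal sketch of the pruning logic. It is a toy model and not the code from the PR: plain Java collections stand in for the terms index and the doc values, and the names {{SortedFieldPruningSketch}}, {{termsIndex}}, {{topK}} and {{valuesRead}} are hypothetical. It assumes an ascending sort with missing values last, {{numHits}} >= 1, and hits visited in increasing doc ID order. Both structures are keyed by the same {{String}}, which is the toy equivalent of the encoding assumption above: looking up the current bottom value in the terms index only works because it is encoded the same way in both places.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Toy model only: plain collections stand in for the terms index and doc values. */
class SortedFieldPruningSketch {
    /** Number of hits whose sort value was read by the last topK call. */
    static int valuesRead;

    /** Hypothetical terms index: term -> doc IDs. Docs without a value have no entry. */
    static NavigableMap<String, List<Integer>> termsIndex(String[] docValues) {
        NavigableMap<String, List<Integer>> index = new TreeMap<>();
        for (int doc = 0; doc < docValues.length; doc++) {
            if (docValues[doc] != null) {
                index.computeIfAbsent(docValues[doc], t -> new ArrayList<>()).add(doc);
            }
        }
        return index;
    }

    /** Top numHits docs by value ascending, missing values last, ties broken by doc ID. */
    static List<Integer> topK(int[] matches, String[] docValues, int numHits, boolean prune) {
        NavigableMap<String, List<Integer>> index = termsIndex(docValues);
        Comparator<Integer> order = Comparator
            .comparing((Integer d) -> docValues[d],
                       Comparator.nullsLast(Comparator.<String>naturalOrder()))
            .thenComparing(Comparator.<Integer>naturalOrder());
        // Reversed order so that the head of the queue is the worst hit kept so far.
        PriorityQueue<Integer> queue = new PriorityQueue<>(order.reversed());
        BitSet competitive = null; // stays null until the queue is full
        valuesRead = 0;
        for (int doc : matches) { // matches must be in increasing doc ID order
            if (competitive != null && !competitive.get(doc)) {
                continue; // pruned without reading its sort value
            }
            valuesRead++;
            boolean changed = false;
            if (queue.size() < numHits) {
                queue.add(doc);
                changed = true;
            } else if (order.compare(doc, queue.peek()) < 0) {
                queue.poll();
                queue.add(doc);
                changed = true;
            }
            if (prune && changed && queue.size() == numHits) {
                String bottom = docValues[queue.peek()];
                // Missing bottom value: any doc that has a value still competes.
                // Otherwise only strictly smaller terms do, since later docs lose ties.
                NavigableMap<String, List<Integer>> range =
                    bottom == null ? index : index.headMap(bottom, false);
                competitive = new BitSet(docValues.length);
                for (List<Integer> postings : range.values()) {
                    for (int d : postings) {
                        competitive.set(d);
                    }
                }
            }
        }
        List<Integer> result = new ArrayList<>(queue);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        String[] values = {"m", null, "z", "a", null, "q", "b", "x", "c"};
        int[] matches = {0, 1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(topK(matches, values, 2, false) + " values read: " + valuesRead);
        System.out.println(topK(matches, values, 2, true) + " values read: " + valuesRead);
    }
}
```

The sketch rebuilds the set of competitive documents every time the bottom of the queue changes, which keeps the code short but would be too costly in practice. It also still checks a bit per hit, whereas a real implementation would skip over non-competitive documents entirely.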
I ran luceneutil again to verify performance is still good:
{noformat}
Task                        QPS baseline    StdDev  QPS my_modified_version    StdDev  Pct diff                  p-value
HighSloppyPhrase                   11.46    (4.3%)                    11.19    (5.3%)     -2.4% ( -11% - 7%)     0.120
Prefix3                            53.30   (16.7%)                    52.06   (16.8%)     -2.3% ( -30% - 37%)    0.659
BrowseDateSSDVFacets                5.23   (11.1%)                     5.13   (13.5%)     -1.9% ( -23% - 25%)    0.632
BrowseDayOfYearSSDVFacets          20.33    (7.6%)                    19.96    (8.6%)     -1.9% ( -16% - 15%)    0.470
BrowseMonthTaxoFacets              28.62   (12.0%)                    28.11    (7.8%)     -1.8% ( -19% - 20%)    0.582
OrHighNotLow                     1357.76    (6.3%)                  1334.12    (4.8%)     -1.7% ( -12% - 9%)     0.325
OrHighNotMed                     1568.25    (4.3%)                  1541.21    (4.8%)     -1.7% ( -10% - 7%)     0.232
MedTerm                          2422.95    (5.2%)                  2381.38    (4.6%)     -1.7% ( -10% - 8%)     0.269
HighTerm                         1736.81    (6.5%)                  1710.26    (5.6%)     -1.5% ( -12% - 11%)    0.426
MedSloppyPhrase                    62.45    (3.4%)                    61.59    (4.1%)     -1.4% ( -8% - 6%)      0.249
OrNotHighHigh                     931.81    (5.4%)                   919.74    (4.4%)     -1.3% ( -10% - 8%)     0.403
OrHighHigh                         58.41    (5.3%)                    57.65    (4.1%)     -1.3% ( -10% - 8%)     0.388
OrNotHighMed                     1179.51    (3.0%)                  1168.53    (3.2%)     -0.9% ( -6% - 5%)      0.338
BrowseRandomLabelSSDVFacets        14.52    (1.9%)                    14.40    (1.9%)     -0.8% ( -4% - 3%)      0.186
LowTerm                          1589.67    (3.6%)                  1579.95    (4.6%)     -0.6% ( -8% - 7%)      0.642
MedTermDayTaxoFacets               52.00    (4.3%)                    51.70    (4.3%)     -0.6% ( -8% - 8%)      0.672
OrHighNotHigh                    1008.27    (5.9%)                  1002.78    (5.1%)     -0.5% ( -10% - 11%)    0.756
LowIntervalsOrdered                11.03    (4.8%)                    10.98    (4.4%)     -0.5% ( -9% - 9%)      0.724
OrHighMedDayTaxoFacets             22.72    (3.5%)                    22.64    (3.1%)     -0.4% ( -6% - 6%)      0.718
OrHighLow                         899.20    (3.3%)                   896.35    (3.0%)     -0.3% ( -6% - 6%)      0.750
MedIntervalsOrdered                43.37    (3.6%)                    43.25    (3.7%)     -0.3% ( -7% - 7%)      0.799
HighIntervalsOrdered               24.44    (5.3%)                    24.37    (5.5%)     -0.3% ( -10% - 11%)    0.864
OrNotHighLow                     1448.52    (4.0%)                  1446.40    (3.5%)     -0.1% ( -7% - 7%)      0.901
LowSpanNear                        85.70    (2.4%)                    85.59    (2.2%)     -0.1% ( -4% - 4%)      0.851
AndHighLow                       1043.29    (5.2%)                  1042.26    (3.9%)     -0.1% ( -8% - 9%)      0.946
PKLookup                          236.83    (1.4%)                   236.69    (2.2%)     -0.1% ( -3% - 3%)      0.919
HighTermTitleBDVSort               25.03    (3.5%)                    25.02    (2.6%)     -0.0% ( -5% - 6%)      0.977
Wildcard                          156.78    (1.9%)                   156.93    (1.8%)      0.1% ( -3% - 3%)      0.877
MedSpanNear                       214.11    (4.2%)                   214.32    (2.9%)      0.1% ( -6% - 7%)      0.929
Fuzzy1                            118.50    (1.2%)                   118.67    (0.9%)      0.1% ( -1% - 2%)      0.664
Respell                            59.34    (1.0%)                    59.43    (0.8%)      0.1% ( -1% - 2%)      0.630
Fuzzy2                            115.77    (1.1%)                   116.01    (1.1%)      0.2% ( -1% - 2%)      0.549
LowSloppyPhrase                    89.17    (2.6%)                    89.38    (2.6%)      0.2% ( -4% - 5%)      0.771
HighSpanNear                       31.18    (4.1%)                    31.28    (3.2%)      0.3% ( -6% - 8%)      0.769
OrHighMed                         252.02    (3.5%)                   252.99    (2.5%)      0.4% ( -5% - 6%)      0.692
AndHighMedDayTaxoFacets           151.31    (2.5%)                   152.01    (1.9%)      0.5% ( -3% - 5%)      0.511
HighPhrase                        369.49    (3.8%)                   371.89    (3.3%)      0.6% ( -6% - 8%)      0.564
LowPhrase                          61.86    (3.6%)                    62.30    (2.6%)      0.7% ( -5% - 7%)      0.475
AndHighMed                        227.16    (4.6%)                   228.89    (5.3%)      0.8% ( -8% - 11%)     0.626
TermDTSort                        826.30    (2.0%)                   833.64    (1.7%)      0.9% ( -2% - 4%)      0.139
AndHighHighDayTaxoFacets           24.63    (3.5%)                    24.89    (4.2%)      1.0% ( -6% - 9%)      0.400
MedPhrase                         123.13    (3.4%)                   124.49    (2.5%)      1.1% ( -4% - 7%)      0.243
IntNRQ                            128.88    (4.9%)                   130.32    (3.6%)      1.1% ( -7% - 10%)     0.410
HighTermDayOfYearSort            1443.98    (2.1%)                  1461.31    (2.0%)      1.2% ( -2% - 5%)      0.063
AndHighHigh                        73.59    (4.1%)                    74.62    (5.5%)      1.4% ( -7% - 11%)     0.363
BrowseRandomLabelTaxoFacets        35.66   (11.8%)                    36.30    (6.0%)      1.8% ( -14% - 22%)    0.544
BrowseDayOfYearTaxoFacets          44.76   (13.7%)                    45.89    (4.5%)      2.5% ( -13% - 24%)    0.434
BrowseDateTaxoFacets               43.26   (13.4%)                    44.53    (4.8%)      2.9% ( -13% - 24%)    0.354
BrowseMonthSSDVFacets              21.46    (6.7%)                    22.22   (10.8%)      3.6% ( -13% - 22%)    0.212
HighTermTitleSort                 115.84    (9.7%)                   544.03   (15.6%)    369.6% ( 313% - 437%)   0.000
HighTermMonthSort                 153.92    (9.5%)                  3899.90  (100.4%)   2433.8% (2122% - 2810%)  0.000
{noformat}
If you wonder why the speedup for HighTermTitleSort is lower now: previously
the tasks file for wikimedium10m didn't include queries that sort on the title
field, so I had added a single query on the HighTerm that had the highest
document frequency. Now wikimedium10m has sorting tasks for all HighTerms,
including some that have a significantly lower document frequency. This is
also why the baseline is faster compared to the previous run: there are fewer
hits to sort. This optimization works better with queries that have lots of
matches. For queries that don't match many documents, on sort fields that have
a high cardinality, it may even make queries a bit slower.
Here's the performance difference for one of the low-frequency terms
({{fantasy}}):
{noformat}
Task                        QPS baseline    StdDev  QPS my_modified_version    StdDev  Pct diff                  p-value
LowTermTitleSort                  434.21    (7.0%)                   427.76    (8.0%)     -1.5% ( -15% - 14%)    0.534
LowTermMonthSort                  770.08    (5.9%)                  1300.48   (21.2%)     68.9% ( 39% - 102%)    0.000
{noformat}
In general I think it's a good trade-off: the slower queries are getting
faster, and queries that are already fast may get just a little bit slower.
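To put rough numbers on this trade-off, the sketch below recomputes the relative change from the QPS columns above and converts QPS to an approximate per-query cost. It is only an illustration: the class and method names are hypothetical, and treating the cost of a query as 1 / QPS is a simplifying assumption. Since the QPS values in the tables are rounded, the recomputed percentages may differ from the reported ones in the last digit.

```java
import java.util.Locale;

/** Back-of-the-envelope arithmetic on the QPS figures reported above. */
class QpsTradeOff {
    /** Percentage change of the modified QPS relative to the baseline QPS. */
    static double pctDiff(double baselineQps, double modifiedQps) {
        return (modifiedQps - baselineQps) / baselineQps * 100.0;
    }

    /** Rough per-query cost in milliseconds, assuming cost is about 1 / QPS. */
    static double msPerQuery(double qps) {
        return 1000.0 / qps;
    }

    public static void main(String[] args) {
        String[] tasks = {"HighTermTitleSort", "LowTermTitleSort", "LowTermMonthSort"};
        // {baseline QPS, modified QPS}, copied from the tables above.
        double[][] qps = {{115.84, 544.03}, {434.21, 427.76}, {770.08, 1300.48}};
        for (int i = 0; i < tasks.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%-18s %+8.1f%%  %.2f ms -> %.2f ms",
                tasks[i],
                pctDiff(qps[i][0], qps[i][1]),
                msPerQuery(qps[i][0]),
                msPerQuery(qps[i][1])));
        }
    }
}
```

Under that assumption, the slow HighTermTitleSort queries save several milliseconds each, while the already fast LowTermTitleSort queries lose only a small fraction of a millisecond.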
> Dynamic pruning for queries sorted by SORTED(_SET) field
> --------------------------------------------------------
>
> Key: LUCENE-10633
> URL: https://issues.apache.org/jira/browse/LUCENE-10633
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> LUCENE-9280 introduced the ability to dynamically prune non-competitive hits
> when sorting by a numeric field, by leveraging the points index to skip
> documents that do not compare better than the top of the priority queue
> maintained by the field comparator.
> However queries sorted by a SORTED(_SET) field still look at all hits, which
> is disappointing. Could we leverage the terms index to skip hits?