[
https://issues.apache.org/jira/browse/LUCENE-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7641:
---------------------------------
Attachment: LUCENE-7641.patch
Here is a patch that does not compute the inverse cost and folds in the two
ideas that I mentioned above to make point count estimation more accurate.
I also ran a benchmark with luceneutil10m:
{noformat}
TaskQPS baseline StdDev QPS patch StdDev
Pct diff
Fuzzy1 190.46 (7.7%) 176.23 (12.6%)
-7.5% ( -25% - 13%)
Respell 277.90 (3.6%) 275.11 (3.0%)
-1.0% ( -7% - 5%)
MedSpanNear 150.45 (2.5%) 149.27 (3.1%)
-0.8% ( -6% - 4%)
Wildcard 112.61 (6.4%) 111.76 (5.9%)
-0.8% ( -12% - 12%)
OrNotHighHigh 60.90 (3.6%) 60.51 (4.3%)
-0.6% ( -8% - 7%)
AndHighLow 1168.30 (3.9%) 1160.93 (3.8%)
-0.6% ( -8% - 7%)
OrHighMed 41.35 (5.1%) 41.16 (5.2%)
-0.4% ( -10% - 10%)
OrHighNotMed 69.18 (3.9%) 68.94 (4.1%)
-0.3% ( -8% - 8%)
OrHighNotHigh 47.52 (3.8%) 47.39 (4.1%)
-0.3% ( -7% - 7%)
LowTerm 691.05 (4.1%) 689.34 (4.5%)
-0.2% ( -8% - 8%)
HighSloppyPhrase 20.35 (3.1%) 20.31 (3.1%)
-0.2% ( -6% - 6%)
HighSpanNear 27.30 (2.0%) 27.25 (1.8%)
-0.2% ( -3% - 3%)
LowPhrase 101.02 (2.2%) 100.92 (2.8%)
-0.1% ( -4% - 4%)
Fuzzy2 68.31 (5.8%) 68.25 (7.1%)
-0.1% ( -12% - 13%)
LowSpanNear 170.99 (2.7%) 170.87 (3.4%)
-0.1% ( -6% - 6%)
LowSloppyPhrase 55.34 (1.9%) 55.30 (2.1%)
-0.1% ( -4% - 4%)
OrHighLow 84.81 (3.8%) 84.81 (4.1%)
0.0% ( -7% - 8%)
AndHighMed 294.21 (2.3%) 294.28 (2.5%)
0.0% ( -4% - 4%)
Prefix3 32.27 (8.8%) 32.28 (8.0%)
0.0% ( -15% - 18%)
HighTermDayOfYearSort 65.88 (2.4%) 65.92 (2.8%)
0.1% ( -5% - 5%)
HighPhrase 16.67 (2.1%) 16.70 (1.8%)
0.1% ( -3% - 4%)
MedPhrase 82.73 (3.2%) 82.86 (3.0%)
0.2% ( -5% - 6%)
MedSloppyPhrase 42.22 (2.1%) 42.29 (2.3%)
0.2% ( -4% - 4%)
OrNotHighLow 889.62 (3.3%) 891.74 (4.9%)
0.2% ( -7% - 8%)
OrHighHigh 32.35 (5.2%) 32.43 (5.2%)
0.3% ( -9% - 11%)
OrNotHighMed 158.41 (3.3%) 158.83 (3.6%)
0.3% ( -6% - 7%)
AndHighHigh 47.84 (1.4%) 47.98 (1.8%)
0.3% ( -2% - 3%)
MedTerm 240.59 (4.1%) 241.51 (3.6%)
0.4% ( -7% - 8%)
OrHighNotLow 111.19 (4.7%) 111.65 (5.3%)
0.4% ( -9% - 10%)
HighTerm 100.65 (4.3%) 101.63 (4.6%)
1.0% ( -7% - 10%)
HighTermMonthSort 140.07 (11.1%) 141.91 (10.5%)
1.3% ( -18% - 25%)
IntNRQ10 76.64 (9.9%) 80.71 (2.8%)
5.3% ( -6% - 19%)
IntNRQ50 17.16 (11.4%) 18.11 (2.7%)
5.6% ( -7% - 22%)
IntNRQ25 33.52 (11.0%) 35.61 (3.0%)
6.2% ( -6% - 22%)
IntNRQ 12.62 (11.8%) 15.84 (4.4%)
25.6% ( 8% - 47%)
IntNRQ75 11.57 (11.7%) 15.18 (4.7%)
31.2% ( 13% - 54%)
IntNRQ90 9.71 (11.6%) 13.84 (5.6%)
42.6% ( 22% - 67%)
{noformat}
This looks like a good speedup to me.
The ranges that are followed by a number are some additional ranges that I
added in order to see the impact based on the percentage of matched documents.
For instance {{IntNRQ75}} has ranges that match 75% of documents. Here are the
tasks that I appended to {{wikimedium.10M.nostopwords.tasks}} if you would like
to reproduce:
{noformat}
IntNRQ10: nrq//timesecnum 6207 15563
IntNRQ10: nrq//timesecnum 53 8742
IntNRQ10: nrq//timesecnum 77160 84811
IntNRQ25: nrq//timesecnum 8899 36335
IntNRQ25: nrq//timesecnum 68109 86101
IntNRQ25: nrq//timesecnum 44771 64123
IntNRQ50: nrq//timesecnum 101 51223
IntNRQ50: nrq//timesecnum 25811 69953
IntNRQ50: nrq//timesecnum 46298 82710
IntNRQ75: nrq//timesecnum 13867 79886
IntNRQ75: nrq//timesecnum 8830 75922
IntNRQ75: nrq//timesecnum 17003 82111
IntNRQ90: nrq//timesecnum 1810 80159
IntNRQ90: nrq//timesecnum 6160 84811
IntNRQ90: nrq//timesecnum 8210 86330
{noformat}
It is interesting that small ranges seem to also benefit from a small speedup
(I reran the benchmark to confirm and it looks reproducible). I am unsure why,
but suspect that the fact that we use a different code path for large ranges
makes things easier for the JVM to optimize on small ranges (?).
> Speed up point ranges that match most documents
> -----------------------------------------------
>
> Key: LUCENE-7641
> URL: https://issues.apache.org/jira/browse/LUCENE-7641
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7461.patch, LUCENE-7641.patch
>
>
> If a point range matches most documents and every document has exactly one
> value, then we could make things faster by computing the set of documents
> that do NOT match the range instead.
> It was not possible until recently since figuring out whether a range query
> matches most documents was not possible, but we can now use the new
> {{PointValues.estimatePointcount}} API to do that: we could just check
> whether the cost of the inverse visitor is lower than the cost of the regular
> range visitor.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]