[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

Adrien Grand (Jira) Tue, 18 May 2021 11:21:04 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347104#comment-17347104 ]


Adrien Grand edited comment on LUCENE-9335 at 5/18/21, 6:20 PM:
----------------------------------------------------------------

The speedup for some of the slower queries looks great. I know Fuzzy1 and 
Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your 
change makes them faster?

I wanted to do some more tests so I played with the MSMARCO passages dataset, 
which has the interesting property of having queries that have several terms 
(often around 8-10). See the attached benchmark (MSMarcoPassages.java) if you 
are interested. Here are the outputs I'm getting for the various scorers:

BMW
{noformat}
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
{noformat}

BMM - scorer
{noformat}
AVG: 4.1779829712E7
Median: 28701530
P75: 57780117
P90: 103794862
P95: 130582282
P99: 215559175
Collected AVG: 460.482
Collected Median: 143
Collected P75: 158
Collected P90: 180
Collected P95: 2316
Collected P99: 7277
{noformat}

BMM - bulk scorer
{noformat}
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
{noformat}
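
For readers who want to produce summaries in the same shape as the outputs 
above, here is a minimal sketch of how AVG, Median and P75 to P99 could be 
computed from a list of per-query measurements. It assumes nearest-rank 
percentiles; the attached MSMarcoPassages.java may well compute them 
differently, and the class name, method names and sample values below are made 
up for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper; not taken from the attached MSMarcoPassages.java. */
final class SummaryStats {

  /** Arithmetic mean of the samples. */
  static double average(long[] samples) {
    double sum = 0;
    for (long s : samples) {
      sum += s;
    }
    return sum / samples.length;
  }

  /**
   * Nearest-rank percentile (an assumption): the smallest sample such that at
   * least p percent of all samples are less than or equal to it.
   */
  static long percentile(long[] samples, int p) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    // 1-based rank = ceil(p * n / 100), computed with integer arithmetic.
    int rank = (int) (((long) p * sorted.length + 99) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }

  public static void main(String[] args) {
    long[] samples = {5, 1, 9, 3, 7, 2, 8, 4, 10, 6}; // made-up values
    System.out.println("AVG: " + average(samples));
    System.out.println("Median: " + percentile(samples, 50));
    for (int p : new int[] {75, 90, 95, 99}) {
      System.out.println("P" + p + ": " + percentile(samples, p));
    }
  }
}
```

With only ten samples, P95 and P99 both resolve to the largest value, which is 
one reason tail percentiles are only meaningful over a large query set.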

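Contrary to my intuition, WAND (the BMW run above) seems to perform better 
despite the high number of terms: its AVG, Median and P75 to P99 values are all 
lower than those of both BMM variants. I wonder if there are some improvements 
we can still make to BMM?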
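
One plausible reason to expect MaxScore-style scorers such as BMM to cope well 
with many terms is that, as the minimum competitive score rises, more 
low-impact terms become non-essential and no longer drive iteration. The sketch 
below shows only that textbook partitioning step, assuming each term has a 
single global maximum score. It is not Lucene's BMM implementation, and the 
class, method and values are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical illustration of the classic MaxScore term partition. */
final class MaxScorePartition {

  /**
   * Returns how many of the lowest-impact terms are non-essential: the longest
   * prefix, in ascending max-score order, whose max scores sum to strictly
   * less than minCompetitiveScore. A document that matches only those terms
   * cannot reach the minimum competitive score.
   */
  static int countNonEssential(float[] maxScores, float minCompetitiveScore) {
    float[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double sum = 0;
    int count = 0;
    for (float maxScore : sorted) {
      if (sum + maxScore >= minCompetitiveScore) {
        break; // adding this term could produce a competitive document
      }
      sum += maxScore;
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    // Made-up per-term maximum scores for an 8-term query.
    float[] maxScores = {0.5f, 2.0f, 0.25f, 1.0f, 3.0f, 0.25f, 0.5f, 1.5f};
    for (float threshold : new float[] {0f, 1f, 3f, 6f}) {
      System.out.println("minCompetitiveScore=" + threshold
          + " -> non-essential terms: " + countNonEssential(maxScores, threshold));
    }
  }
}
```

In this toy query the number of non-essential terms grows from 0 to 6 of 8 as 
the threshold rises from 0 to 6. Whether that translates into faster queries 
depends on how cheaply the remaining terms can be advanced and scored.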


> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more. I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.
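
To illustrate what "fewer per-document decisions" could mean, here is a toy 
sketch of windowed scoring: each term's postings are consumed in turn for a 
window of doc IDs and scores are accumulated into an array, rather than 
comparing clauses against each other for every document. It does not use 
Lucene's API and leaves out dynamic pruning entirely; Postings, HitCollector 
and WINDOW_SIZE are hypothetical names.

```java
import java.util.List;

/** Hypothetical toy, not Lucene code: scores a disjunction window by window. */
final class WindowedDisjunctionSketch {

  static final int WINDOW_SIZE = 8; // tiny on purpose, to keep the demo readable

  /** Receives each matching document with its summed score. */
  interface HitCollector {
    void collect(int doc, float score);
  }

  /** One term's postings: ascending doc IDs and the score each contributes. */
  static final class Postings {
    final int[] docs;
    final float[] scores;

    Postings(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
    }
  }

  /** Scores all documents in [0, maxDoc); every doc ID must be below maxDoc. */
  static void scoreAll(List<Postings> terms, int maxDoc, HitCollector collector) {
    int[] cursor = new int[terms.size()]; // next unread posting per term
    float[] windowScores = new float[WINDOW_SIZE];
    boolean[] matched = new boolean[WINDOW_SIZE];

    for (int base = 0; base < maxDoc; base += WINDOW_SIZE) {
      int end = Math.min(base + WINDOW_SIZE, maxDoc);

      // One tight loop per term; no cross-term comparison per document.
      for (int t = 0; t < terms.size(); t++) {
        Postings p = terms.get(t);
        int i = cursor[t];
        while (i < p.docs.length && p.docs[i] < end) {
          int slot = p.docs[i] - base;
          windowScores[slot] += p.scores[i];
          matched[slot] = true;
          i++;
        }
        cursor[t] = i;
      }

      // Emit the window's hits in doc ID order and reset the slots for reuse.
      for (int slot = 0; slot < end - base; slot++) {
        if (matched[slot]) {
          collector.collect(base + slot, windowScores[slot]);
          windowScores[slot] = 0f;
          matched[slot] = false;
        }
      }
    }
  }

  public static void main(String[] args) {
    List<Postings> terms = List.of(
        new Postings(new int[] {1, 3, 9}, new float[] {1f, 1f, 1f}),
        new Postings(new int[] {3, 10}, new float[] {2f, 0.5f}));
    scoreAll(terms, 12, (doc, score) ->
        System.out.println("doc=" + doc + " score=" + score));
  }
}
```

A real bulk scorer with dynamic pruning would additionally need to decide, per 
window, which terms can be skipped given the current minimum competitive score; 
that part is deliberately not shown here.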



