[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347104#comment-17347104 ]
Adrien Grand edited comment on LUCENE-9335 at 5/18/21, 6:20 PM:
----------------------------------------------------------------

The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster?

I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers:

BMW
{noformat}
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
{noformat}

BMM - scorer
{noformat}
AVG: 4.1779829712E7
Median: 28701530
P75: 57780117
P90: 103794862
P95: 130582282
P99: 215559175
Collected AVG: 460.482
Collected Median: 143
Collected P75: 158
Collected P90: 180
Collected P95: 2316
Collected P99: 7277
{noformat}

BMM - bulk scorer
{noformat}
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
{noformat}

Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM?

> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, wikimedium.10M.nostopwords.tasks.5OrMeds
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and
> PISA at [https://tantivy-search.github.io/bench/] or against research
> prototypes in Table 1 of
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for
> benchmarking, it would be nice to optimize this case a bit more, I suspect
> that we could make fewer per-document decisions by implementing a BulkScorer
> instead of a Scorer.
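
To illustrate what "fewer per-document decisions" could mean in the issue description, here is a minimal sketch of window-at-a-time scoring of a disjunction. It is plain Java, not Lucene's BulkScorer API: the `Postings` and `Collector` types, the `scoreAll` method and the window size of 8 are all hypothetical names and values chosen for illustration, and the sketch deliberately leaves out dynamic pruning. Each clause is drained into a per-window accumulator in its own tight loop, so the cross-clause merge decision happens once per window rather than once per document.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: hypothetical types, not Lucene's BulkScorer API. */
final class WindowedDisjunctionSketch {

    /** One term's postings: sorted doc IDs and the score the term adds to each. */
    static final class Postings {
        final int[] docs;
        final float[] scores;
        int pos = 0; // cursor into docs; only ever moves forward

        Postings(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }
    }

    /** Receives every matching doc together with its summed score. */
    interface Collector {
        void collect(int doc, float score);
    }

    static final int WINDOW = 8; // assumed value, kept small for readability

    /** Scores the disjunction of all terms over doc IDs in [0, maxDoc). */
    static void scoreAll(Postings[] terms, int maxDoc, Collector collector) {
        float[] acc = new float[WINDOW];
        boolean[] matched = new boolean[WINDOW];
        for (int base = 0; base < maxDoc; base += WINDOW) {
            int end = Math.min(base + WINDOW, maxDoc);
            // Term-at-a-time inside the window: one tight loop per clause.
            for (Postings t : terms) {
                while (t.pos < t.docs.length && t.docs[t.pos] < end) {
                    int slot = t.docs[t.pos] - base;
                    acc[slot] += t.scores[t.pos];
                    matched[slot] = true;
                    t.pos++;
                }
            }
            // Flush the window in doc ID order and reset the used slots.
            for (int slot = 0; slot < end - base; slot++) {
                if (matched[slot]) {
                    collector.collect(base + slot, acc[slot]);
                    acc[slot] = 0f;
                    matched[slot] = false;
                }
            }
        }
    }

    public static void main(String[] args) {
        Postings a = new Postings(new int[] {1, 9, 17}, new float[] {1f, 1f, 1f});
        Postings b = new Postings(new int[] {9, 10}, new float[] {2f, 2f});
        List<String> hits = new ArrayList<>();
        scoreAll(new Postings[] {a, b}, 20, (doc, score) -> hits.add(doc + ":" + score));
        System.out.println(hits);
    }
}
```

A pruning variant would additionally need per-term score upper bounds and the collector's current minimum competitive score, so that whole windows or clauses could be skipped; that part is not shown here.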