[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344973#comment-17344973 ]
Zach Chen commented on LUCENE-9335:
-----------------------------------

Just want to provide a quick summary of the latest progress on this issue. Currently there are 3 different BMM implementations from 2 PRs:

# Scorer based implementation
## PR: [https://github.com/apache/lucene/pull/101]
## wikibigall benchmark results: [https://github.com/apache/lucene/pull/101#issuecomment-840255508]
### On average it improves _OrHighHigh_ by 40%+, and _OrHighMed_ by around 20%
### In 1 out of 3 runs it hurt _AndMedOrHighHigh_ and _OrHighMed_ performance by around 16%
# BulkScorer based implementation with fixed window size
## PR: [https://github.com/apache/lucene/pull/113]
## wikibigall benchmark results with window size 1024: [https://github.com/apache/lucene/pull/113#issuecomment-840293637]
### On average it improves _OrHighHigh_ by 3-8%, and _OrHighMed_ by 23%+
### For some reason it hurt _Fuzzy1_ & _Fuzzy2_ performance by around 8%, even though it wasn't used for those queries
# BulkScorer based implementation without window, using the scorer implementation from #1
## Commit: [https://github.com/zacharymorn/lucene/commit/3bcdbb31a7d55b00cb53e4be40a4adc93b9f30db]
## wikibigall benchmark results: [https://github.com/apache/lucene/pull/113#discussion_r631568912]
### On average it improves _OrHighHigh_ by 52%, and _OrHighMed_ by 10%-18%
### For some reason it consistently hurt _Fuzzy1_ & _Fuzzy2_ performance by around 8%-13%, even though it wasn't used for those queries

[~jpountz] what do you think about the above results and the latest changes, and are there any other ideas we should try? From the current results it appears option 1 might be the one to go with? I can start working on productizing the changes and adding tests once we have settled on the implementation approach here.
> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: wikimedium.10M.nostopwords.tasks, wikimedium.10M.nostopwords.tasks.5OrMeds
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and
> PISA at [https://tantivy-search.github.io/bench/] or against research
> prototypes in Table 1 of
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
> Given that top-level disjunctions of term queries are commonly used for
> benchmarking, it would be nice to optimize this case a bit more, I suspect
> that we could make fewer per-document decisions by implementing a BulkScorer
> instead of a Scorer.
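
As a rough illustration of the "fewer per-document decisions" idea from the issue description, here is a toy Python sketch of windowed top-k scoring for a disjunction. It is not Lucene code and not the implementation from either PR: the function name `top_k_windowed`, the dict-based postings, and the on-the-fly computation of per-window maxima are all assumptions made for illustration. A real engine would presumably read precomputed per-block maximum scores rather than scanning the postings to find them.

```python
import heapq
from collections import defaultdict


def top_k_windowed(postings, k, window_size, num_docs):
    """Toy windowed top-k for a disjunction (hypothetical helper, not a Lucene API).

    postings: one dict per term mapping doc_id -> that term's score for the doc.
    Returns (hits sorted by descending score, number of windows actually scored).
    """
    heap = []  # min-heap of (score, doc_id) holding the current top k
    windows_scored = 0
    for start in range(0, num_docs, window_size):
        end = min(start + window_size, num_docs)
        # Upper bound for any doc in this window: the sum over terms of each
        # term's maximum score inside the window. Computed by scanning here;
        # assumed to come from precomputed metadata in a real engine.
        upper_bound = sum(
            max((s for d, s in term.items() if start <= d < end), default=0.0)
            for term in postings
        )
        # Minimum competitive score: the k-th best score collected so far.
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if upper_bound <= threshold:
            continue  # a single decision skips every document in the window
        windows_scored += 1
        # Score all matching docs in the window in bulk.
        totals = defaultdict(float)
        for term in postings:
            for d, s in term.items():
                if start <= d < end:
                    totals[d] += s
        for d, score in totals.items():
            if len(heap) < k:
                heapq.heappush(heap, (score, d))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True), windows_scored


# Two terms over 12 docs; with k=1 only the first window needs scoring.
postings = [{0: 3.0, 1: 1.0, 9: 0.5}, {0: 2.0, 5: 0.2, 9: 0.4}]
hits, scored = top_k_windowed(postings, k=1, window_size=4, num_docs=12)
```

In this sketch the pruning check runs once per window instead of once per document, which is the trade-off the fixed-window variant above explores. The window size is a free parameter here and the sketch makes no claim about which value performs best.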