[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873 ]

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--------------------------------------------------------------

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, I think WANDScorer is actually not used there ([it doesn't get rewritten to BooleanQuery most of the time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing a Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!

was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, I think WANDScorer is not used there. In addition, the PR is already doing a Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!

> CombinedFieldsQuery needs dynamic pruning support
> -------------------------------------------------
>
>                 Key: LUCENE-10061
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10061
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore,
> forcing Lucene to collect all matches in order to figure out the top-k hits.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
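
The issue description says that without advanceShallow/getMaxScore Lucene has to collect every match. As a rough illustration of what those score upper bounds make possible, the following is a minimal toy sketch, not Lucene's Scorer API: it assumes the per-block maximum scores are already known and skips every block whose upper bound is below the current minimum competitive score. The class name, the method name and the sample numbers are all hypothetical.

```java
// Toy illustration only: hypothetical names, not Lucene's Scorer API.
final class BlockMaxSkipSketch {

    /**
     * Returns the index of the first block at or after {@code from} whose score
     * upper bound can still reach {@code minCompetitiveScore}, or
     * {@code blockMaxScores.length} if no such block remains.
     */
    static int nextCompetitiveBlock(float[] blockMaxScores, int from, float minCompetitiveScore) {
        int block = from;
        while (block < blockMaxScores.length && blockMaxScores[block] < minCompetitiveScore) {
            block++; // no doc in this block can enter the top-k, so skip it whole
        }
        return block;
    }

    public static void main(String[] args) {
        // Assumed per-block score upper bounds and an assumed top-k threshold.
        float[] blockMaxScores = {1.2f, 0.4f, 0.3f, 2.5f, 0.9f};
        float minCompetitiveScore = 1.0f;
        for (int b = nextCompetitiveBlock(blockMaxScores, 0, minCompetitiveScore);
                b < blockMaxScores.length;
                b = nextCompetitiveBlock(blockMaxScores, b + 1, minCompetitiveScore)) {
            System.out.println("score docs in block " + b);
        }
    }
}
```

In a real scorer the threshold rises as better hits are collected, so more blocks become skippable over time; the sketch keeps it fixed for simplicity.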
[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873 ]

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--------------------------------------------------------------

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, I think WANDScorer is actually not used there ([it very much doesn't get re-written to BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing a Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!

was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, I think WANDScorer is actually not used there ([it doesn't get rewritten to BooleanQuery most of the time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing a Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!
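
To make the boost discussion above more concrete, here is a minimal sketch of BM25F-style scoring, in which per-field term frequencies are combined by field weight before BM25 saturation is applied. It is not Lucene's CombinedFieldQuery code: the class and method names, the parameters k1 = 1.2 and b = 0.75, the weighted combination of field lengths, and the toy statistics are all assumptions made for illustration.

```java
// Hypothetical sketch of BM25F-style scoring; not Lucene's implementation.
final class Bm25fSketch {
    static final double K1 = 1.2;  // assumed saturation parameter
    static final double B = 0.75;  // assumed length-normalization parameter

    /** Weighted sum of per-field values (term frequencies or field lengths). */
    static double combine(double[] weights, double[] values) {
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * values[i];
        }
        return sum;
    }

    /** BM25 saturation applied to an already combined frequency and length. */
    static double tfPart(double freq, double docLen, double avgDocLen) {
        double norm = K1 * (1 - B + B * docLen / avgDocLen);
        return freq / (freq + norm);
    }

    /** Score of one term for one document across all weighted fields. */
    static double score(double idf, double[] weights, double[] freqs,
                        double[] lens, double[] avgLens) {
        double freq = combine(weights, freqs);
        double len = combine(weights, lens);
        double avgLen = combine(weights, avgLens); // assumption: averages combine like lengths
        return idf * tfPart(freq, len, avgLen);
    }

    public static void main(String[] args) {
        double[] freqs = {1, 0};   // term occurs once in title, never in body
        double[] lens = {5, 100};  // toy field lengths, equal to the averages
        System.out.println("title/body = 4/2:  " + score(1.0, new double[] {4, 2}, freqs, lens, lens));
        System.out.println("title/body = 20/1: " + score(1.0, new double[] {20, 1}, freqs, lens, lens));
    }
}
```

Under these assumptions a title-only match scores higher with the 20/1 weights than with 4/2, which is one reason different boost ratios could change how often score upper bounds allow documents to be skipped.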
[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028 ]

Zach Chen edited comment on LUCENE-10061 at 11/5/21, 4:50 AM:
--------------------------------------------------------------

Hi [~jpountz], I've implemented a quick optimization to replace combinatorial calculation with an upper-bound approximation ([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]). With this and other bug fixes / optimizations based on CPU profiler, I was able to get the following performance test results (perf test index rebuilt to enable norm for title field, task file attached, and luceneutil integration available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
# Run 1
Task              QPS baseline    StdDev   QPS my_modified_version    StdDev   Pct diff               p-value
CFQHighHighHigh           4.64    (6.5%)                      3.30    (4.7%)   -29.0% (-37% - -19%)   0.000
CFQHighHigh              11.09    (6.0%)                      9.61    (6.0%)   -13.3% (-23% -  -1%)   0.000
PKLookup                103.38    (4.4%)                    108.04    (4.3%)     4.5% ( -4% -  13%)   0.001
CFQHighMedLow            10.58    (6.1%)                     12.30    (8.7%)    16.2% (  1% -  33%)   0.000
CFQHighMed               10.70    (7.4%)                     15.51   (11.2%)    44.9% ( 24% -  68%)   0.000
CFQHighLowLow             8.18    (8.2%)                     12.87   (11.6%)    57.3% ( 34% -  84%)   0.000
CFQHighLow               14.57    (7.5%)                     30.81   (15.1%)   111.4% ( 82% - 144%)   0.000

# Run 2
Task              QPS baseline    StdDev   QPS my_modified_version    StdDev   Pct diff               p-value
CFQHighHighHigh           5.33    (5.7%)                      4.02    (7.7%)   -24.4% (-35% - -11%)   0.000
CFQHighLowLow            17.14    (6.2%)                     13.06    (5.4%)   -23.8% (-33% - -13%)   0.000
CFQHighMed               17.37    (5.8%)                     14.38    (7.7%)   -17.2% (-29% -  -3%)   0.000
PKLookup                103.57    (5.5%)                    108.84    (5.9%)     5.1% ( -6% -  17%)   0.005
CFQHighMedLow            11.25    (7.2%)                     12.70    (9.0%)    12.9% ( -3% -  31%)   0.000
CFQHighHigh               5.00    (6.2%)                      7.54   (12.1%)    51.0% ( 30% -  73%)   0.000
CFQHighLow               21.60    (5.2%)                     34.57   (14.1%)    60.0% ( 38% -  83%)   0.000

# Run 3
Task              QPS baseline    StdDev   QPS my_modified_version    StdDev   Pct diff               p-value
CFQHighHighHigh           5.40    (6.9%)                      4.06    (5.1%)   -24.8% (-34% - -13%)   0.000
CFQHighMedLow             7.64    (7.4%)                      5.79    (6.3%)   -24.2% (-35% - -11%)   0.000
CFQHighHigh              11.11    (7.0%)                      9.60    (5.9%)   -13.6% (-24% -   0%)   0.000
CFQHighLowLow            21.21    (7.6%)                     21.22    (6.6%)     0.0% (-13% -  15%)   0.993
PKLookup                103.15    (5.9%)                    107.60    (6.9%)     4.3% ( -8% -  18%)   0.034
CFQHighLow               21.85    (8.1%)                     34.18   (13.5%)    56.4% ( 32% -  84%)   0.000
CFQHighMed               12.07    (8.4%)                     19.98   (16.7%)    65.5% ( 37% -  98%)   0.000

# Run 4
Task              QPS baseline    StdDev   QPS my_modified_version    StdDev   Pct diff               p-value
CFQHighHigh               8.50    (5.8%)                      6.85    (5.2%)   -19.5% (-28% -  -8%)   0.000
CFQHighMedLow            10.89    (5.7%)                      8.96    (5.4%)   -17.8% (-27% -  -7%)   0.000
CFQHighMed                8.41    (5.8%)                      7.74    (5.6%)    -7.9% (-18% -   3%)   0.000
CFQHighHighHigh           3.45    (6.7%)                      3.38    (5.3%)    -2.0% (-13% -  10%)   0.287
CFQHighLowLow             7.82    (6.4%)                      8.20    (7.5%)     4.8% ( -8% -  20%)   0.030
PKLookup                103.50    (5.0%)                    110.69    (5.4%)     6.9% ( -3% -  18%)   0.000
CFQHighLow               11.46    (6.0%)                     13.16    (6.7%)    14.8% (  1% -  29%)   0.000
{code}
I think overall this shows that the pruning will be most effective when there's a significant difference between terms' frequencies, but will slow things down if they are close, as the cost of pruning outweighs the efficacy of skipping. I'm wondering if we should then gate the pruning by checking the frequencies as well, but from some quick trials that seems to be an expensive operation? Do you have any recommendation for this scenario?

was (Author: zacharymorn): Hi [~jpountz], I've implemented a quick optimization to replace combinatorial calculation with an upper-bound approximation ([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]) . With this and other bug fixes / optimizations based on CPU profiler, I was able to get the following