[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570417#comment-17570417 ]

Zach Chen commented on LUCENE-10480:

From the latest nightly benchmark results, the negative impact on nested boolean queries has been resolved, and the performance boost to top-level disjunction queries has been maintained. Thanks for all the guidance, [~jpountz]!

> Specialize 2-clauses disjunctions
> ---------------------------------
>
>                 Key: LUCENE-10480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10480
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Assignee: Zach Chen
>            Priority: Minor
>          Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its
> invariants: one linked list for the current candidates, one priority queue of
> scorers that are behind, another one for scorers that are ahead. All this
> could be simplified in the 2-clauses case, which feels worth specializing for
> as it's very common that end users enter queries that only have two terms?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
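The motivation in the issue description can be made concrete with a small sketch: with exactly two sub-clauses, a plain comparison of two cursors replaces WANDScorer's linked list and priority queues. The class below is a hypothetical, standalone illustration (plain int arrays instead of Lucene's DocIdSetIterator, all names invented), not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a disjunction over exactly two clauses needs no heap
// machinery; it just tracks one cursor per clause and compares them directly.
final class TwoClauseDisjunction {
    private final int[] a, b; // doc IDs of each clause, in ascending order
    private int ia = 0, ib = 0;

    TwoClauseDisjunction(int[] a, int[] b) { this.a = a; this.b = b; }

    /** Returns the next doc ID matching either clause, or -1 when exhausted. */
    int nextDoc() {
        int da = ia < a.length ? a[ia] : Integer.MAX_VALUE;
        int db = ib < b.length ? b[ib] : Integer.MAX_VALUE;
        if (da == Integer.MAX_VALUE && db == Integer.MAX_VALUE) return -1;
        int doc = Math.min(da, db);
        // A single two-way comparison replaces the re-heapify step that the
        // general N-clause scorer pays on every advance.
        if (da == doc) ia++;
        if (db == doc) ib++;
        return doc;
    }

    /** Demonstration helper: materializes the union of both postings lists. */
    static List<Integer> union(int[] a, int[] b) {
        TwoClauseDisjunction d = new TwoClauseDisjunction(a, b);
        List<Integer> docs = new ArrayList<>();
        for (int doc = d.nextDoc(); doc != -1; doc = d.nextDoc()) docs.add(doc);
        return docs;
    }
}
```

In the real scorer each clause would also carry a per-block max score used for skipping; the sketch only shows the iteration part that the specialization simplifies.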
[jira] [Resolved] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zach Chen resolved LUCENE-10480.
    Assignee: Zach Chen
    Resolution: Done
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149 ]

Zach Chen edited comment on LUCENE-10480 at 7/13/22 5:09 AM:

{quote}I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level disjunctions in *BooleanWeight* or *Boolean2ScorerSupplier*, but they didn't work due to the recursive logic of weights / queries. So I ended up wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], pending test updates), like your other PR. Please let me know if this approach looks good to you, or if there's a better one.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149 ]

Zach Chen commented on LUCENE-10480:

{quote}I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level disjunctions in *BooleanWeight* or *Boolean2ScorerSupplier*, but they didn't work due to the recursive logic of weights / queries. So I ended up wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], pending test updates), like your other PR. Please let me know if this approach looks good to you, or if there's a better one.
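The wrapping trick discussed above (confining the specialization to top-level disjunctions by exposing it through a bulk scorer) can be sketched as follows. This is a hypothetical, simplified illustration with invented interfaces, not Lucene's real Scorer/BulkScorer API: the point is that only the code path building a top-level bulk scorer hands out the wrapper, so nested boolean clauses keep the default per-doc scorer.

```java
import java.util.List;

// Invented, simplified stand-ins for Lucene's Scorer/BulkScorer abstractions.
interface DocScorer {
    int nextDoc();   // next matching doc ID, or -1 when exhausted
    float score();   // score of the current doc
}

// A bulk scorer drives collection over the whole doc ID range itself.
// Wrapping a scorer this way confines it to top-level execution, because
// only top-level scoring goes through the bulk-scorer entry point; nested
// clauses are consumed doc-at-a-time and never see the wrapper.
final class TopLevelOnlyBulkScorer {
    private final DocScorer in;

    TopLevelOnlyBulkScorer(DocScorer in) { this.in = in; }

    /** Collects every hit into {@code hits} and returns the summed score. */
    float scoreAll(List<Integer> hits) {
        float total = 0f;
        for (int doc = in.nextDoc(); doc != -1; doc = in.nextDoc()) {
            hits.add(doc);
            total += in.score();
        }
        return total;
    }
}

// Minimal array-backed scorer so the sketch is self-contained.
final class ArrayScorer implements DocScorer {
    private final int[] docs;
    private final float[] scores;
    private int i = -1;

    ArrayScorer(int[] docs, float[] scores) { this.docs = docs; this.scores = scores; }

    @Override public int nextDoc() { return ++i < docs.length ? docs[i] : -1; }
    @Override public float score() { return scores[i]; }
}
```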
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565261#comment-17565261 ]

Zach Chen commented on LUCENE-10480:

{quote}Another thing that changes performance sometimes is the doc ID order, were you using multiple indexing threads maybe?
{quote}
Ok, this is actually the case for me. I was previously using 10 threads to index (INDEX_NUM_THREADS = 10), and after I commented that out and reindexed with the default setting, I was able to reproduce the slowdown:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed 91.27 (4.3%) 85.52 (4.3%) -6.3% ( -14% - 2%) 0.000
PKLookup 333.25 (4.3%) 329.48 (3.8%) -1.1% ( -8% - 7%) 0.380
AndHighHigh 104.25 (2.9%) 103.11 (3.0%) -1.1% ( -6% - 5%) 0.247
SpanNear 16.52 (3.8%) 16.36 (3.1%) -0.9% ( -7% - 6%) 0.396
TermGroup10K 23.99 (3.3%) 23.78 (3.0%) -0.9% ( -6% - 5%) 0.384
Phrase 234.74 (2.7%) 232.71 (1.8%) -0.9% ( -5% - 3%) 0.235
AndHighMed 163.80 (3.5%) 162.42 (4.3%) -0.8% ( -8% - 7%) 0.496
TermBGroup1M 48.02 (3.5%) 47.65 (3.7%) -0.8% ( -7% - 6%) 0.496
SloppyPhrase 4.82 (3.4%) 4.78 (2.7%) -0.7% ( -6% - 5%) 0.460
TermGroup100 41.90 (3.9%) 41.63 (3.3%) -0.7% ( -7% - 6%) 0.569
Term 2680.42 (4.7%) 2664.05 (3.3%) -0.6% ( -8% - 7%) 0.632
TermGroup1M 39.95 (2.9%) 39.71 (3.2%) -0.6% ( -6% - 5%) 0.531
TermBGroup1M1P 84.21 (6.1%) 83.82 (5.7%) -0.5% ( -11% - 12%) 0.801
Respell 113.78 (1.9%) 113.44 (1.7%) -0.3% ( -3% - 3%) 0.603
BrowseRandomLabelSSDVFacets 20.75 (8.2%) 20.74 (10.3%) -0.0% ( -17% - 20%) 0.989
Fuzzy2 83.12 (1.8%) 83.11 (1.1%) -0.0% ( -2% - 2%) 0.976
BrowseDayOfYearSSDVFacets 26.69 (12.0%) 26.70 (11.6%) 0.0% ( -21% - 26%) 0.995
Wildcard 115.84 (5.1%) 115.96 (5.8%) 0.1% ( -10% - 11%) 0.951
TermDayOfYearSort 260.70 (5.4%) 260.99 (2.8%) 0.1% ( -7% - 8%) 0.937
AndHighMedDayTaxoFacets 136.32 (2.6%) 136.63 (2.3%) 0.2% ( -4% - 5%) 0.773
IntervalsOrdered 128.13 (7.5%) 128.45 (7.7%) 0.3% ( -13% - 16%) 0.916
AndHighHighDayTaxoFacets 13.82 (2.8%) 13.87 (2.6%) 0.4% ( -4% - 5%) 0.657
Fuzzy1 79.16 (2.7%) 79.60 (1.8%) 0.6% ( -3% - 5%) 0.433
TermMonthSort 360.17 (6.4%) 362.83 (7.1%) 0.7% ( -11% - 15%) 0.728
TermTitleSort 191.21 (6.8%) 192.70 (7.1%) 0.8% ( -12% - 15%) 0.723
TermDTSort 208.40 (2.9%) 210.39 (2.9%) 1.0% ( -4% - 7%) 0.301
MedTermDayTaxoFacets 78.66 (5.2%) 79.59 (4.4%) 1.2% ( -7% - 11%) 0.436
TermDateFacets 41.04 (5.4%) 41.61 (4.7%) 1.4% ( -8% - 12%) 0.385
IntNRQ 122.00 (8.1%) 124.08 (8.3%) 1.7% ( -13% - 19%) 0.513
OrHighMedDayTaxoFacets 23.16 (8.4%) 23.71 (4.9%) 2.4% ( -10% - 17%) 0.272
BrowseMonthSSDVFacets 28.68 (13.8%) 29.55 (16.8%) 3.0% ( -24% - 39%) 0.531
BrowseDayOfYearTaxoFacets 30.40 (32.2%) 31.67 (34.2%) 4.2% ( -47% - 103%) 0.690
BrowseDateTaxoFacets 30.26 (32.2%) 31.57 (34.4%) 4.3% ( -47% - 104%) 0.680
Prefix3 402.14 (8.6%) 419.96 (8.9%) 4.4% ( -12% - 23%) 0.109
AndMedOrHighHigh 94.79 (4.0%) 99.03 (4.5%) 4.5% ( -3% - 13%) 0.001
BrowseRandomLabelTaxoFacets 32.45 (49.2%) 35.05 (53.4%) 8.0% ( -63% - 217%) 0.622
BrowseMonthTaxoFacets 28.68 (35.3%) 31.37 (39.1%) 9.4% ( -48% - 129%) 0.425
BrowseDateSSDVFacets 3.96 (28.1%) 4.54 (26.3%) 14.7% ( -31% - 96%) 0.089
{code}
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564747#comment-17564747 ]

Zach Chen commented on LUCENE-10480:

{quote}I'll see if I can run the original nightly benchmark code / tests from my machine to see if there's any difference.
{quote}
I tried to run *nightlyBench.py* locally on my machine over the weekend, but that turns out to require some changes to the script itself, and I haven't been able to run it fully so far. On the other hand, I tried a few more run configurations with *localrun.py*, including running it in a virtual Ubuntu box (as the nightly benchmark runs on a Linux box), but still have had no luck so far reproducing the [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] slowdown. [~jpountz], just curious, are you able to reproduce the slowdown locally on your end as well?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564611#comment-17564611 ]

Zach Chen commented on LUCENE-10480:

{quote}[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html] recovered fully but [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] only a bit. I'm unsure what explains there is still a slowdown compared to BMW.
{quote}
Hmm, this is quite strange. Looks like [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] was still showing about a -13% (5 / 38) impact. I just ran the full suite of wikinightly tasks a few times (by copying *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, running *localrun.py* with source *wikimedium10m*, and removing *VectorSearch* queries as they were causing NPE failures for me), but couldn't reproduce the slowdown (the baseline is head before all BMM changes):
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseRandomLabelSSDVFacets 20.83 (3.8%) 20.09 (6.5%) -3.6% ( -13% - 6%) 0.034
BrowseMonthSSDVFacets 30.36 (10.6%) 29.56 (12.7%) -2.7% ( -23% - 23%) 0.473
Prefix3 402.70 (9.3%) 397.59 (9.9%) -1.3% ( -18% - 19%) 0.674
TermDayOfYearSort 183.55 (6.5%) 181.61 (6.9%) -1.1% ( -13% - 13%) 0.617
TermTitleSort 195.99 (7.2%) 194.25 (8.1%) -0.9% ( -15% - 15%) 0.713
PKLookup 293.80 (3.7%) 291.47 (4.8%) -0.8% ( -8% - 7%) 0.555
TermMonthSort 283.86 (7.1%) 281.74 (8.0%) -0.7% ( -14% - 15%) 0.755
Wildcard 227.26 (6.2%) 225.87 (6.4%) -0.6% ( -12% - 12%) 0.759
Term 2227.50 (3.7%) 2219.57 (3.3%) -0.4% ( -7% - 6%) 0.748
Fuzzy1 134.77 (2.8%) 134.37 (2.3%) -0.3% ( -5% - 4%) 0.712
TermGroup100 53.61 (3.7%) 53.47 (4.6%) -0.3% ( -8% - 8%) 0.846
TermDTSort 143.16 (3.2%) 142.89 (3.3%) -0.2% ( -6% - 6%) 0.857
TermBGroup1M1P 79.44 (5.5%) 79.29 (5.5%) -0.2% ( -10% - 11%) 0.917
AndHighHighDayTaxoFacets 45.01 (2.3%) 44.94 (2.1%) -0.1% ( -4% - 4%) 0.833
BrowseRandomLabelTaxoFacets 30.94 (50.0%) 30.92 (46.8%) -0.0% ( -64% - 193%) 0.998
AndHighMedDayTaxoFacets 78.11 (3.2%) 78.11 (3.0%) -0.0% ( -6% - 6%) 0.998
Phrase 202.17 (2.7%) 202.18 (2.0%) 0.0% ( -4% - 4%) 0.996
Fuzzy2 76.10 (2.6%) 76.15 (2.0%) 0.1% ( -4% - 4%) 0.933
TermGroup1M 22.65 (3.8%) 22.67 (3.2%) 0.1% ( -6% - 7%) 0.919
TermDateFacets 32.50 (5.3%) 32.60 (5.5%) 0.3% ( -9% - 11%) 0.861
BrowseDayOfYearSSDVFacets 26.31 (5.9%) 26.39 (8.5%) 0.3% ( -13% - 15%) 0.897
Respell 88.21 (2.2%) 88.49 (2.1%) 0.3% ( -3% - 4%) 0.642
SpanNear 16.14 (4.0%) 16.22 (4.2%) 0.5% ( -7% - 9%) 0.706
MedTermDayTaxoFacets 73.42 (4.8%) 73.85 (4.9%) 0.6% ( -8% - 10%) 0.708
TermBGroup1M 48.92 (4.2%) 49.23 (2.8%) 0.6% ( -6% - 8%) 0.581
IntervalsOrdered 22.42 (5.8%) 22.59 (4.2%) 0.7% ( -8% - 11%) 0.651
OrHighMedDayTaxoFacets 25.27 (6.1%) 25.46 (6.6%) 0.7% ( -11% - 14%) 0.711
TermGroup10K 30.26 (4.2%) 30.50 (2.9%) 0.8% ( -6% - 8%) 0.494
SloppyPhrase 91.40 (5.6%) 92.16 (6.3%) 0.8% ( -10% - 13%) 0.662
IntNRQ 152.74 (20.3%) 154.86 (17.1%) 1.4% ( -29% - 48%) 0.815
AndHighMed 88.55 (2.6%) 89.98 (3.1%) 1.6% ( -3% - 7%) 0.073
AndHighHigh 29.10 (2.7%) 29.68 (3.1%) 2.0% ( -3% - 8%) 0.032
BrowseDayOfYearTaxoFacets 31.29 (40.0%) 31.93 (38.0%) 2.0% ( -54% - 133%) 0.869
BrowseDateTaxoFacets 31.18 (40.3%) 31.87 (38.5%) 2.2% ( -54% -
{code}
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563536#comment-17563536 ]

Zach Chen commented on LUCENE-10480:

Ok I see. Maybe I can also try to run some benchmark experiments with different JVM compilation / code cache parameters to further test things out. Will report back if I find something interesting!
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562944#comment-17562944 ]

Zach Chen commented on LUCENE-10480:

{quote}maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores.
{quote}
This approach does help stabilize performance for disjunction-within-conjunction queries (and also provides some small gains)! I have opened a PR for it: [https://github.com/apache/lucene/pull/1006].
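The idea quoted above, deferring work from advance() to matches() so the other conjunction clause can veto a candidate first, is the two-phase iteration pattern. Below is a hypothetical, standalone sketch (invented names, not Lucene's actual TwoPhaseIterator API) showing why it helps: the expensive step only runs on docs that survive the cheap intersection.

```java
import java.util.Set;

// Sketch of two-phase iteration: advancing only positions a cheap
// approximation, while the expensive verification is deferred to matches().
final class TwoPhaseSketch {
    private final int[] approximation; // cheap superset of candidate doc IDs
    private int i = -1;
    int expensiveChecks = 0;           // counts calls to the costly phase

    TwoPhaseSketch(int[] approximation) { this.approximation = approximation; }

    /** Cheap phase: just walks the candidate list. */
    int advanceApproximation() {
        return ++i < approximation.length ? approximation[i] : -1;
    }

    /** Expensive phase (stands in for score computation and the like). */
    boolean matches() {
        expensiveChecks++;
        return true; // every surviving candidate verifies in this toy example
    }

    /**
     * Conjunction driver: the other clause filters candidates BEFORE the
     * expensive matches() call, so vetoed docs never pay for it.
     */
    static int intersect(TwoPhaseSketch tp, Set<Integer> otherClause) {
        int hits = 0;
        for (int doc = tp.advanceApproximation(); doc != -1; doc = tp.advanceApproximation()) {
            if (otherClause.contains(doc) && tp.matches()) {
                hits++;
            }
        }
        return hits;
    }
}
```

With six candidates and only two docs accepted by the other clause, the expensive phase runs twice instead of six times, which is the effect the comment describes for disjunctions nested inside conjunctions.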
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562919#comment-17562919 ]

Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:15 AM:

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}
The results look encouraging and interesting! I copied the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks* and ran the benchmark, and was able to reproduce the slowdown:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed 108.16 (6.5%) 100.44 (5.4%) -7.1% ( -17% - 5%) 0.000
AndMedOrHighHigh 68.37 (4.5%) 63.92 (5.0%) -6.5% ( -15% - 3%) 0.000
AndHighHigh 122.90 (5.5%) 122.77 (5.5%) -0.1% ( -10% - 11%) 0.952
AndHighMed 113.27 (6.4%) 114.63 (6.2%) 1.2% ( -10% - 14%) 0.546
PKLookup 228.08 (14.4%) 232.90 (14.7%) 2.1% ( -23% - 36%) 0.646
OrHighHigh 26.89 (5.7%) 48.62 (12.2%) 80.8% ( 59% - 104%) 0.000
OrHighMed 81.18 (5.9%) 187.05 (12.2%) 130.4% ( 105% - 157%) 0.000
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndMedOrHighHigh 85.67 (5.3%) 73.23 (5.7%) -14.5% ( -24% - -3%) 0.000
PKLookup 260.08 (13.4%) 253.74 (14.9%) -2.4% ( -27% - 29%) 0.586
AndHighHigh 73.68 (4.7%) 72.70 (4.1%) -1.3% ( -9% - 7%) 0.339
AndHighMed 89.52 (5.1%) 88.55 (4.4%) -1.1% ( -10% - 8%) 0.470
AndHighOrMedMed 63.27 (6.5%) 70.48 (5.7%) 11.4% ( 0% - 25%) 0.000
OrHighHigh 19.60 (5.3%) 25.62 (7.6%) 30.8% ( 16% - 46%) 0.000
OrHighMed 121.08 (5.7%) 236.34 (10.2%) 95.2% ( 74% - 117%) 0.000
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndMedOrHighHigh 86.88 (3.4%) 76.60 (3.1%) -11.8% ( -17% - -5%) 0.000
AndHighHigh 30.49 (3.5%) 30.36 (3.5%) -0.4% ( -7% - 6%) 0.697
AndHighMed 192.76 (3.4%) 193.72 (3.9%) 0.5% ( -6% - 8%) 0.671
PKLookup 262.59 (5.5%) 264.52 (7.9%) 0.7% ( -11% - 14%) 0.731
AndHighOrMedMed 65.47 (3.8%) 73.43 (3.0%) 12.2% ( 5% - 19%) 0.000
OrHighHigh 21.47 (4.1%) 36.94 (8.3%) 72.1% ( 57% - 88%) 0.000
OrHighMed 99.91 (4.3%) 292.05 (12.9%) 192.3% ( 167% - 218%) 0.000
{code}
However, when I reduced the tasks further to just conjunction + disjunction (and with the default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed 58.65 (37.3%) 71.63 (28.9%) 22.1% ( -32% - 140%) 0.036
AndMedOrHighHigh 36.43 (39.3%) 44.61 (30.7%) 22.4% ( -34% - 152%) 0.044
PKLookup 163.58 (34.4%) 211.88 (32.7%) 29.5% ( -27% - 147%) 0.005
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
PKLookup 146.51 (22.0%) 188.92 (30.1%) 28.9% ( -18% - 103%) 0.001
AndMedOrHighHigh 35.59 (27.1%) 49.99 (37.5%) 40.4% ( -18% - 144%) 0.000
AndHighOrMedMed 44.47 (26.6%)
{code}
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562919#comment-17562919 ] Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:15 AM:

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]. {quote}

The results look encouraging and interesting! I copied the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, ran the benchmark, and was able to reproduce the slowdown:

{code:java}
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndHighOrMedMed 108.16 (6.5%) 100.44 (5.4%) -7.1% ( -17% - 5%) 0.000
AndMedOrHighHigh 68.37 (4.5%) 63.92 (5.0%) -6.5% ( -15% - 3%) 0.000
AndHighHigh 122.90 (5.5%) 122.77 (5.5%) -0.1% ( -10% - 11%) 0.952
AndHighMed 113.27 (6.4%) 114.63 (6.2%) 1.2% ( -10% - 14%) 0.546
PKLookup 228.08 (14.4%) 232.90 (14.7%) 2.1% ( -23% - 36%) 0.646
OrHighHigh 26.89 (5.7%) 48.62 (12.2%) 80.8% ( 59% - 104%) 0.000
OrHighMed 81.18 (5.9%) 187.05 (12.2%) 130.4% ( 105% - 157%) 0.000
{code}

{code:java}
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndMedOrHighHigh 85.67 (5.3%) 73.23 (5.7%) -14.5% ( -24% - -3%) 0.000
PKLookup 260.08 (13.4%) 253.74 (14.9%) -2.4% ( -27% - 29%) 0.586
AndHighHigh 73.68 (4.7%) 72.70 (4.1%) -1.3% ( -9% - 7%) 0.339
AndHighMed 89.52 (5.1%) 88.55 (4.4%) -1.1% ( -10% - 8%) 0.470
AndHighOrMedMed 63.27 (6.5%) 70.48 (5.7%) 11.4% ( 0% - 25%) 0.000
OrHighHigh 19.60 (5.3%) 25.62 (7.6%) 30.8% ( 16% - 46%) 0.000
OrHighMed 121.08 (5.7%) 236.34 (10.2%) 95.2% ( 74% - 117%) 0.000
{code}

{code:java}
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndMedOrHighHigh 86.88 (3.4%) 76.60 (3.1%) -11.8% ( -17% - -5%) 0.000
AndHighHigh 30.49 (3.5%) 30.36 (3.5%) -0.4% ( -7% - 6%) 0.697
AndHighMed 192.76 (3.4%) 193.72 (3.9%) 0.5% ( -6% - 8%) 0.671
PKLookup 262.59 (5.5%) 264.52 (7.9%) 0.7% ( -11% - 14%) 0.731
AndHighOrMedMed 65.47 (3.8%) 73.43 (3.0%) 12.2% ( 5% - 19%) 0.000
OrHighHigh 21.47 (4.1%) 36.94 (8.3%) 72.1% ( 57% - 88%) 0.000
OrHighMed 99.91 (4.3%) 292.05 (12.9%) 192.3% ( 167% - 218%) 0.000
{code}

However, when I reduced the tasks further to just conjunctions + disjunctions (and with the default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]

{code:java}
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndHighOrMedMed 58.65 (37.3%) 71.63 (28.9%) 22.1% ( -32% - 140%) 0.036
AndMedOrHighHigh 36.43 (39.3%) 44.61 (30.7%) 22.4% ( -34% - 152%) 0.044
PKLookup 163.58 (34.4%) 211.88 (32.7%) 29.5% ( -27% - 147%) 0.005
{code}

{code:java}
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 146.51 (22.0%) 188.92 (30.1%) 28.9% ( -18% - 103%) 0.001
AndMedOrHighHigh 35.59 (27.1%) 49.99 (37.5%) 40.4% ( -18% - 144%) 0.000
AndHighOrMedMed 44.47 (26.6%)
[jira] [Commented] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added
[ https://issues.apache.org/jira/browse/LUCENE-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561789#comment-17561789 ] Zach Chen commented on LUCENE-10635: I like this idea! This approach should also be able to preserve most of the assertions in the test utilities. I can give it a try and see how things might look. > Ensure test coverage for WANDScorer after additional scorers get added > -- > > Key: LUCENE-10635 > URL: https://issues.apache.org/jira/browse/LUCENE-10635 > Project: Lucene - Core > Issue Type: Test >Reporter: Zach Chen >Priority: Major > > This is a follow-up issue from discussions > [https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & > [https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] . > > As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in > TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer > instead, reducing test coverage for WANDScorer. We would like to see how we > can ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating > the scorer directly inside the tests? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10636) Could the partial score sum from essential list scores be cached?
Zach Chen created LUCENE-10636: -- Summary: Could the partial score sum from essential list scores be cached? Key: LUCENE-10636 URL: https://issues.apache.org/jira/browse/LUCENE-10636 Project: Lucene - Core Issue Type: Improvement Reporter: Zach Chen This is a follow-up issue from discussion [https://github.com/apache/lucene/pull/972#discussion_r909300200] . Currently in the implementation of BlockMaxMaxscoreScorer, there's duplicated computation of summing up scores from essential list scorers. We would like to see if this duplicated computation can be cached without introducing much overhead or data structures that might outweigh the benefit of caching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
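One low-overhead shape such a cache could take is sketched below. This is a hypothetical illustration, not the actual BlockMaxMaxscoreScorer code: it simply remembers the doc ID the sum was last computed for and recomputes only when asked about a different doc, so repeated calls for the same doc pay only a branch.

```java
import java.util.List;
import java.util.function.IntToDoubleFunction;

public class CachedEssentialSum {
  // Stand-ins for the essential clauses' Scorer#score() calls (hypothetical).
  private final List<IntToDoubleFunction> essentialScorers;
  private int cachedDoc = -1; // doc ID the cached sum belongs to
  private double cachedSum;

  public CachedEssentialSum(List<IntToDoubleFunction> essentialScorers) {
    this.essentialScorers = essentialScorers;
  }

  /** Sum of essential-clause scores for doc, computed at most once per doc. */
  public double essentialSum(int doc) {
    if (doc != cachedDoc) {
      double sum = 0;
      for (IntToDoubleFunction scorer : essentialScorers) {
        sum += scorer.applyAsDouble(doc);
      }
      cachedDoc = doc;
      cachedSum = sum;
    }
    return cachedSum;
  }
}
```

The cache only pays off when the sum is requested more than once per doc (e.g. once for a pruning check and once for actual scoring); whether that happens often enough to beat the extra branch and fields is exactly the trade-off the issue raises.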
[jira] [Created] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added
Zach Chen created LUCENE-10635: -- Summary: Ensure test coverage for WANDScorer after additional scorers get added Key: LUCENE-10635 URL: https://issues.apache.org/jira/browse/LUCENE-10635 Project: Lucene - Core Issue Type: Test Reporter: Zach Chen This is a follow-up issue from discussions [https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & [https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] . As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer instead, reducing test coverage for WANDScorer. We would like to see how we can ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating the scorer directly inside the tests? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10411) Add NN vectors support to ExitableDirectoryReader
[ https://issues.apache.org/jira/browse/LUCENE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen resolved LUCENE-10411. Assignee: Zach Chen Resolution: Implemented > Add NN vectors support to ExitableDirectoryReader > - > > Key: LUCENE-10411 > URL: https://issues.apache.org/jira/browse/LUCENE-10411 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Zach Chen >Priority: Minor > Time Spent: 4h 10m > Remaining Estimate: 0h > > This is currently unsupported. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552556#comment-17552556 ] Zach Chen edited comment on LUCENE-10480 at 6/10/22 5:15 AM: - Hi [~jpountz] , this issue reminded me of our experiments last year implementing BMM scorer for pure disjunction, which [showed about 20% ~ 40% improvement for OrHighHigh and OrHighMed queries|https://github.com/apache/lucene/pull/101#issuecomment-840255508] . Do you think we should continue to explore in that direction, or there might be better / simpler algorithms we could look into? was (Author: zacharymorn): Hi [~jpountz] , this issue reminded me of our experiments last year implementing BMM scorer for pure disjunction, which [showed about 20% ~ 40% improvement for OrHighHigh and OrHighMed queries|[https://github.com/apache/lucene/pull/101#issuecomment-840255508].] Do you think we should continue to explore in that direction, or there might be better / simpler algorithms we could look into? > Specialize 2-clauses disjunctions > - > > Key: LUCENE-10480 > URL: https://issues.apache.org/jira/browse/LUCENE-10480 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > > WANDScorer is nice, but it also has lots of overhead to maintain its > invariants: one linked list for the current candidates, one priority queue of > scorers that are behind, another one for scorers that are ahead. All this > could be simplified in the 2-clauses case, which feels worth specializing for > as it's very common that end users enter queries that only have two terms? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
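For the two-clause disjunction the issue describes, the bookkeeping can indeed collapse: with only two iterators there is no need for a linked list or priority queues, since advancing whichever head is smaller is enough. The sketch below illustrates this over plain sorted doc-ID arrays; it is a hypothetical simplification, not Lucene's DocIdSetIterator API.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoClauseDisjunction {
  /**
   * Returns the doc IDs matching either clause, in order. For two clauses,
   * a min-of-two comparison replaces the general-purpose heap.
   */
  static List<Integer> docs(int[] clauseA, int[] clauseB) {
    List<Integer> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < clauseA.length || j < clauseB.length) {
      int a = i < clauseA.length ? clauseA[i] : Integer.MAX_VALUE;
      int b = j < clauseB.length ? clauseB[j] : Integer.MAX_VALUE;
      int doc = Math.min(a, b);
      out.add(doc);
      if (a == doc) i++;
      if (b == doc) j++; // both advance when the doc matches both clauses
    }
    return out;
  }
}
```

A real scorer would also track per-clause max scores for pruning, but the iteration core stays this simple in the 2-clause case.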
[jira] [Resolved] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen resolved LUCENE-10436. Resolution: Done > Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery into a single FieldExistsQuery? > -- > > Key: LUCENE-10436 > URL: https://issues.apache.org/jira/browse/LUCENE-10436 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 6.5h > Remaining Estimate: 0h > > Now that we require consistency across data structures, we could merge > DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require > that the field indexes either norms, doc values or vectors? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen reassigned LUCENE-10436: -- Assignee: Zach Chen > Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery into a single FieldExistsQuery? > -- > > Key: LUCENE-10436 > URL: https://issues.apache.org/jira/browse/LUCENE-10436 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Zach Chen >Priority: Minor > Time Spent: 6.5h > Remaining Estimate: 0h > > Now that we require consistency across data structures, we could merge > DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require > that the field indexes either norms, doc values or vectors? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10411) Add NN vectors support to ExitableDirectoryReader
[ https://issues.apache.org/jira/browse/LUCENE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527308#comment-17527308 ] Zach Chen commented on LUCENE-10411: Hi [~jpountz] , I have created a PR for this. Could you please take a look and let me know your thoughts? > Add NN vectors support to ExitableDirectoryReader > - > > Key: LUCENE-10411 > URL: https://issues.apache.org/jira/browse/LUCENE-10411 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > This is currently unsupported. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?
[ https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512660#comment-17512660 ] Zach Chen commented on LUCENE-10436: Hi [~jpountz] , I took a look and created a PR for this [https://github.com/apache/lucene/pull/767] . Could you please let me know if it looks good to you? > Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery into a single FieldExistsQuery? > -- > > Key: LUCENE-10436 > URL: https://issues.apache.org/jira/browse/LUCENE-10436 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > Now that we require consistency across data structures, we could merge > DocValuesFieldExistsQuery, NormsFieldExistsQuery and > KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require > that the field indexes either norms, doc values or vectors? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen resolved LUCENE-10236. Resolution: Fixed > CombinedFieldsQuery to use fieldAndWeights.values() when constructing > MultiNormsLeafSimScorer for scoring > - > > Key: LUCENE-10236 > URL: https://issues.apache.org/jira/browse/LUCENE-10236 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Zach Chen >Assignee: Zach Chen >Priority: Minor > Time Spent: 6h 50m > Remaining Estimate: 0h > > This is a spin-off issue from discussion in > [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a > quick fix in CombinedFieldsQuery scoring. > Currently CombinedFieldsQuery would use a constructed > [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] > object to create a MultiNormsLeafSimScorer for scoring, but the fields > object may contain duplicated field-weight pairs as it is [built from looping > over > fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], > resulting into duplicated norms being added during scoring calculation in > MultiNormsLeafSimScorer. > E.g. 
for CombinedFieldsQuery with two fields and two values matching a > particular doc: > {code:java} > CombinedFieldQuery query = > new CombinedFieldQuery.Builder() > .addField("field1", (float) 1.0) > .addField("field2", (float) 1.0) > .addTerm(new BytesRef("foo")) > .addTerm(new BytesRef("zoo")) > .build(); {code} > I would imagine the scoring to be based on the following: > # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + > freq(field1:zoo) + freq(field2:zoo) > # Sum of norms on doc = norm(field1) + norm(field2) > but the current logic would use the following for scoring: > # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + > freq(field1:zoo) + freq(field2:zoo) > # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + > norm(field2) > > In addition, this differs from how MultiNormsLeafSimScorer is constructed > from CombinedFieldsQuery explain function, which [uses > fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] > and does not contain duplicated field-weight pairs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
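The double-counting described in the issue can be illustrated with a small self-contained sketch. The norm values below are made up for illustration and the code does not use Lucene's actual API; it only shows why summing one norm contribution per (field, term) pair counts each field twice when there are two terms, while summing per distinct field (the fieldAndWeights.values() behavior) counts each field once.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NormSumDemo {
  // Hypothetical per-field norm values for a single matching doc.
  static final Map<String, Double> NORMS = new LinkedHashMap<>();
  static {
    NORMS.put("field1", 4.0);
    NORMS.put("field2", 6.0);
  }

  /** Buggy variant: one norm contribution per (field, term) pair, as when looping over fieldTerms. */
  static double sumNormsPerFieldTermPair(String[] fields, String[] terms) {
    double sum = 0;
    for (String term : terms) {
      for (String field : fields) {
        sum += NORMS.get(field); // each field's norm added once per term
      }
    }
    return sum;
  }

  /** Fixed variant: one norm contribution per distinct field. */
  static double sumNormsPerField() {
    double sum = 0;
    for (double norm : NORMS.values()) {
      sum += norm;
    }
    return sum;
  }

  public static void main(String[] args) {
    String[] fields = {"field1", "field2"};
    String[] terms = {"foo", "zoo"};
    System.out.println("per (field, term) pair: " + sumNormsPerFieldTermPair(fields, terms)); // 20.0
    System.out.println("per distinct field:     " + sumNormsPerField()); // 10.0
  }
}
```

With two terms, the first variant yields norm(field1) + norm(field2) + norm(field1) + norm(field2) = 20.0 versus 10.0 for the deduplicated sum, matching the discrepancy the issue describes.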
[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent
[ https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485556#comment-17485556 ] Zach Chen commented on LUCENE-9662: --- I've approved the null check PR. Thanks [~mdrob] ! For resolving this issue, I think so? So far the implementation has parallelized checking across segments, but within each segment it's still sequential. We initially started by parallelizing within each segment, but found the speed-up to be limited, as it's dominated by checking the biggest parts within a segment (typically the posting file checked by `testPostings`). We could potentially look into breaking that up into smaller pieces to increase parallelization, but I'm not sure if it's worth the effort / complexity in code. What do you think [~mikemccand] ? > CheckIndex should be concurrent > --- > > Key: LUCENE-9662 > URL: https://issues.apache.org/jira/browse/LUCENE-9662 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Priority: Major > Time Spent: 20h 50m > Remaining Estimate: 0h > > I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, > using a single core out of the 128 cores the box has. > It seems like this is an embarrassingly parallel problem, if the index has > multiple segments, and would finish much more quickly on concurrent hardware > if we did "thread per segment". > If wanted to get even further concurrency, each part of the Lucene index that > is checked is also independent, so it could be "thread per segment per part". -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
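The "thread per segment" scheme discussed above can be sketched roughly as follows. This is a hypothetical stand-in, not the real CheckIndex code: one task is submitted per segment to a fixed-size pool, and collecting the futures in order propagates any per-segment failure back to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerSegmentCheck {
  /** Hypothetical stand-in for checking a single segment. */
  static String checkSegment(String segmentName) {
    return segmentName + ": OK";
  }

  /** Checks all segments concurrently, one task per segment. */
  static List<String> checkAll(List<String> segments, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String segment : segments) {
        futures.add(pool.submit(() -> checkSegment(segment)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        try {
          results.add(future.get()); // rethrows any per-segment failure
        } catch (Exception e) {
          throw new RuntimeException("segment check failed", e);
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(checkAll(List.of("_0", "_1", "_2"), 2));
  }
}
```

Going further to "thread per segment per part" would just mean submitting one task per (segment, part) pair, at the cost of the extra code complexity the comment mentions.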
[jira] [Commented] (LUCENE-10183) KnnVectorsWriter#writeField should take a KnnVectorsReader, not a VectorValues instance
[ https://issues.apache.org/jira/browse/LUCENE-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457534#comment-17457534 ] Zach Chen commented on LUCENE-10183: Hi [~jpountz] , I've opened a PR for this issue. Please let me know if it looks good to you. > KnnVectorsWriter#writeField should take a KnnVectorsReader, not a > VectorValues instance > --- > > Key: LUCENE-10183 > URL: https://issues.apache.org/jira/browse/LUCENE-10183 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > By taking a VectorValues instance, KnnVectorsWriter#write doesn't let > implementations iterate over vectors multiple times if needed. It should take > a KnnVectorsReader similarly to doc values, where the writer takes a > DocValuesProducer. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-10236: --- Description: This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring. Currently CombinedFieldsQuery would use a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting into duplicated norms being added during scoring calculation in MultiNormsLeafSimScorer. E.g. 
for CombinedFieldsQuery with two fields and two values matching a particular doc: {code:java} CombinedFieldQuery query = new CombinedFieldQuery.Builder() .addField("field1", (float) 1.0) .addField("field2", (float) 1.0) .addTerm(new BytesRef("foo")) .addTerm(new BytesRef("zoo")) .build(); {code} I would imagine the scoring to be based on the following: # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo) # Sum of norms on doc = norm(field1) + norm(field2) but the current logic would use the following for scoring: # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo) # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2) In addition, this differs from how MultiNormsLeafSimScorer is constructed from CombinedFieldsQuery's explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs.
> CombinedFieldsQuery to use fieldAndWeights.values() when constructing > MultiNormsLeafSimScorer for scoring > - > > Key: LUCENE-10236 > URL: https://issues.apache.org/jira/browse/LUCENE-10236 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Zach Chen >Assignee: Zach Chen >Priority: Minor > > This is a spin-off issue from > [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a > quick fix in CombinedFieldsQuery scoring. > Currently
[jira] [Created] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
Zach Chen created LUCENE-10236: -- Summary: CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring Key: LUCENE-10236 URL: https://issues.apache.org/jira/browse/LUCENE-10236 Project: Lucene - Core Issue Type: Improvement Components: modules/sandbox Reporter: Zach Chen Assignee: Zach Chen This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring. Currently CombinedFieldsQuery would use a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting in duplicated norms being added during scoring calculation in MultiNormsLeafSimScorer. E.g. 
for CombinedFieldsQuery with two fields and two values matching a particular doc: {code:java} CombinedFieldQuery query = new CombinedFieldQuery.Builder() .addField("field1", (float) 1.0) .addField("field2", (float) 1.0) .addTerm(new BytesRef("foo")) .addTerm(new BytesRef("zoo")) .build(); {code} I would imagine the scoring to be based on the following: # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo) # Sum of norms on doc = norm(field1) + norm(field2) but the current logic would use the following for scoring: # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo) # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2) In addition, this differs from how MultiNormsLeafSimScorer is constructed from CombinedFieldsQuery explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
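The duplicated-norms arithmetic above can be reproduced with a small standalone sketch (hypothetical class and method names, not the actual Lucene code; the fix the issue proposes is simply to build the MultiNormsLeafSimScorer from fieldAndWeights.values(), which holds each field-weight pair once):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Toy reproduction of the double counting (hypothetical names): building the
// per-document field list by looping over (field, term) pairs adds each field
// once per matching term, so each field's norm is summed once per term
// instead of once per field.
public class NormDedupeSketch {
  // Mirrors the "built from looping over fieldTerms" behaviour.
  static List<String> fieldsFromFieldTerms(String[] fields, String[] terms) {
    List<String> out = new ArrayList<>();
    for (String term : terms) {
      for (String field : fields) {
        out.add(field); // "field1", "field2" added once per matching term
      }
    }
    return out;
  }

  // Mirrors the fieldAndWeights.values() approach: one entry per field.
  static List<String> dedupedFields(String[] fields) {
    return new ArrayList<>(new LinkedHashSet<>(List.of(fields)));
  }

  public static void main(String[] args) {
    String[] fields = {"field1", "field2"};
    String[] terms = {"foo", "zoo"};
    // Two terms x two fields: each field's norm would be counted twice.
    if (fieldsFromFieldTerms(fields, terms).size() != 4) throw new AssertionError();
    // Deduplicated: each field's norm counted once.
    if (dedupedFields(fields).size() != 2) throw new AssertionError();
    System.out.println("ok");
  }
}
```

With two fields and two matching terms this yields the norm(field1) + norm(field2) + norm(field1) + norm(field2) sum described above, versus norm(field1) + norm(field2) after deduplication.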
[jira] [Commented] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444254#comment-17444254 ] Zach Chen commented on LUCENE-10212: No problem [~julietibs]! Glad to be able to contribute! > Add luceneutil benchmark task for CombinedFieldsQuery > - > > Key: LUCENE-10212 > URL: https://issues.apache.org/jira/browse/LUCENE-10212 > Project: Lucene - Core > Issue Type: Task >Reporter: Zach Chen >Assignee: Zach Chen >Priority: Minor > > This is a spin-off task from > https://issues.apache.org/jira/browse/LUCENE-10061 . In order to objectively > evaluate performance changes for CombinedFieldsQuery, we would like to add > benchmark task and parsing for CombinedFieldsQuery. > One proposal to the query syntax to enable CombinedFieldsQuery benchmarking > would be the following: > {code:java} > taskName: term1 term2 term3 term4 > +combinedFields=field1^boost1,field2^boost2,field3^boost3 > {code} > > >
[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873 ] Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM: -- {quote}Thanks for exploring this area [~zacharymorn]! {quote} No problem, I'm always interested in exploring and learning about Lucene querying! {quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. {quote} I think in my current understanding and testing of CombinedFieldQuery, WANDScorer is actually not used there ([it doesn't get written to BooleanQuery for most of the time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here? {quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset. {quote} Sounds good, will give that a try! was (Author: zacharymorn): {quote}Thanks for exploring this area [~zacharymorn]! {quote} No problem, I'm always interested in exploring and learning about Lucene querying! {quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. {quote} I think in my current understanding and testing of CombinedFieldQuery, WANDScorer is not used there. In addition, the PR is already doing Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here? {quote}I see that you tested with 4 and 2 as boost values. 
I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset. {quote} Sounds good, will give that a try! > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks > > Time Spent: 50m > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873 ] Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM: -- {quote}Thanks for exploring this area [~zacharymorn]! {quote} No problem, I'm always interested in exploring and learning about Lucene querying! {quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. {quote} I think in my current understanding and testing of CombinedFieldQuery, WANDScorer is actually not used there ([it very much doesn't get re-written to BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here? {quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset. {quote} Sounds good, will give that a try! was (Author: zacharymorn): {quote}Thanks for exploring this area [~zacharymorn]! {quote} No problem, I'm always interested in exploring and learning about Lucene querying! {quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. 
{quote} I think in my current understanding and testing of CombinedFieldQuery, WANDScorer is actually not used there ([it doesn't get written to BooleanQuery for most of the time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]). In addition, the PR is already doing Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here? {quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset. {quote} Sounds good, will give that a try! > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks > > Time Spent: 50m > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873 ] Zach Chen commented on LUCENE-10061: {quote}Thanks for exploring this area [~zacharymorn]! {quote} No problem, I'm always interested in exploring and learning about Lucene querying! {quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of pruning, since Maxscore tends to have lower overhead than WAND. {quote} I think in my current understanding and testing of CombinedFieldQuery, WANDScorer is not used there. In addition, the PR is already doing Maxscore-like calculation based on competitive impacts to skip docs. Am I missing anything here? {quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes a difference if you try out e.g. 20 and 1 instead. I just looked again at table 3.1 on [https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and the optimal weights that they found for title/body were 38.4/1 on one dataset and 13.5/1 on another dataset. {quote} Sounds good, will give that a try! > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks > > Time Spent: 50m > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028 ] Zach Chen edited comment on LUCENE-10061 at 11/5/21, 4:50 AM: -- Hi [~jpountz], I've implemented a quick optimization to replace the combinatorial calculation with an upper-bound approximation ([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]). With this and other bug fixes / optimizations based on CPU profiling, I was able to get the following performance test results (perf test index rebuilt to enable norms for the title field, task file attached, and luceneutil integration available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
# Run 1
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 4.64 (6.5%) 3.30 (4.7%) -29.0% ( -37% - -19%) 0.000
CFQHighHigh 11.09 (6.0%) 9.61 (6.0%) -13.3% ( -23% - -1%) 0.000
PKLookup 103.38 (4.4%) 108.04 (4.3%) 4.5% ( -4% - 13%) 0.001
CFQHighMedLow 10.58 (6.1%) 12.30 (8.7%) 16.2% ( 1% - 33%) 0.000
CFQHighMed 10.70 (7.4%) 15.51 (11.2%) 44.9% ( 24% - 68%) 0.000
CFQHighLowLow 8.18 (8.2%) 12.87 (11.6%) 57.3% ( 34% - 84%) 0.000
CFQHighLow 14.57 (7.5%) 30.81 (15.1%) 111.4% ( 82% - 144%) 0.000

# Run 2
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 5.33 (5.7%) 4.02 (7.7%) -24.4% ( -35% - -11%) 0.000
CFQHighLowLow 17.14 (6.2%) 13.06 (5.4%) -23.8% ( -33% - -13%) 0.000
CFQHighMed 17.37 (5.8%) 14.38 (7.7%) -17.2% ( -29% - -3%) 0.000
PKLookup 103.57 (5.5%) 108.84 (5.9%) 5.1% ( -6% - 17%) 0.005
CFQHighMedLow 11.25 (7.2%) 12.70 (9.0%) 12.9% ( -3% - 31%) 0.000
CFQHighHigh 5.00 (6.2%) 7.54 (12.1%) 51.0% ( 30% - 73%) 0.000
CFQHighLow 21.60 (5.2%) 34.57 (14.1%) 60.0% ( 38% - 83%) 0.000

# Run 3
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 5.40 (6.9%) 4.06 (5.1%) -24.8% ( -34% - -13%) 0.000
CFQHighMedLow 7.64 (7.4%) 5.79 (6.3%) -24.2% ( -35% - -11%) 0.000
CFQHighHigh 11.11 (7.0%) 9.60 (5.9%) -13.6% ( -24% - 0%) 0.000
CFQHighLowLow 21.21 (7.6%) 21.22 (6.6%) 0.0% ( -13% - 15%) 0.993
PKLookup 103.15 (5.9%) 107.60 (6.9%) 4.3% ( -8% - 18%) 0.034
CFQHighLow 21.85 (8.1%) 34.18 (13.5%) 56.4% ( 32% - 84%) 0.000
CFQHighMed 12.07 (8.4%) 19.98 (16.7%) 65.5% ( 37% - 98%) 0.000

# Run 4
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHigh 8.50 (5.8%) 6.85 (5.2%) -19.5% ( -28% - -8%) 0.000
CFQHighMedLow 10.89 (5.7%) 8.96 (5.4%) -17.8% ( -27% - -7%) 0.000
CFQHighMed 8.41 (5.8%) 7.74 (5.6%) -7.9% ( -18% - 3%) 0.000
CFQHighHighHigh 3.45 (6.7%) 3.38 (5.3%) -2.0% ( -13% - 10%) 0.287
CFQHighLowLow 7.82 (6.4%) 8.20 (7.5%) 4.8% ( -8% - 20%) 0.030
PKLookup 103.50 (5.0%) 110.69 (5.4%) 6.9% ( -3% - 18%) 0.000
CFQHighLow 11.46 (6.0%) 13.16 (6.7%) 14.8% ( 1% - 29%) 0.000
{code}
I think overall this shows that the pruning will be most effective when there's a significant difference between terms' frequencies, but will slow things down if they are close, as the cost of pruning outweighs the efficacy of skipping. I'm wondering if we should then gate the pruning by checking the frequencies as well, but from some quick trials that seems to be an expensive operation? Do you have any recommendation for this scenario?
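For illustration, the upper-bound idea mentioned above can be sketched roughly as follows. This is a simplified, hypothetical formula (summing each field's best case independently), not the actual code in the linked commit:

```java
// Simplified illustration (hypothetical formula) of bounding the best
// possible combined score by summing each field's best case independently:
// weight * maxFreq / minNorm per field. This avoids enumerating all
// freq/norm combinations at the cost of a looser (but still safe) bound.
public class UpperBoundSketch {
  static double upperBound(double[] weights, long[] maxFreqs, long[] minNorms) {
    double bound = 0;
    for (int f = 0; f < weights.length; f++) {
      // Best case for this field: its largest freq against its smallest norm.
      bound += weights[f] * (double) maxFreqs[f] / minNorms[f];
    }
    return bound;
  }

  public static void main(String[] args) {
    // Two fields with boosts 4 and 2: 4*3/1 + 2*5/2 = 17.0
    double b = upperBound(new double[] {4.0, 2.0}, new long[] {3, 5}, new long[] {1, 2});
    if (b != 17.0) throw new AssertionError();
    System.out.println("upper bound = " + b);
  }
}
```

Because the bound never underestimates the true maximum score, skipping on it stays correct; it only becomes less selective than the exact combinatorial maximum.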
[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028 ] Zach Chen commented on LUCENE-10061: Hi [~jpountz], I've implemented a quick optimization to replace the combinatorial calculation with an upper-bound approximation ([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]). With this and other bug fixes / optimizations based on CPU profiling, I was able to get the following performance test results (perf test index rebuilt to enable norms for the title field, task file attached, and luceneutil integration available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
Run 1
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 4.64 (6.5%) 3.30 (4.7%) -29.0% ( -37% - -19%) 0.000
CFQHighHigh 11.09 (6.0%) 9.61 (6.0%) -13.3% ( -23% - -1%) 0.000
PKLookup 103.38 (4.4%) 108.04 (4.3%) 4.5% ( -4% - 13%) 0.001
CFQHighMedLow 10.58 (6.1%) 12.30 (8.7%) 16.2% ( 1% - 33%) 0.000
CFQHighMed 10.70 (7.4%) 15.51 (11.2%) 44.9% ( 24% - 68%) 0.000
CFQHighLowLow 8.18 (8.2%) 12.87 (11.6%) 57.3% ( 34% - 84%) 0.000
CFQHighLow 14.57 (7.5%) 30.81 (15.1%) 111.4% ( 82% - 144%) 0.000

Run 2
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 5.33 (5.7%) 4.02 (7.7%) -24.4% ( -35% - -11%) 0.000
CFQHighLowLow 17.14 (6.2%) 13.06 (5.4%) -23.8% ( -33% - -13%) 0.000
CFQHighMed 17.37 (5.8%) 14.38 (7.7%) -17.2% ( -29% - -3%) 0.000
PKLookup 103.57 (5.5%) 108.84 (5.9%) 5.1% ( -6% - 17%) 0.005
CFQHighMedLow 11.25 (7.2%) 12.70 (9.0%) 12.9% ( -3% - 31%) 0.000
CFQHighHigh 5.00 (6.2%) 7.54 (12.1%) 51.0% ( 30% - 73%) 0.000
CFQHighLow 21.60 (5.2%) 34.57 (14.1%) 60.0% ( 38% - 83%) 0.000

Run 3
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHighHigh 5.40 (6.9%) 4.06 (5.1%) -24.8% ( -34% - -13%) 0.000
CFQHighMedLow 7.64 (7.4%) 5.79 (6.3%) -24.2% ( -35% - -11%) 0.000
CFQHighHigh 11.11 (7.0%) 9.60 (5.9%) -13.6% ( -24% - 0%) 0.000
CFQHighLowLow 21.21 (7.6%) 21.22 (6.6%) 0.0% ( -13% - 15%) 0.993
PKLookup 103.15 (5.9%) 107.60 (6.9%) 4.3% ( -8% - 18%) 0.034
CFQHighLow 21.85 (8.1%) 34.18 (13.5%) 56.4% ( 32% - 84%) 0.000
CFQHighMed 12.07 (8.4%) 19.98 (16.7%) 65.5% ( 37% - 98%) 0.000

Run 4
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
CFQHighHigh 8.50 (5.8%) 6.85 (5.2%) -19.5% ( -28% - -8%) 0.000
CFQHighMedLow 10.89 (5.7%) 8.96 (5.4%) -17.8% ( -27% - -7%) 0.000
CFQHighMed 8.41 (5.8%) 7.74 (5.6%) -7.9% ( -18% - 3%) 0.000
CFQHighHighHigh 3.45 (6.7%) 3.38 (5.3%) -2.0% ( -13% - 10%) 0.287
CFQHighLowLow 7.82 (6.4%) 8.20 (7.5%) 4.8% ( -8% - 20%) 0.030
PKLookup 103.50 (5.0%) 110.69 (5.4%) 6.9% ( -3% - 18%) 0.000
CFQHighLow 11.46 (6.0%) 13.16 (6.7%) 14.8% ( 1% - 29%) 0.000
{code}
I think overall this shows that the pruning will be most effective when there's a significant difference between terms' frequencies, but will slow things down if they are close, as the cost of pruning outweighs the efficacy of skipping. I'm wondering if we should then gate the pruning by checking the frequencies as well, but from some quick trials that seems to be an expensive operation? Do you have any recommendation for this scenario? > CombinedFieldsQuery needs dynamic pruning
[jira] [Updated] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-10061: --- Attachment: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks > > Time Spent: 0.5h > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436233#comment-17436233 ] Zach Chen commented on LUCENE-10061: Thanks [~jpountz] for the pointer! I have created a spin-off task for luceneutil integration https://issues.apache.org/jira/browse/LUCENE-10212, and will actually work on it first and circle back to this task afterward. > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Updated] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-10212: --- Description: This is a spin-off task from https://issues.apache.org/jira/browse/LUCENE-10061 . In order to objectively evaluate performance changes for CombinedFieldsQuery, we would like to add benchmark task and parsing for CombinedFieldsQuery. One proposal to the query syntax to enable CombinedFieldsQuery benchmarking would be the following: {code:java} taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3 {code} > Add luceneutil benchmark task for CombinedFieldsQuery > - > > Key: LUCENE-10212 > URL: https://issues.apache.org/jira/browse/LUCENE-10212 > Project: Lucene - Core > Issue Type: Task >Reporter: Zach Chen >Assignee: Zach Chen >Priority: Minor > > This is a spin-off task from > https://issues.apache.org/jira/browse/LUCENE-10061 . In order to objectively > evaluate performance changes for CombinedFieldsQuery, we would like to add > benchmark task and parsing for CombinedFieldsQuery. > One proposal to the query syntax to enable CombinedFieldsQuery benchmarking > would be the following: > {code:java} > taskName: term1 term2 term3 term4 > +combinedFields=field1^boost1,field2^boost2,field3^boost3 > {code} > > >
[jira] [Created] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery
Zach Chen created LUCENE-10212: -- Summary: Add luceneutil benchmark task for CombinedFieldsQuery Key: LUCENE-10212 URL: https://issues.apache.org/jira/browse/LUCENE-10212 Project: Lucene - Core Issue Type: Task Reporter: Zach Chen Assignee: Zach Chen This is a spin-off task from https://issues.apache.org/jira/browse/LUCENE-10061 . In order to objectively evaluate performance changes for CombinedFieldsQuery, we would like to add benchmark task and parsing for CombinedFieldsQuery. One proposal to the query syntax to enable CombinedFieldsQuery benchmarking would be the following: {code:java} taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3 {code}
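A rough sketch of how a task line in the proposed syntax could be parsed (a hypothetical helper for illustration, not the actual luceneutil task parser):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the proposed task syntax:
//   taskName: term1 term2 ... +combinedFields=field1^boost1,field2^boost2
public class CombinedFieldsTaskParser {
  // Returns field -> boost parsed from the +combinedFields=... clause;
  // a field without an explicit ^boost defaults to 1.0.
  static Map<String, Float> parseFields(String taskLine) {
    Map<String, Float> fields = new LinkedHashMap<>();
    for (String token : taskLine.split("\\s+")) {
      if (token.startsWith("+combinedFields=")) {
        String spec = token.substring("+combinedFields=".length());
        for (String fieldBoost : spec.split(",")) {
          String[] fb = fieldBoost.split("\\^");
          fields.put(fb[0], fb.length > 1 ? Float.parseFloat(fb[1]) : 1.0f);
        }
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    Map<String, Float> f =
        parseFields("CFQHighLow: foo zoo +combinedFields=title^4.0,body^2.0");
    if (f.size() != 2) throw new AssertionError();
    if (f.get("title") != 4.0f) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The remaining whitespace-separated tokens before the +combinedFields clause would be the query terms, as in the existing task-file conventions.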
[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435789#comment-17435789 ] Zach Chen commented on LUCENE-10061: Thanks for the confirmation [~jpountz]! I've actually given it a try in the last few days and just opened a WIP PR [https://github.com/apache/lucene/pull/418] for it, before seeing your comment above. From the results of a few samples (documented in the PR), assuming there's no bug in the implementation, it does seem that the basic pruning would be most effective in the overall performance when there's a significant difference in terms' doc frequencies (HighLow), but would indeed slow down when doc frequencies are close (HighHigh / HighMed), and very likely the overhead of combinatorial calculation / pruning logic outweighs the benefit of skipping. I will try to implement your optimization idea above as well and see how it performs. In addition, I have been searching around to see if I can leverage luceneutil for benchmarking, but I can't seem to find a way to express combined fields queries like those in [https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.tasks] . I'm wondering if you may have any pointer for that as well? > CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support
[ https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429522#comment-17429522 ] Zach Chen commented on LUCENE-10061: Hi [~jpountz], I'm interested in working on this one, but have a question on its potential implementation and would like to get some advice on it. I found https://issues.apache.org/jira/browse/LUCENE-8312 during research for this, and thought the solution should be very similar here (using merged impacts to prune docs that are not competitive), except for maybe how impacts get merged. However, while I understand that for SynonymQuery, impacts can be merged effectively by summing term frequencies for each unique norm value as the impacts all come from the same field, I'm not sure how that could be done efficiently in the case of CombinedFieldsQuery. If I understand it correctly, in order to merge impacts from multiple fields for CombinedFieldsQuery, we may need to compute all the possible summation combinations of competitive \{freq, norm} across all fields, and find again the competitive ones among them. So for the case of 4 fields with a list of 4 competitive impacts each during impacts merge, in the worst case we may need to compute 4 * 4 * 4 * 4 = 256 combinations of merged impacts (\{field1FreqA + field2FreqB + field3FreqC + field4FreqD, field1NormA + field2NormB + field3NormC + field4NormD}), and then filter out the ones that are not competitive. This seems inefficient. I'm wondering if you may have any suggestion on this, or if using impacts for CombinedFieldsQuery pruning support is the right approach to begin with? 
> CombinedFieldsQuery needs dynamic pruning support > - > > Key: LUCENE-10061 > URL: https://issues.apache.org/jira/browse/LUCENE-10061 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, > forcing Lucene to collect all matches in order to figure the top-k hits.
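The combinatorial merge described in the comment above can be sketched as follows (hypothetical synthetic freq/norm pairs; real impacts would come from Lucene's per-field impacts enumerators):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the worst-case blow-up when naively merging per-field impacts:
// every choice of one (freq, norm) pair per field yields one candidate
// merged impact, so 4 fields with 4 competitive impacts each produce
// 4^4 = 256 candidates before non-competitive ones are filtered out.
public class ImpactMergeSketch {
  record Impact(long freq, long norm) {}

  // Cross-product merge: sums freqs and norms across one impact per field.
  static List<Impact> mergeAll(List<List<Impact>> perField) {
    List<Impact> merged = new ArrayList<>();
    merged.add(new Impact(0, 0));
    for (List<Impact> impacts : perField) {
      List<Impact> next = new ArrayList<>();
      for (Impact acc : merged) {
        for (Impact imp : impacts) {
          next.add(new Impact(acc.freq() + imp.freq(), acc.norm() + imp.norm()));
        }
      }
      merged = next;
    }
    return merged;
  }

  // Builds numFields synthetic impact lists and returns the merged count.
  static int mergedCount(int numFields, int impactsPerField) {
    List<List<Impact>> perField = new ArrayList<>();
    for (int f = 0; f < numFields; f++) {
      List<Impact> impacts = new ArrayList<>();
      for (int i = 1; i <= impactsPerField; i++) {
        impacts.add(new Impact(i, 10L * i));
      }
      perField.add(impacts);
    }
    return mergeAll(perField).size();
  }

  public static void main(String[] args) {
    // 4 fields x 4 impacts -> 4^4 = 256 merged candidates, as in the comment.
    if (mergedCount(4, 4) != 256) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The candidate count grows as impactsPerField^numFields, which is why the comment asks whether enumerating combinations is the right approach at all.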
[jira] [Commented] (LUCENE-10092) TestCheckIndex failure
[ https://issues.apache.org/jira/browse/LUCENE-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412111#comment-17412111 ] Zach Chen commented on LUCENE-10092: Thanks Michael! I appreciate it! > TestCheckIndex failure > -- > > Key: LUCENE-10092 > URL: https://issues.apache.org/jira/browse/LUCENE-10092 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > > We got the below test failure on Elastic's CI: > {noformat} > 10:07:08 org.apache.lucene.index.TestCheckIndex > test suite's output saved > to > /var/lib/jenkins/workspace/apache+lucene-solr+main/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestCheckIndex.txt, > copied below: > 10:07:08> java.lang.AssertionError: expected:<1> but was:<3> > 10:07:08> at > __randomizedtesting.SeedInfo.seed([60A890FDD81D376A:CD05D1B3AF48278E]:0) > 10:07:08> at org.junit.Assert.fail(Assert.java:89) > 10:07:08> at org.junit.Assert.failNotEquals(Assert.java:835) > 10:07:08> at org.junit.Assert.assertEquals(Assert.java:647) > 10:07:08> at org.junit.Assert.assertEquals(Assert.java:633) > 10:07:08> at > org.apache.lucene.index.TestCheckIndex.testCheckIndexAllValid(TestCheckIndex.java:132) > 10:07:08> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 10:07:08> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 10:07:08> at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 10:07:08> at > java.base/java.lang.reflect.Method.invoke(Method.java:566) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) > 10:07:08> at > 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) > 10:07:08> at > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44) > 10:07:08> at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > 10:07:08> at > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) > 10:07:08> at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) > 10:07:08> at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) > 10:07:08> at org.junit.rules.RunRules.evaluate(RunRules.java:20) > 10:07:08> at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > 10:07:08> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) > 10:07:08> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) > 10:07:08> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887) > 10:07:08> at > com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898) > 10:07:08> at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) > 10:07:08> at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > 10:07:08> at > org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38) > 10:07:08> at > 
com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > 10:07:08> at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > 10:07:08> at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > 10:07:08> at >
[jira] [Commented] (LUCENE-10092) TestCheckIndex failure
[ https://issues.apache.org/jira/browse/LUCENE-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412105#comment-17412105 ] Zach Chen commented on LUCENE-10092: Sorry for (another) TestCheckIndex failure! The fix above looks good to me. > TestCheckIndex failure > -- > > Key: LUCENE-10092 > URL: https://issues.apache.org/jira/browse/LUCENE-10092 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > > We got the below test failure on Elastic's CI:
[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent
[ https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411723#comment-17411723 ] Zach Chen commented on LUCENE-9662: --- {quote}I think we should backport these changes, in general. They are not breaking – the switch to {{CheckIndexException}} still subclasses {{RuntimeException}}. There will be some Lucene users who are nervous about upgrading to 9.0 too soon, but would be maybe eager to upgrade to last 8.x release (if that's 8.10 or 8.11 or beyond). I think it's bad if we slow down our rate of backporting because a major release is coming ... let's try to review your backport commit carefully to see if it looks OK? {quote} Makes sense. I think my nervousness was also partly due to this change, when backported, might be a bit too close to the 8.10 branch cut window, but it seems like it's ok for us to just backport and release these changes via 8.11 ? For now I've created a PR for backporting them against 8x here https://github.com/apache/lucene-solr/pull/2567. The merge conflict resolution turned out to be less involved than I expected, but there was a failing test and I suspected some unintended code was introduced during merge. I will dig in a bit more to confirm the cause there. > CheckIndex should be concurrent > --- > > Key: LUCENE-9662 > URL: https://issues.apache.org/jira/browse/LUCENE-9662 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Priority: Major > Time Spent: 19h > Remaining Estimate: 0h > > I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, > using a single core out of the 128 cores the box has. > It seems like this is an embarrassingly parallel problem, if the index has > multiple segments, and would finish much more quickly on concurrent hardware > if we did "thread per segment". 
> If we wanted to get even further concurrency, each part of the Lucene index that > is checked is also independent, so it could be "thread per segment per part". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
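The "thread per segment" idea in the quoted issue description can be sketched with a small, self-contained example. This is plain JDK code, not the actual CheckIndex implementation: `SegmentStatus` and `checkSegment` are hypothetical stand-ins for per-segment verification work, used only to show the fan-out/gather shape of the concurrency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for per-segment CheckIndex work: each segment can be
// verified independently, so the checks are embarrassingly parallel.
public class ParallelSegmentCheck {

  // Result of checking one segment.
  record SegmentStatus(String segmentName, boolean clean) {}

  // Simulated per-segment check; real CheckIndex would verify postings,
  // stored fields, term vectors, etc. for that one segment here.
  static SegmentStatus checkSegment(String segmentName) {
    return new SegmentStatus(segmentName, true);
  }

  // Fan the per-segment checks out over a fixed pool ("thread per segment",
  // capped at threadCount), then gather the results in order.
  static List<SegmentStatus> checkAll(List<String> segments, int threadCount) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    try {
      List<Future<SegmentStatus>> futures = new ArrayList<>();
      for (String segment : segments) {
        futures.add(pool.submit(() -> checkSegment(segment)));
      }
      List<SegmentStatus> statuses = new ArrayList<>();
      for (Future<SegmentStatus> f : futures) {
        statuses.add(f.get()); // propagates any per-segment failure
      }
      return statuses;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<SegmentStatus> statuses = checkAll(List.of("_0", "_1", "_2"), 2);
    System.out.println(statuses.size() + " segments checked");
  }
}
```

The "thread per segment per part" refinement mentioned above would simply submit one task per (segment, part) pair to the same pool instead of one task per segment.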
[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent
[ https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410031#comment-17410031 ] Zach Chen commented on LUCENE-9662: --- Hi [~mikemccand], I've tried to backport these changes to 8x earlier, but noticed that since changes in this PR touched many places in CheckIndex (the replacement of *RuntimeException* with *CheckIndexException* in particular), and some earlier commits that also touched on CheckIndex were not backported to 8x since they were intended for the 9.0 release, the backporting I was trying resulted in many merge conflicts. Although some of the conflicts were easy to resolve, I'm a bit concerned that I may introduce subtle bugs when resolving the others, since I may not be familiar with those parts of the code. What do you think? Would you recommend we still try to backport these changes to 8x? > CheckIndex should be concurrent > --- > > Key: LUCENE-9662 > URL: https://issues.apache.org/jira/browse/LUCENE-9662 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Priority: Major > Time Spent: 18h 10m > Remaining Estimate: 0h > > I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, > using a single core out of the 128 cores the box has. > It seems like this is an embarrassingly parallel problem, if the index has > multiple segments, and would finish much more quickly on concurrent hardware > if we did "thread per segment". > If we wanted to get even further concurrency, each part of the Lucene index that > is checked is also independent, so it could be "thread per segment per part". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent
[ https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409182#comment-17409182 ] Zach Chen commented on LUCENE-9662: --- {quote}Of course, this is on [ridiculously concurrent (256 cores with hyperthreading) hardware|https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html], but still it is only using the default 4 concurrent threads right? I'll add an annotation, and increase its concurrency some! {quote} Yes it's indeed capped at 4 threads by default, and the result was impressive with just a few more threads! On my not-so-fast 6-core MacBook Pro, I got about a 73% processing-time reduction when using '-threadCount 12' versus sequential. To increase its concurrency for the nightly benchmark, I assume a change can be made in [luceneutil|https://github.com/mikemccand/luceneutil/blob/0084387e001b426075eb828f43ad0c4e955e9280/src/python/nightlyBench.py#L695-L704] to pass in the flag? If so, I can open a PR for it as well! {quote}Hmm, it looks like we didn't fix the {{Usage: ...}} output to advertise the new {{-threadCount}} option. [~zacharymorn] could you open a quick followup PR? Thanks! {quote} Ah yes sorry for missing that. I've opened a PR for updating it https://github.com/apache/lucene/pull/281 > CheckIndex should be concurrent > --- > > Key: LUCENE-9662 > URL: https://issues.apache.org/jira/browse/LUCENE-9662 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Priority: Major > Time Spent: 18h 10m > Remaining Estimate: 0h > > I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, > using a single core out of the 128 cores the box has. 
> If we wanted to get even further concurrency, each part of the Lucene index that > is checked is also independent, so it could be "thread per segment per part". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409181#comment-17409181 ] Zach Chen commented on LUCENE-9959: --- Hi [~jpountz], sorry for the delay here, somehow I missed the update earlier. +1 for reverting the changes to unblock 9.0 release, I've created a PR here https://github.com/apache/lucene/pull/280 > Can we remove threadlocals of stored fields and term vectors > > > Key: LUCENE-9959 > URL: https://issues.apache.org/jira/browse/LUCENE-9959 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 8.5h > Remaining Estimate: 0h > > [~rmuir] suggested removing these threadlocals at > https://github.com/apache/lucene/pull/137#issuecomment-840111367. > These threadlocals are trappy if you manage many segments and threads within > the same JVM, or worse: non-fixed threadpools. The challenge is to keep the > API easy to use. > We could take advantage of 9.0 to change the stored fields API? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10076) Luke test assertion failure from TestOverviewImpl
Zach Chen created LUCENE-10076: -- Summary: Luke test assertion failure from TestOverviewImpl Key: LUCENE-10076 URL: https://issues.apache.org/jira/browse/LUCENE-10076 Project: Lucene - Core Issue Type: Task Components: luke Reporter: Zach Chen Found a test assertion error from main branch [head|https://github.com/apache/lucene/commit/e470535072edad13b994ded740bf60cd81f510ea] {code:java} org.apache.lucene.luke.models.overview.TestOverviewImpl > test suite's output saved to /Users/xichen/IdeaProjects/lucene/lucene/luke/build/test-results/test/outputs/OUTPUT-org.apache.lucene.luke.models.overview.TestOverviewImpl.txt, copied below: 2> ERROR StatusLogger Could not reconfigure JMX 2> java.security.AccessControlException: access denied ("javax.management.MBeanServerPermission" "createMBeanServer") 2> at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) 2> at java.base/java.security.AccessController.checkPermission(AccessController.java:897) 2> at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322) 2> at java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479) 2> at org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140) 2> at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:629) 2> at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:691) 2> at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:708) 2> at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263) 2> at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153) 2> at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45) 2> at org.apache.logging.log4j.LogManager.getContext(LogManager.java:194) 2> at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:602) 2> at 
org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:68) 2> at org.apache.lucene.luke.models.util.IndexUtils.(IndexUtils.java:59) 2> at org.apache.lucene.luke.models.LukeModel.(LukeModel.java:60) 2> at org.apache.lucene.luke.models.overview.OverviewImpl.(OverviewImpl.java:49) 2> at org.apache.lucene.luke.models.overview.TestOverviewImpl.testIsOptimized(TestOverviewImpl.java:74) 2> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2> at java.base/java.lang.reflect.Method.invoke(Method.java:566) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992) 2> at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44) 2> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) 2> at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) 2> at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) 2> at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) 2> at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) 2> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370) 2> at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819) 2> at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887) 2> at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898) 2> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) 2> at
[jira] [Created] (LUCENE-10074) Remove unneeded default value assignment
Zach Chen created LUCENE-10074: -- Summary: Remove unneeded default value assignment Key: LUCENE-10074 URL: https://issues.apache.org/jira/browse/LUCENE-10074 Project: Lucene - Core Issue Type: Task Reporter: Zach Chen This is a spin-off issue from discussion here [https://github.com/apache/lucene/pull/128#discussion_r695669643], where we would like to see if there's any automatic checking mechanism (ecj?) that can be enabled to detect and warn about unneeded default value assignments in future changes, as well as in the existing code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
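For context, a sketch of the pattern this issue wants flagged. The class and field names below are made up for illustration; the point is that explicit field initializers like these merely restate the defaults the JVM already guarantees (0, false, null), so a checker could flag them as no-ops.

```java
// Hypothetical class showing redundant default-value assignments and their
// equivalent without the no-op initializers.
public class Defaults {
  private int count = 0;          // redundant: int fields already start at 0
  private boolean closed = false; // redundant: boolean fields start at false
  private String name = null;     // redundant: reference fields start at null

  // Equivalent fields, relying on the JVM's guaranteed defaults:
  private int size;
  private boolean done;
  private String label;

  public int count() { return count; }
  public int size() { return size; }
  public boolean closed() { return closed; }
  public boolean done() { return done; }
  public String name() { return name; }
  public String label() { return label; }

  public static void main(String[] args) {
    Defaults d = new Defaults();
    // Both styles observe identical initial state.
    System.out.println(d.count() == d.size()
        && d.closed() == d.done()
        && d.name() == d.label());
  }
}
```

(Local variables are different: the JVM does not default-initialize them, so an explicit assignment there is never redundant in the same way.)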
[jira] [Created] (LUCENE-10071) Review and refactor synchronization handling between MockDirectoryWrapper and CheckIndex
Zach Chen created LUCENE-10071: -- Summary: Review and refactor synchronization handling between MockDirectoryWrapper and CheckIndex Key: LUCENE-10071 URL: https://issues.apache.org/jira/browse/LUCENE-10071 Project: Lucene - Core Issue Type: Task Components: core/index, modules/test-framework Reporter: Zach Chen This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/128], as we noticed there's a subtle way to cause deadlock in tests (or maybe even in production code if similar logic is implemented) [https://github.com/apache/lucene/pull/128#discussion_r642639399]. This issue is to review how synchronization can be improved between these classes to make it less deadlock-prone, or more explicit when a locking arrangement needs to be made. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17399305#comment-17399305 ] Zach Chen commented on LUCENE-10002: {quote}Nice [~zacharymorn]! Quite a large change for sure! I took a look at the DrillSideways changes and they appear correct to me at first glance, but I'll see if I can spend more time going through the whole PR in the next couple of days. In the meantime, I went ahead and spun off LUCENE-10050 to track making a similar API change to DrillSideways. {quote} Sounds great, thanks Greg! > Remove IndexSearcher#search(Query,Collector) in favor of > IndexSearcher#search(Query,CollectorManager) > - > > Key: LUCENE-10002 > URL: https://issues.apache.org/jira/browse/LUCENE-10002 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > It's a bit trappy that you can create an IndexSearcher with an executor, but > that it would always search on the caller thread when calling > {{IndexSearcher#search(Query,Collector)}}. > Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to > {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory > methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to > return a {{CollectorManager}} instead of a {{Collector}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397827#comment-17397827 ] Zach Chen commented on LUCENE-10002: Hi [~jpountz] [~gsmiller], I have created a PR for this to deprecate the collector API in favor of the collector manager API, as well as some initial refactoring to some tests and the parts in DrillSideways that use TopScoreDocCollector & TopFieldCollector to use the latter API. I plan to submit more PRs afterward for other areas in the codebase. Please note that I did first try to remove the collector API entirely, but that ended up resulting in way too many changes than I'm comfortable with in a single PR, and I also feel this API is such a commonly used one that users may prefer a more gradual deprecation / transition period. Hence I reverted my previous effort and adopted a phased approach. > Remove IndexSearcher#search(Query,Collector) in favor of > IndexSearcher#search(Query,CollectorManager) > - > > Key: LUCENE-10002 > URL: https://issues.apache.org/jira/browse/LUCENE-10002 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > It's a bit trappy that you can create an IndexSearcher with an executor, but > that it would always search on the caller thread when calling > {{IndexSearcher#search(Query,Collector)}}. > Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to > {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory > methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to > return a {{CollectorManager}} instead of a {{Collector}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
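For readers unfamiliar with the API being migrated to: a CollectorManager separates per-slice collection from a final reduce step, which is what lets IndexSearcher actually use its executor instead of always collecting on the caller thread. A minimal, self-contained sketch of that shape follows — the interface and classes here are simplified stand-ins modeled on Lucene's org.apache.lucene.search.CollectorManager, not the real API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Lucene's CollectorManager idea: the searcher asks
// for one fresh collector per index slice (so collection needs no cross-thread
// sharing), then reduces the per-slice results into a single answer.
interface SimpleCollectorManager<C, R> {
  C newCollector();             // fresh, unshared collector for one slice
  R reduce(List<C> collectors); // merge per-slice state after all slices finish
}

// Toy collector that just counts the hits it sees in its slice.
class CountingCollector {
  int count;
  void collect(int doc) { count++; }
}

public class CollectorManagerSketch {
  static final SimpleCollectorManager<CountingCollector, Integer> COUNT_MANAGER =
      new SimpleCollectorManager<>() {
        @Override public CountingCollector newCollector() { return new CountingCollector(); }
        @Override public Integer reduce(List<CountingCollector> cs) {
          int total = 0;
          for (CountingCollector c : cs) total += c.count;
          return total;
        }
      };

  // Collect each slice with its own collector (in Lucene this is what lets
  // IndexSearcher hand slices to its executor), then reduce exactly once.
  static int countHits(List<int[]> docIdSlices) {
    List<CountingCollector> collectors = new ArrayList<>();
    for (int[] slice : docIdSlices) {
      CountingCollector c = COUNT_MANAGER.newCollector();
      for (int doc : slice) c.collect(doc);
      collectors.add(c);
    }
    return COUNT_MANAGER.reduce(collectors);
  }

  public static void main(String[] args) {
    System.out.println(countHits(List.of(new int[] {0, 5}, new int[] {7})));
  }
}
```

This is why the single shared-Collector overload is trappy with an executor: one mutable Collector cannot safely be handed to several slices at once, while a manager can mint an independent collector per slice and merge at the end.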
[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386614#comment-17386614 ] Zach Chen commented on LUCENE-9959: --- {quote}I had put it on hold to see whether we should explore changing the API like you did rather than still caching stored fields readers per thread but removing as much state as possible like my PR does. {quote} I see. Thanks for the clarification! {quote}If the new API proves controversial, I'd be open to an alternative that would consist of keeping the previous API and pulling a new TermVectorsReader (resp. StoredFieldsReader) internally every time that term vectors (resp. stored fields) are requested instead of the previous approach that consisted of caching instances in a threadlocal. {quote} +1. Do we want to try this different approach for stored field, and see how it compares with the new API for term vector (which may create inconsistency between APIs for the two, but hopefully temporarily) ? > Can we remove threadlocals of stored fields and term vectors > > > Key: LUCENE-9959 > URL: https://issues.apache.org/jira/browse/LUCENE-9959 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 8h 20m > Remaining Estimate: 0h > > [~rmuir] suggested removing these threadlocals at > https://github.com/apache/lucene/pull/137#issuecomment-840111367. > These threadlocals are trappy if you manage many segments and threads within > the same JVM, or worse: non-fixed threadpools. The challenge is to keep the > API easy to use. > We could take advantage of 9.0 to change the stored fields API? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
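The trade-off discussed in this thread can be illustrated with a self-contained sketch. The `Reader` type below is a hypothetical stand-in for a stored-fields or term-vectors reader, not Lucene code: caching one instance per thread in a ThreadLocal leaves a live reader behind for every thread that ever touched the index (trappy with many threads, and worse with non-fixed pools that keep spawning new threads), while the alternative described above — pulling a fresh reader per request — retains nothing.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-segment reader; counts live instances so the leak is visible.
public class ThreadLocalTrapSketch {
  static final AtomicInteger LIVE_READERS = new AtomicInteger();

  static class Reader implements Closeable {
    Reader() { LIVE_READERS.incrementAndGet(); }
    @Override public void close() { LIVE_READERS.decrementAndGet(); }
  }

  // Threadlocal style: one cached reader per accessing thread. Instances pile
  // up until each thread dies and is garbage collected, which a non-fixed
  // threadpool may defer indefinitely -- and close() is never called.
  static final ThreadLocal<Reader> CACHED = ThreadLocal.withInitial(Reader::new);

  // Proposed style: pull a new reader for each request and close it when done;
  // returns the live-reader count afterwards (unchanged by these requests).
  static int freshReaderPerRequest(int requests) {
    for (int i = 0; i < requests; i++) {
      try (Reader r = new Reader()) {
        // ... read stored fields / term vectors here ...
      }
    }
    return LIVE_READERS.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulate 4 short-lived worker threads, each caching a reader.
    Thread[] workers = new Thread[4];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(() -> CACHED.get());
      workers[i].start();
    }
    for (Thread w : workers) w.join();
    // All 4 cached readers remain open even though the threads are done.
    System.out.println("live after threadlocal use: " + LIVE_READERS.get());
    int before = LIVE_READERS.get();
    freshReaderPerRequest(100);
    System.out.println("retained by per-request use: " + (LIVE_READERS.get() - before));
  }
}
```

The per-request style is what keeps the API easy to use while removing the hidden per-thread state; the cost is constructing a reader per access instead of amortizing it per thread.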
[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379654#comment-17379654 ] Zach Chen commented on LUCENE-9959: --- Hi [~jpountz], I've merged the PR for term vectors thread local removal, and plan to take on the stored fields one next. I noticed your original PR [https://github.com/apache/lucene/pull/137] that led to this Jira and also touched on stored fields has not been merged yet, do you plan to merge it any time soon, or will you have more changes for it? > Can we remove threadlocals of stored fields and term vectors > > > Key: LUCENE-9959 > URL: https://issues.apache.org/jira/browse/LUCENE-9959 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 7.5h > Remaining Estimate: 0h > > [~rmuir] suggested removing these threadlocals at > https://github.com/apache/lucene/pull/137#issuecomment-840111367. > These threadlocals are trappy if you manage many segments and threads within > the same JVM, or worse: non-fixed threadpools. The challenge is to keep the > API easy to use. > We could take advantage of 9.0 to change the stored fields API? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10018) Remove Fields from TermVector reader related usage
[ https://issues.apache.org/jira/browse/LUCENE-10018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379647#comment-17379647 ] Zach Chen commented on LUCENE-10018: Hi [~dsmiley], just to provide a quick update, I've merged the TermVectors PR for LUCENE-9959. > Remove Fields from TermVector reader related usage > -- > > Key: LUCENE-10018 > URL: https://issues.apache.org/jira/browse/LUCENE-10018 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs, core/index >Reporter: Zach Chen >Assignee: David Smiley >Priority: Minor > > This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for > Fields class deprecation / removal in TermVector reader usage. As Fields > class is generally meant as internal class reserved for posting index, we > would like to have some dedicated TermVector abstractions and APIs instead. > The relevant discussions are available here: > * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076] > * [https://github.com/apache/lucene/pull/180#issuecomment-863254651] > * [https://github.com/apache/lucene/pull/180#issuecomment-863262562] > * [https://github.com/apache/lucene/pull/180#issuecomment-863775298] > * [https://github.com/apache/lucene/pull/180#issuecomment-864720190] > * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901] > * [https://github.com/apache/lucene/pull/180#issuecomment-871155896] > * [https://github.com/apache/lucene/pull/180#issuecomment-871922823] > > One potential API design for this can be found here > [https://github.com/apache/lucene/pull/180#issuecomment-871155896] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10018) Remove Fields from TermVector reader related usage
[ https://issues.apache.org/jira/browse/LUCENE-10018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373895#comment-17373895 ] Zach Chen commented on LUCENE-10018: Sounds good, thanks David! > Remove Fields from TermVector reader related usage > -- > > Key: LUCENE-10018 > URL: https://issues.apache.org/jira/browse/LUCENE-10018 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs, core/index >Reporter: Zach Chen >Assignee: David Smiley >Priority: Minor > > This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for > Fields class deprecation / removal in TermVector reader usage. As Fields > class is generally meant as internal class reserved for posting index, we > would like to have some dedicated TermVector abstractions and APIs instead. > The relevant discussions are available here: > * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076] > * [https://github.com/apache/lucene/pull/180#issuecomment-863254651] > * [https://github.com/apache/lucene/pull/180#issuecomment-863262562] > * [https://github.com/apache/lucene/pull/180#issuecomment-863775298] > * [https://github.com/apache/lucene/pull/180#issuecomment-864720190] > * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901] > * [https://github.com/apache/lucene/pull/180#issuecomment-871155896] > * [https://github.com/apache/lucene/pull/180#issuecomment-871922823] > > One potential API design for this can be found here > [https://github.com/apache/lucene/pull/180#issuecomment-871155896] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10018) Remove Fields from TermVector reader related usage
Zach Chen created LUCENE-10018: -- Summary: Remove Fields from TermVector reader related usage Key: LUCENE-10018 URL: https://issues.apache.org/jira/browse/LUCENE-10018 Project: Lucene - Core Issue Type: Task Components: core/codecs, core/index Reporter: Zach Chen This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for Fields class deprecation / removal in TermVector reader usage. As Fields class is generally meant as internal class reserved for posting index, we would like to have some dedicated TermVector abstractions and APIs instead. The relevant discussions are available here: * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076] * [https://github.com/apache/lucene/pull/180#issuecomment-863254651] * [https://github.com/apache/lucene/pull/180#issuecomment-863262562] * [https://github.com/apache/lucene/pull/180#issuecomment-863775298] * [https://github.com/apache/lucene/pull/180#issuecomment-864720190] * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901] * [https://github.com/apache/lucene/pull/180#issuecomment-871155896] * [https://github.com/apache/lucene/pull/180#issuecomment-871922823] One potential API design for this can be found here [https://github.com/apache/lucene/pull/180#issuecomment-871155896] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors
[ https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362392#comment-17362392 ] Zach Chen commented on LUCENE-9959: --- I took a look at this issue and the idea suggested by Robert (and https://issues.apache.org/jira/browse/LUCENE-1195 which seems to have introduced the thread locals originally), and gave it a try with this WIP PR [https://github.com/apache/lucene/pull/180] (with commit [https://github.com/apache/lucene/commit/5062e4d69938f104b461004022e19c10a65960a5] that has the most meaningful changes). Is the implementation what you are expecting? I feel since `IndexReader` already has APIs _getTermVectors_ and _getTermVector_, it might not be too bad to add a new API alongside them, and gradually phase out the use of the existing two (at least for term vectors)? In addition, I'm wondering a bit why other readers from SegmentReader don't need to use the same thread local approach for concurrency / caching (namely, the PointsReader, NormsProducer, DocValuesProducer, VectorReader, FieldsProducer in SegmentReader). I'm guessing these readers' operations might be much less costly compared with the term vector and stored field readers, so their operations are made thread-safe internally? I'll dig around to understand more about the context there... > Can we remove threadlocals of stored fields and term vectors > > > Key: LUCENE-9959 > URL: https://issues.apache.org/jira/browse/LUCENE-9959 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > [~rmuir] suggested removing these threadlocals at > https://github.com/apache/lucene/pull/137#issuecomment-840111367. > These threadlocals are trappy if you manage many segments and threads within > the same JVM, or worse: non-fixed threadpools. The challenge is to keep the > API easy to use. > We could take advantage of 9.0 to change the stored fields API? 
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360555#comment-17360555 ] Zach Chen commented on LUCENE-9976: --- {quote}[~zacharymorn] I believe that the same problem exists on branch_8x and branch_8_9, let's backport your fix? {quote} Ah yes! I've opened two new PRs for backporting: # branch_8x: [https://github.com/apache/lucene-solr/pull/2512] (with a small comment) # branch_8_9: [https://github.com/apache/lucene-solr/pull/2511] > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Assignee: Zach Chen >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen resolved LUCENE-9976. --- Resolution: Fixed > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Assignee: Zach Chen >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen reassigned LUCENE-9976: - Assignee: Zach Chen > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Assignee: Zach Chen >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359001#comment-17359001 ] Zach Chen commented on LUCENE-9976: --- No worry [~jpountz], and hope you had a great vacation! I'm looking forward to mine coming up in a few weeks! :D {quote}It's a bit worrying that this bug only got caught by TestExpressionSorts, I wonder why the test cases we have in TestWANDScorer didn't catch it. {quote} That's a great call. I played around with the tests there a bit and came up with one new test that would fail around 80% of the time (not sure if there's clause ordering or other randomness kicked in) without the fix. From that, I think the _ConstantScoreQuery_ used heavily in those tests might have masked the issue a bit? > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358027#comment-17358027 ] Zach Chen commented on LUCENE-9976: --- For the time being, I've gone ahead and created a PR to update the assertion https://github.com/apache/lucene/pull/171 > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357046#comment-17357046 ] Zach Chen edited comment on LUCENE-9976 at 6/4/21, 4:13 AM: {quote}I'm using mac, and trying with main branch head commit a6cf46dad {quote} Okay I should have also tried to pull the latest main branch before running the tests, and after that I'm also able to consistently reproduce this failure. Sorry for the confusion earlier! The failure happened at this line: {code:java} assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore{code} I reset the commits a few times to see where it started to fail, and believed it started from the performance regression fix commit 820e63d2ddf235c from https://issues.apache.org/jira/browse/LUCENE-9958 . The change was {code:java} diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java index f33af6b8ee8..f5bab49fb71 100644 --- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java +++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java @@ -548,7 +548,7 @@ final class WANDScorer extends Scorer { /** Insert an entry in 'tail' and evict the least-costly scorer if full. */ private DisiWrapper insertTailWithOverFlow(DisiWrapper s) { -if (tailMaxScore + s.maxScore < minCompetitiveScore) { +if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) { // we have free room for this new entry addTail(s); tailMaxScore += s.maxScore; {code} I think from this logic, _tailMaxScore >= minCompetitiveScore_ is intended to happen now, since the block may be entered from condition _tailSize + 1 < minShouldMatch._ So the assertion logic should be updated to the following (tested locally and passed the test): {code:java} assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore || tailSize < minShouldMatch{code} I can raise a quick PR if that looks good? 
[~jpountz] > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357046#comment-17357046 ] Zach Chen commented on LUCENE-9976: --- {quote}I'm using mac, and trying with main branch head commit a6cf46dad {quote} Okay I should have also tried to pull the latest main branch before running the tests, and after that I'm also able to consistently reproduce this failure. Sorry for the confusion earlier! The failure happened at this line: {code:java} assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore{code} I reset the commits a few times to see where it started to fail, and believed it started from the performance regression fix commit 820e63d2ddf235c from https://issues.apache.org/jira/browse/LUCENE-9958 . The change was {code:java} diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java index f33af6b8ee8..f5bab49fb71 100644 --- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java +++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java @@ -548,7 +548,7 @@ final class WANDScorer extends Scorer { /** Insert an entry in 'tail' and evict the least-costly scorer if full. */ private DisiWrapper insertTailWithOverFlow(DisiWrapper s) { -if (tailMaxScore + s.maxScore < minCompetitiveScore) { +if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) { // we have free room for this new entry addTail(s); tailMaxScore += s.maxScore; {code} I think from this logic, _tailMaxScore >= minCompetitiveScore_ is intended to happen now, since the block may be entered from condition _tailSize + 1 < minShouldMatch._ So the assertion logic should be updated to the following (tested locally and passed the test): {code:java} assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore || tailSize < minShouldMatch{code} I can raise a quick PR if that looks good? 
[~jpountz] > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
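To see why the assertion needs the extra disjunct, the insertTailWithOverFlow condition and the corrected invariant can be modeled in isolation. The class below is a simplified stand-in (plain floats, no scorers, no eviction path), not WANDScorer itself: after the LUCENE-9958 change, an entry can be added to the tail purely because tailSize + 1 < minShouldMatch, so tailMaxScore may legitimately reach or exceed minCompetitiveScore.

```java
// Standalone model of the tail bookkeeping discussed above. The
// consistent() check mirrors the corrected assertion from the PR:
// minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore
//                         || tailSize < minShouldMatch
public class TailInvariantSketch {
  float minCompetitiveScore;
  float tailMaxScore;
  int tailSize;
  int minShouldMatch;

  TailInvariantSketch(float minCompetitiveScore, int minShouldMatch) {
    this.minCompetitiveScore = minCompetitiveScore;
    this.minShouldMatch = minShouldMatch;
  }

  // Mirrors the updated insertTailWithOverFlow condition.
  boolean tryAddTail(float maxScore) {
    if (tailMaxScore + maxScore < minCompetitiveScore
        || tailSize + 1 < minShouldMatch) {
      tailSize++;
      tailMaxScore += maxScore;
      return true;
    }
    return false; // the real scorer would evict the least-costly scorer here
  }

  boolean consistent() {
    return minCompetitiveScore == 0
        || tailMaxScore < minCompetitiveScore
        || tailSize < minShouldMatch;
  }

  public static void main(String[] args) {
    // minShouldMatch = 3 forces entries into the tail even after their
    // summed max scores pass minCompetitiveScore.
    TailInvariantSketch s = new TailInvariantSketch(1.0f, 3);
    s.tryAddTail(0.8f);
    s.tryAddTail(0.8f); // accepted only via tailSize + 1 < minShouldMatch
    System.out.println(s.tailMaxScore >= s.minCompetitiveScore); // true
    System.out.println(s.consistent()); // true: tailSize < minShouldMatch
  }
}
```

With the original two-clause assertion, the state above (tailMaxScore = 1.6 >= minCompetitiveScore = 1.0) would have tripped ensureConsistent even though it is exactly the state the minShouldMatch branch is designed to produce.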
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356092#comment-17356092 ] Zach Chen commented on LUCENE-9976: --- Hi Dawid and Michael! I tried again with the command line above with 1000 iterations, but it still didn't reproduce for me for some reasons. {code:java} xichen@Xis-MacBook-Pro lucene % ./gradlew test -Ptests.iters=1000 --tests TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be reused, use --status for details > Task :randomizationInfo Running tests with randomization seed: tests.seed=FF571CE915A0955 > Task :lucene:expressions:test :lucene:expressions:test (SUCCESS): 1000 test(s) The slowest tests (exceeding 500 ms) during this run: 6.62s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:159F353910AC3564]} (:lucene:expressions) 6.56s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:993EFB36FB8A23F3]} (:lucene:expressions) 6.22s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:C9E931CFB8A6C82E]} (:lucene:expressions) 6.21s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:2854FA7396FAF62F]} (:lucene:expressions) 5.84s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:5515E173B4FD16BA]} (:lucene:expressions) 5.65s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:A8C1890BB457C90F]} (:lucene:expressions) 5.62s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:A44F7F3F8B79B2DB]} (:lucene:expressions) 5.57s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:328FA3364F99C839]} (:lucene:expressions) 5.56s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:9D8BCE5B3371B6E2]} (:lucene:expressions) 5.55s TestExpressionSorts.testQueries {seed=[FF571CE915A0955:2E635F6265446CED]} (:lucene:expressions) The slowest suites (exceeding 1s) during this run: 2662.21s 
TestExpressionSorts (:lucene:expressions) BUILD SUCCESSFUL in 45m 1s{code} > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent
[ https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355476#comment-17355476 ] Zach Chen commented on LUCENE-9976: --- Hmm this test actually passed for me: {code:java} xichen@Xis-MacBook-Pro lucene % ./gradlew test --tests TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be reused, use --status for details > Task :randomizationInfo Running tests with randomization seed: tests.seed=FF571CE915A0955 BUILD SUCCESSFUL in 37s {code} I'm using mac, and trying with main branch head commit a6cf46dad > WANDScorer assertion error in ensureConsistent > -- > > Key: LUCENE-9976 > URL: https://issues.apache.org/jira/browse/LUCENE-9976 > Project: Lucene - Core > Issue Type: Bug >Reporter: Dawid Weiss >Priority: Major > > Build fails and is reproducible: > https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console > {code} > ./gradlew test --tests TestExpressionSorts.testQueries > -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true > -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9984) Make CheckIndex doChecksumsOnly / -fast as default
Zach Chen created LUCENE-9984: - Summary: Make CheckIndex doChecksumsOnly / -fast as default Key: LUCENE-9984 URL: https://issues.apache.org/jira/browse/LUCENE-9984 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 9.0 Reporter: Zach Chen Assignee: Zach Chen This issue is a spin-off from discussion in https://github.com/apache/lucene/pull/128 Currently _CheckIndex_ defaults to checking both checksum as well as content inside each segment files for correctness, and requires _-fast_ flag to be explicitly passed in to do checksum only. However, this default setting was there due to lack of checksum feature historically, and is slow for most end-users nowadays as they probably only care about their indices being intact (from random bit flipping for example). This issue is to change the default settings for CheckIndex so that they are more appropriate for end-users. One proposal from @rmuir is the following: # Make {{-fast}} the new default. # The previous {{-slow}} could be moved to {{-slower}} # The current behavior (checksum + segment file content - slow check) could be activated by {{-slow}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
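The three-step proposal above could be modeled roughly as follows; the enum and the tiny parser below are hypothetical illustrations, not CheckIndex's actual option handling. No flag means checksums only (the old -fast), -slow restores today's default checksum-plus-content check, and -slower takes over the old exhaustive -slow behavior.

```java
// Hypothetical sketch of the proposed CheckIndex flag remapping.
public class CheckLevelSketch {
  enum Level { CHECKSUMS_ONLY, CHECKSUMS_AND_CONTENT, EXHAUSTIVE }

  static Level parse(String... args) {
    for (String a : args) {
      if (a.equals("-slower")) return Level.EXHAUSTIVE; // old -slow behavior
    }
    for (String a : args) {
      if (a.equals("-slow")) return Level.CHECKSUMS_AND_CONTENT; // old default
    }
    return Level.CHECKSUMS_ONLY; // proposed new default, i.e. the old -fast
  }

  public static void main(String[] args) {
    System.out.println(parse());          // CHECKSUMS_ONLY
    System.out.println(parse("-slow"));   // CHECKSUMS_AND_CONTENT
    System.out.println(parse("-slower")); // EXHAUSTIVE
  }
}
```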
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349941#comment-17349941 ] Zach Chen edited comment on LUCENE-9335 at 5/23/21, 7:13 AM: - Hi [~jpountz], I've tried out a few ideas in the last few days and they gave some improvements (but also made it worse for OrMedMedMedMedMed). However, it was still not performing as well as BMW for the MSMARCO passages dataset. The ideas I tried include: # Move scorer from essential to non-essential list when minCompetitiveScore increases (mentioned in the paper) ## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288] # Use score.score instead of maxScore for candidate doc evaluation against minCompetitiveScore to prune more docs (reverting your previous optimization) ## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288] # Reduce maxScore contribution from non-essential list during candidate doc evaluation for scorer that cannot match ## commit: [https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d] # Use the maximum of each scorer's upTo for maxScore boundary instead of minimum (opposed to what the paper suggested) ## commit: [https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188] ## This causes OrMedMedMedMedMed to degrade by 40% Collectively, these gave a 70~90% performance boost to OrHighHigh, 60~150% for OrHighMed, and smaller improvements for AndHighOrMedMed, but at the expense of OrMedMedMedMedMed performance (by -40% with #4 changes). 
For MSMARCO passages dataset, they now give the following results (modified slightly from your version to show more percentile, and to add comma to separate digits for readability): *BMW Scorer* {code:java} AVG: 23,252,992.375 P25: 6,298,463 P50: 13,007,148 P75: 26,868,222 P90: 56,683,505 P95: 84,333,397 P99: 154,185,321 Collected AVG: 8,168.523 Collected P25: 1,548 Collected P50: 2,259 Collected P75: 3,735 Collected P90: 6,228 Collected P95: 13,063 Collected P99: 221,894{code} *BMM Scorer* {code:java} AVG: 41,970,641.638 P25: 8,654,210 P50: 21,553,366 P75: 51,519,172 P90: 109,510,378 P95: 154,534,017 P99: 266,141,446 Collected AVG: 16,810.392 Collected P25: 2,769 Collected P50: 7,159 Collected P75: 20,077 Collected P90: 43,031 Collected P95: 69,984 Collected P99: 135,253 {code} I've also attached "JFR result for BMM scorer with optimizations May 22" for the BMM scorer profiling result from the latest changes. Overall, it seems that the larger number of docs collected by BMM is becoming a bottleneck for performance, as around 50% of the computation was spent by SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score to compute score for candidate doc (around 34% of the computation was spent to find the next doc in BlockMaxMaxscoreScorer#nextDoc). If there's a way to prune more docs faster, it should be able to improve BMM further.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349941#comment-17349941 ] Zach Chen commented on LUCENE-9335: --- Hi [~jpountz], I've tried out a few ideas in the last few days and they gave some improvements (but also made things worse for OrMedMedMedMedMed). However, it still did not perform as well as BMW on the MSMARCO passages dataset. The ideas I tried include: # Move a scorer from the essential to the non-essential list when minCompetitiveScore increases (mentioned in the paper) ## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288] # Use scorer.score() instead of maxScore when evaluating a candidate doc against minCompetitiveScore, to prune more docs (reverting your previous optimization) ## commit: [https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288] # Reduce the maxScore contribution from the non-essential list during candidate doc evaluation for scorers that cannot match ## commit: [https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d] # Use the maximum of each scorer's upTo for the maxScore boundary instead of the minimum (as opposed to what the paper suggested) ## commit: [https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188] ## This causes OrMedMedMedMedMed to degrade by 40% Collectively, these gave a 70~90% performance boost to OrHighHigh, 60~150% to OrHighMed, and smaller improvements to AndHighOrMedMed, but at the expense of OrMedMedMedMedMed performance (by -40% with the #4 changes).
For the MSMARCO passages dataset, they now give the following results (modified slightly from your version to show more percentiles, and to add commas to separate digits for readability):

*BMW Scorer*
{code:java}
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894
{code}
*BMM Scorer*
{code:java}
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
{code}
I've also attached "JFR result for BMM scorer with optimizations May 22" for the BMM scorer profiling result from the latest changes. Overall, it seems that the larger number of docs collected by BMM is becoming a performance bottleneck: around 50% of the computation was spent in SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing scores for candidate docs (and around 34% was spent finding the next doc in BlockMaxMaxscoreScorer#nextDoc). If there's a way to prune more docs faster, it should improve BMM further. > Add a bulk scorer for disjunctions that does dynamic pruning > > > Key: LUCENE-9335 > URL: https://issues.apache.org/jira/browse/LUCENE-9335 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: JFR result for BMM scorer with optimizations May 22.png, > MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, > wikimedium.10M.nostopwords.tasks.5OrMeds > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Lucene often gets benchmarked against other engines, e.g.
against Tantivy and > PISA at [https://tantivy-search.github.io/bench/] or against research > prototypes in Table 1 of > [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf]. > Given that top-level disjunctions of term queries are commonly used for > benchmarking, it would be nice to optimize this case a bit more, I suspect > that we could make fewer per-document decisions by implementing a BulkScorer > instead of a Scorer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
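The AVG and P25–P99 figures reported above are plain order statistics over per-query latency samples. A minimal helper showing how such a summary can be computed — a hypothetical sketch for illustration, not the attached benchmark's actual code:

```java
import java.util.Arrays;

// Hypothetical helper mirroring the AVG/P25..P99 latency summaries above.
public class LatencyStats {
    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(long[] samples, int pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(0, rank - 1)];
    }

    static double average(long[] samples) {
        double sum = 0;
        for (long s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        long[] nanos = {6_298_463, 13_007_148, 26_868_222, 56_683_505};
        System.out.printf("AVG: %,.3f%n", average(nanos));
        System.out.printf("P50: %,d%n", percentile(nanos, 50));
    }
}
```
Note that with only a handful of samples per percentile bucket, the tail figures (P95/P99) are dominated by a few slow queries, which is consistent with the Collected P99 outlier in the BMW numbers above.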
[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-9335: -- Attachment: JFR result for BMM scorer with optimizations May 22.png
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348046#comment-17348046 ] Zach Chen commented on LUCENE-9335: --- {quote}Actually this matches my expectation. BMM and BMW differ in that BMM only makes a decision about which scorers lead iteration once per block, while BMW needs to make decisions on every document. So BMM collects more documents than BMW but BMW takes the risk that trying to be too smart makes things slower than a simpler approach. {quote} Ok I also took a further look at the TopDocsCollector code, and confirmed that I had an incorrect understanding of "collect" and "hit count" here earlier. This (and Michael's earlier response) totally makes sense now! {quote}Yes. You can download the "Collection" and "Queries" files from [https://microsoft.github.io/msmarco/#ranking] (make sure to accept terms at the top first so that download links are active). {quote} Thanks! I was able to download them. Will explore a bit more to see how they can be improved further.
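The quote above captures the key structural difference: BMM re-partitions its scorers once per block, comparing summed max scores against the collector's minCompetitiveScore. A simplified sketch of that partition step, with illustrative names and types rather than Lucene's actual classes:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the MAXSCORE-style partition BMM performs once per
// block: sort scorers by max score, then grow the non-essential prefix as
// long as its summed max scores stay below minCompetitiveScore. A document
// matching only non-essential scorers can never be competitive.
public class MaxscorePartition {
    record ScorerInfo(String term, float maxScore) {}

    // Returns the index of the first essential scorer after sorting;
    // scorers[0..result) form the non-essential list.
    static int partition(List<ScorerInfo> scorers, float minCompetitiveScore) {
        scorers.sort(Comparator.comparingDouble(ScorerInfo::maxScore));
        double sum = 0;
        int firstEssential = 0;
        for (ScorerInfo s : scorers) {
            sum += s.maxScore();
            if (sum >= minCompetitiveScore) break; // this scorer stays essential
            firstEssential++;
        }
        return firstEssential;
    }
}
```
Because this runs once per block rather than once per document, BMM trades extra collected documents for fewer per-document decisions — exactly the trade-off described in the quoted comment.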
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347362#comment-17347362 ] Zach Chen edited comment on LUCENE-9335 at 5/19/21, 7:25 AM: - {quote}The speedup for some of the slower queries looks great. I know Fuzzy1 and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your change makes them faster? {quote} Ah not sure why I didn't think of running them through BMM earlier! I just gave them a run, and got the following results:

*BMM Scorer*
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy1      30.46   (24.7%)             17.63   (11.6%)              -42.1% ( -62% - -7%)    0.000
Fuzzy2      21.61   (16.4%)             16.28   (12.0%)              -24.7% ( -45% - 4%)     0.000
PKLookup    216.72  (4.1%)              215.63  (3.0%)               -0.5% ( -7% - 6%)       0.654
{code}
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy1      30.58   (9.1%)              22.12   (6.4%)               -27.7% ( -39% - -13%)   0.000
Fuzzy2      36.07   (12.7%)             27.05   (10.8%)              -25.0% ( -42% - -1%)    0.000
PKLookup    215.26  (3.4%)              213.99  (2.5%)               -0.6% ( -6% - 5%)       0.530
{code}
*BMMBulkScorer without window (with the above scorer implementation)*
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy2      16.32   (22.6%)             15.68   (16.3%)              -3.9% ( -34% - 45%)     0.527
Fuzzy1      48.11   (17.6%)             47.48   (13.6%)              -1.3% ( -27% - 36%)     0.791
PKLookup    213.67  (3.2%)              212.52  (4.0%)               -0.5% ( -7% - 6%)       0.640
{code}
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy2      26.99   (23.2%)             24.75   (13.6%)              -8.3% ( -36% - 37%)     0.169
PKLookup    216.27  (4.3%)              216.43  (3.4%)               0.1% ( -7% - 8%)        0.951
Fuzzy1      19.01   (24.2%)             20.01   (14.2%)              5.3% ( -26% - 57%)      0.400
{code}
*BMMBulkScorer with window size 1024*
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy2      23.56   (26.0%)             19.08   (13.9%)              -19.0% ( -46% - 28%)    0.004
Fuzzy1      30.97   (31.6%)             25.82   (16.9%)              -16.6% ( -49% - 46%)    0.038
PKLookup    213.23  (2.5%)              211.63  (1.8%)               -0.7% ( -5% - 3%)       0.289
{code}
{code:java}
Task        QPS baseline      StdDev    QPS my_modified_version      StdDev    Pct diff    p-value
Fuzzy1      20.59   (12.1%)             20.59   (10.5%)              -0.0% ( -20% - 25%)     0.994
PKLookup    205.21  (3.1%)              206.99  (3.7%)               0.9% ( -5% - 7%)        0.422
Fuzzy2      30.74   (22.7%)             32.71   (17.0%)              6.4% ( -27% - 59%)      0.311
{code}
These results actually look strange to me, as I would expect the BulkScorer-without-window variant to perform similarly to the scorer one, since it just uses the scorer implementation under the hood. I'll need to dive into it more to understand what contributed to these differences (their JFR CPU recordings look similar too). From the results I have so far, it seems BMM may not be ideal for handling queries with many terms. My high-level guess is that with these queries, which can be rewritten into boolean queries with ~50 terms, BMM may find itself spending lots of time computing upTo and updating maxScore, as the minimum of all scorers' block boundaries is used to update upTo each time. This can explain why the bulkScorer implementation with a fixed window size performs better than the scorer one, but doesn't explain the difference above. {quote}I wanted to do some more tests so I played with the MSMARCO passages dataset, which has the interesting property of having queries that have several terms (often around 8-10). See the attached benchmark if you are interested, here are the outputs I'm getting for various scorers: Contrary to my intuition, WAND seems to perform better despite the high number of terms. I wonder if there are some improvements we can still make to BMM? {quote} Thanks for running these additional tests! The results indeed look interesting. I took a look at the MSMarcoPassages.java code you attached, and wonder if it's also possible that, since the percentile numbers were computed after sort, for some low percentile (P10 for example)
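The fixed-window BulkScorer variant benchmarked above drives the scorer over bounded doc-id ranges instead of one document at a time, so per-window bookkeeping (such as recomputing upTo and the maxScore partition) is amortized. A simplified sketch of that driving loop — an illustrative interface, not the actual BulkScorer code:

```java
// Simplified sketch of scoring in fixed windows, as in the windowed
// BMMBulkScorer variant discussed above: per-window state is refreshed once
// per window of 1024 docs rather than per document. Names are illustrative.
public class WindowedScoring {
    static final int WINDOW_SIZE = 1024;

    interface WindowScorer {
        // Score all matching docs in [windowMin, windowMax).
        void scoreWindow(int windowMin, int windowMax);
    }

    static void scoreRange(WindowScorer scorer, int min, int max) {
        for (int windowMin = min; windowMin < max; windowMin += WINDOW_SIZE) {
            int windowMax = (int) Math.min((long) windowMin + WINDOW_SIZE, max);
            scorer.scoreWindow(windowMin, windowMax);
        }
    }
}
```
A larger window means less partitioning overhead but coarser max-score bounds inside each window, which is one plausible reason the windowed and non-windowed variants diverge on noisy tasks like Fuzzy1/Fuzzy2.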
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346597#comment-17346597 ] Zach Chen commented on LUCENE-9335: --- [~jpountz] what do you think about the results we got so far? If we are good with the trade-off and the performance improvement BMM has for _OrHighHigh_ and _OrHighMed_ queries, I can work on productizing the changes next.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345793#comment-17345793 ] Zach Chen commented on LUCENE-9335: --- I made some changes to the BulkScorer implementations to return false for BMM eligibility immediately when a non-term query is identified, and they improved the benchmark results for Fuzzy1 & Fuzzy2 a bit ([https://github.com/apache/lucene/pull/113/commits/f4115f78be0833b65694ad6a0f9f4f32565091e7]). However, it appears that Fuzzy1 & Fuzzy2 benchmark results vary more in general across runs / queries used compared to other tasks.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345575#comment-17345575 ] Zach Chen commented on LUCENE-9335: --- I see why Fuzzy1 & Fuzzy2 did not trigger the BMM scorer / bulkScorer now. Those queries were rewritten into boolean queries with boosting (BoostQuery), but in the BMM eligibility check I had checked for TermQuery directly ([https://github.com/apache/lucene/pull/113/files#diff-d500c30048128831b0fe3c53d9bb74eed7d8063e81d33737b26dcd00bc7f1fd2R337]), hence the BMM scorer / bulkScorer were not invoked for them. The looping in that check also likely hurt performance for both implementations, as fuzzy queries can expand into ones with many subqueries (one instance I saw had 50 subqueries), and the current logic would go through all subqueries.
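The fix implied above — unwrapping BoostQuery before the TermQuery test and returning false on the first non-term clause instead of scanning every subquery — can be sketched as follows. The types here are illustrative stand-ins for Lucene's query classes, not the actual patch:

```java
import java.util.List;

// Illustrative sketch of the BMM eligibility check described above: unwrap
// boosts before testing for term queries, and bail out on the first
// non-term clause so large fuzzy expansions exit early.
public class BmmEligibility {
    interface Query {}
    record TermQuery(String term) implements Query {}
    record BoostQuery(Query inner, float boost) implements Query {}
    record PhraseQuery(String text) implements Query {}

    static boolean eligibleForBmm(List<Query> clauses) {
        for (Query q : clauses) {
            while (q instanceof BoostQuery b) q = b.inner(); // unwrap boosts
            if (!(q instanceof TermQuery)) return false;     // bail out early
        }
        return true;
    }
}
```
With the early return, a 50-clause expansion containing a non-term clause near the front costs only a couple of checks rather than a full scan.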
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345564#comment-17345564 ] Zach Chen edited comment on LUCENE-9335 at 5/15/21, 6:57 PM: - {quote}Are you sure? I believe that fuzzy queries rewrite to boolean queries, so they would use your new block-max maxscore under the hood? {quote} Hmm I verified that by throwing a runtime exception in the BMM BulkScorer's constructor, and running only Fuzzy1 & Fuzzy2 queries in the benchmark, which completed successfully. I feel the slowdown may come from the checks to see if BMM is applicable. Let me take a further look there.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344973#comment-17344973 ] Zach Chen commented on LUCENE-9335: --- Just want to provide a quick summary of the latest progress on this issue. Currently there are 3 different BMM implementations from 2 PRs: # Scorer based implementation ## PR: [https://github.com/apache/lucene/pull/101] ## wikibigall benchmark results: [https://github.com/apache/lucene/pull/101#issuecomment-840255508] ### On average it improves _OrHighHigh_ by 40%+, and _OrHighMed_ by around 20% ### In 1 out of 3 runs it hurt _AndMedOrHighHigh_ and _OrHighMed_ performance by around 16% # BulkScorer based implementation with fixed window size ## PR: [https://github.com/apache/lucene/pull/113] ## wikibigall benchmark with window size 1024 results: [https://github.com/apache/lucene/pull/113#issuecomment-840293637] ### On average it improves _OrHighHigh_ by 3-8%, and _OrHighMed_ by 23%+ ### For some reason it hurt Fuzzy1 & Fuzzy2 performance by around 8%, even though it wasn't used for those queries # BulkScorer based implementation without window, using the scorer implementation from #1 ## Commit: [https://github.com/zacharymorn/lucene/commit/3bcdbb31a7d55b00cb53e4be40a4adc93b9f30db] ## wikibigall benchmark results: [https://github.com/apache/lucene/pull/113#discussion_r631568912] ### On average it improves _OrHighHigh_ by 52%, and _OrHighMed_ by 10% - 18% ### For some reason it hurt Fuzzy1 & Fuzzy2 performance consistently by around 8%-13%, even though it wasn't used for those queries [~jpountz] what do you think about the above results as well as the latest changes, and are there any other ideas we would like to try? From the current results it appears option 1 might be the one to go with? I can start to work on productizing the changes and adding tests once we have settled on the implementation approach here.
[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent
[ https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340017#comment-17340017 ] Zach Chen commented on LUCENE-9662: --- Hi [~mikemccand], I've taken a stab at this and created a WIP PR [https://github.com/apache/lucene/pull/128] with some nocommit comments. Could you please take a look and let me know your thoughts? > CheckIndex should be concurrent > --- > > Key: LUCENE-9662 > URL: https://issues.apache.org/jira/browse/LUCENE-9662 > Project: Lucene - Core > Issue Type: Bug >Reporter: Michael McCandless >Priority: Major > > I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, > using a single core out of the 128 cores the box has. > It seems like this is an embarrassingly parallel problem, if the index has > multiple segments, and would finish much more quickly on concurrent hardware > if we did "thread per segment". > If we wanted to get even further concurrency, each part of the Lucene index that > is checked is also independent, so it could be "thread per segment per part".
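The "thread per segment" idea above is embarrassingly parallel because segment checks are independent. A hedged sketch of that shape, assuming nothing about the real CheckIndex internals (the class name and the per-segment check body are stand-ins):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical "thread per segment" sketch, not the actual CheckIndex code:
// each segment is verified by its own task, and results are gathered at the end.
public class ConcurrentCheckSketch {
    static List<String> checkAllSegments(List<String> segmentNames, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : segmentNames) {
                // The lambda stands in for real per-segment verification work.
                futures.add(pool.submit(() -> "checked " + name));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // surfaces any per-segment failure
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(checkAllSegments(Arrays.asList("_0", "_1", "_2"), 2));
    }
}
```

The finer-grained "thread per segment per part" variant would submit one task per (segment, part) pair instead, at the cost of more coordination.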
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340005#comment-17340005 ] Zach Chen commented on LUCENE-9335: --- No problem! Writing these scorers has actually been a great exercise for me to understand more on the scoring related APIs and benchmark testing. I have enjoyed it a lot! For the profiling, are you referring to JFR? It is currently enabled by default in luceneutil and I've added the result below from 5 "Med" terms queries (queries file _wikimedium.10M.nostopwords.tasks.5OrMeds_ attached) : *BMM Scorer Run 1* {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrMedMedMedMedMed 40.66 (9.7%) 32.77 (7.7%) -19.4% ( -33% - -2%) 0.000 PKLookup 215.12 (1.5%) 221.56 (1.8%) 3.0% ( 0% - 6%) 0.000 {code} CPU merged search profile for my_modified_version: {code:java} PROFILE SUMMARY from 12153 events (total: 12153) tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false PERCENT CPU SAMPLES STACK 4.24% 515 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance() 4.22% 513 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#updateUpToAndMaxScore() 3.11% 378 java.util.LinkedList#listIterator() 2.53% 307 java.util.LinkedList$ListItr#next() 2.42% 294 java.util.zip.Inflater#inflateBytesBytes() 2.15% 261 org.apache.lucene.search.DisiPriorityQueue#pop() 1.60% 195 jdk.internal.misc.Unsafe#getByte() 1.47% 179 org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches() 1.41% 171 java.util.AbstractList$SubList#listIterator() 1.36% 165 java.util.AbstractList#listIterator() 1.31% 159 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance() 1.22% 148 org.apache.lucene.search.DisiPriorityQueue#upHeap() 1.21% 147 org.apache.lucene.search.DisiPriorityQueue#add() 1.20% 146 java.util.LinkedList$ListItr#() 1.15% 140 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact() 1.11% 135 
java.util.LinkedList$ListItr#checkForComodification() 1.08% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable() 1.04% 126 java.nio.DirectByteBuffer#get() 1.00% 122 java.lang.Object#wait() 1.00% 122 org.apache.lucene.search.DisiPriorityQueue#downHeap() 0.95% 116 java.util.AbstractList$Itr#() 0.82% 100 java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is() 0.81% 98 org.apache.lucene.store.ByteBufferGuard#getByte() 0.80% 97 org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32() 0.73% 89 sun.nio.fs.UnixNativeDispatcher#open0() 0.73% 89 java.lang.ClassLoader#defineClass1() 0.72% 87 java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion() 0.69% 84 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock() 0.63% 77 jdk.internal.util.ArraysSupport#mismatch() 0.63% 76 org.apache.lucene.search.BlockMaxMaxscoreScorer$1#repartitionLists() {code} CPU merged search profile for baseline: {code:java} PROFILE SUMMARY from 9671 events (total: 9671) tests.profile.mode=cpu tests.profile.count=30 tests.profile.stacksize=1 tests.profile.linenumbers=false PERCENT CPU SAMPLES STACK 2.96% 286 java.util.zip.Inflater#inflateBytesBytes() 1.78% 172 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance() 1.73% 167 org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact() 1.72% 166 org.apache.lucene.search.DisiPriorityQueue#upHeap() 1.57% 152 java.lang.Object#wait() 1.57% 152 org.apache.lucene.search.DisiPriorityQueue#add() 1.51% 146 org.apache.lucene.search.DisiPriorityQueue#downHeap() 1.43% 138 java.nio.DirectByteBuffer#get() 1.35% 131 java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable() 1.15% 111 java.io.RandomAccessFile#readBytes() 1.10% 106 java.lang.ClassLoader#defineClass1() 1.07% 103 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater() 1.01% 98 org.apache.lucene.store.ByteBufferGuard#getByte() 1.01% 98
[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-9335: -- Attachment: wikimedium.10M.nostopwords.tasks.5OrMeds
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337948#comment-17337948 ] Zach Chen commented on LUCENE-9335: --- I was trying to modify the _CreateQueries_ class in luceneutil to generate OR queries with 5 clauses, but got some issues running it. So I did some quick hack to combine the queries from OrHighHigh, OrHighMed and OrHighLow to create a new OrHighHighMedHighLow task with queries. I've attached the resulting file _wikimedium.10M.nostopwords.tasks_ to this ticket. Here are the luceneutil results from 2 runs for each implementation: Scorer [https://github.com/apache/lucene/pull/101] {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrHighHighMedHighLow 30.97 (6.2%) 24.92 (4.4%) -19.5% ( -28% - -9%) 0.000 PKLookup 223.53 (2.4%) 228.10 (3.7%) 2.0% ( -3% - 8%) 0.037{code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrHighHighMedHighLow 32.83 (3.4%) 34.00 (5.1%) 3.6% ( -4% - 12%) 0.009 PKLookup 217.86 (2.8%) 228.14 (4.2%) 4.7% ( -2% - 12%) 0.000 {code} BulkScorer [https://github.com/apache/lucene/pull/113|https://github.com/apache/lucene/pull/113.] 
{code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value PKLookup 197.84 (4.1%) 207.79 (4.2%) 5.0% ( -3% - 13%) 0.000 OrHighHighMedHighLow 32.50 (16.7%) 35.79 (9.9%) 10.1% ( -14% - 44%) 0.020 {code} {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrHighHighMedHighLow 28.61 (5.4%) 22.28 (4.2%) -22.1% ( -30% - -13%) 0.000 PKLookup 227.38 (2.6%) 233.05 (2.7%) 2.5% ( -2% - 8%) 0.003 {code}
[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-9335: -- Attachment: wikimedium.10M.nostopwords.tasks
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337742#comment-17337742 ] Zach Chen commented on LUCENE-9335: --- Hi [~jpountz], I've done another pass and fixed a few issues in [https://github.com/apache/lucene/pull/101]. I tried some other optimizations as well (such as moving a scorer from the essential to the non-essential list every time minCompetitiveScore gets updated), but they didn't seem to improve the benchmark results much for pure disjunction queries in either implementation. Assuming there's no major omission or bug in the two implementations so far, I also feel that, compared with BMW, the main bottleneck of BMM for the 2-clause OR queries run by the benchmark is indeed the additional frequent work performed to check and align on the max score boundary. What do you think? Do you have any suggestions on where I should look next?
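The per-document decision the comment above refers to can be sketched as follows (hypothetical names, not Lucene's actual code): after scoring the essential clauses for a candidate document, the non-essential clauses are only worth advancing if the essential score plus the sum of non-essential score upper bounds can still reach the minimum competitive score.

```java
// Hypothetical sketch of the per-document Maxscore pruning check, not
// Lucene's actual implementation.
public class MaxscorePruneSketch {
    // essentialScore: score accumulated from the essential clauses for this doc.
    // nonEssentialMaxScores: per-clause score upper bounds for the rest.
    // Returns true if the doc can possibly be competitive, i.e. the
    // non-essential clauses are worth advancing to it.
    static boolean maybeCompetitive(double essentialScore,
                                    double[] nonEssentialMaxScores,
                                    double minCompetitiveScore) {
        double attainable = essentialScore;
        for (double s : nonEssentialMaxScores) {
            attainable += s;
        }
        return attainable >= minCompetitiveScore;
    }

    public static void main(String[] args) {
        // Upper bound 2.0 + 1.5 + 0.4 = 3.9 < 4.0: the doc is skipped without
        // advancing the non-essential clauses at all.
        System.out.println(maybeCompetitive(2.0, new double[] {1.5, 0.4}, 4.0)); // prints false
    }
}
```

With only two clauses this check runs very often relative to the work it saves, which is consistent with the boundary-alignment overhead described above.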
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335125#comment-17335125 ] Zach Chen commented on LUCENE-9335: --- I've implemented the above strategy and opened a new PR for it: [https://github.com/apache/lucene/pull/113]. I used a _BulkScorer_ on top of a collection of _Scorers_ though, instead of a _BulkScorer_ on top of a collection of _BulkScorers_ like _BooleanScorer_, and I hope the difference is due to the algorithms rather than me misunderstanding the intended usage of the BulkScorer interface :D. The result from the benchmark util still shows it's slower than _WANDScorer_ for 2-clause queries, especially for the OrHighHigh task. While implementing this BulkScorer I also realized there were some issues with the other PR I published earlier, so I'll fix those next and see if that gives a better result.
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327129#comment-17327129 ] Zach Chen commented on LUCENE-9335: --- Makes sense. I guess the general strategy then would be to implement BMM in the BulkScorer, and do the maxScore initialization and the essential / non-essential list partition once per window, valid only within that 2048-document boundary. I'll give that a try!
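The window alignment described above can be sketched in a few lines (hypothetical names; this assumes only that per-window state such as block max scores and the list partition is recomputed at each boundary, with a window size of 2048 as discussed):

```java
// Hypothetical sketch of fixed-window alignment for a windowed BulkScorer,
// not Lucene's actual code.
public class WindowBoundarySketch {
    static final int WINDOW_SIZE = 2048;

    // Exclusive upper bound of the window containing doc: per-window state
    // computed for doc remains valid for all docs below this boundary, and
    // is recomputed once the iteration crosses it.
    static int windowEnd(int doc) {
        return (doc / WINDOW_SIZE + 1) * WINDOW_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(windowEnd(0));    // 2048
        System.out.println(windowEnd(2047)); // 2048
        System.out.println(windowEnd(2048)); // 4096
    }
}
```

Amortizing the partition work over a whole window is the point: the expensive bookkeeping runs once per 2048 docs instead of once per scored doc.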
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326323#comment-17326323 ] Zach Chen commented on LUCENE-9335: --- Hi [~jpountz], I took a stab at implementing BMM and published a new PR here for further discussion: [https://github.com/apache/lucene/pull/101]. I'm pretty happy about being able to implement a new scorer, even though its performance is a bit poor (although it seems to be on par with the experimental results published in [http://engineering.nyu.edu/~suel/papers/bmm.pdf] comparing BMM and BMW on 2-clause OR queries). Shall we consider adding a benchmark query set with 5+ clauses to see the performance comparison, as that seems to be when BMM may outperform BMW, as the paper suggests?
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320681#comment-17320681 ] Zach Chen commented on LUCENE-9335: ---
bq. Actually you should be able to do it without modifying the benchmarking code, by configuring your Competition object to not verify counts like that in your localrun file: {{comp = competition.Competition(verifyCounts=False)}}
Ah I see. Thanks for the tip, will use that going forward!
bq. Indeed this indicates that the query returns different top hits with your change. If the change was in the order of one ulp, then this could be due to the fact that the sum might depend on the order in which clauses' scores are summed up, but given the significant score difference, there must be a bigger problem. Have you run tests with this change? This could help figure out where the bug is.
Yes, *./gradlew check* was passing before, but I saw your comment in the PR and that calculation was indeed incorrect. Let me correct it and try again.
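The one-ulp point quoted above is easy to demonstrate: floating-point addition is not associative, so summing the same clause scores in a different order can change the total slightly. That makes tiny top-hits score differences benign, while the large difference reported earlier points to a real bug.

```java
// Demonstrates that double addition is not associative: the same three
// values summed in a different order differ by about one ulp of the result.
public class SumOrderSketch {
    public static void main(String[] args) {
        double leftToRight = (0.1 + 0.2) + 0.3; // 0.6000000000000001
        double rightToLeft = 0.1 + (0.2 + 0.3); // 0.6
        System.out.println(leftToRight == rightToLeft); // prints false
        System.out.println(leftToRight - rightToLeft);  // on the order of 1e-16
    }
}
```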
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319923#comment-17319923 ] Zach Chen edited comment on LUCENE-9335 at 4/13/21, 6:08 AM: - I made some further changes to move some block max related logic from DisjunctionMaxScorer to DisjunctionScorer, so that DisjunctionSumScorer can inherit. I've published a WIP PR [https://github.com/apache/lucene/pull/81] for those changes for the ease of review. When I run luceneutil, I see further errors from verifyScores section of code, which may indicate bugs in my changes: {code:java} WARNING: cat=OrHighHigh: hit counts differ: 9870+ vs 2616+ Traceback (most recent call last): File "src/python/localrun.py", line 53, in comp.benchmark("baseline_vs_patch") File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/competition.py", line 455, in benchmark randomSeed = self.randomSeed) File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/searchBench.py", line 196, in run raise RuntimeError('errors occurred: %s' % str(cmpDiffs)) RuntimeError: errors occurred: ([], ["query=body:second body:short filter=None sort=None groupField=None hitCount=9870+: hit 0 has wrong field/score value ([1444649], '5.0718417') vs ([5125], '4.224689')"], 1.0){code} I then made further changes in benchUtil.py to skip over verifyScores, so that I can see what benchmark results it would generate: {code:java} diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py index fb50033..c2faffc 100644 --- a/src/python/benchUtil.py +++ b/src/python/benchUtil.py @@ -1203,7 +1203,7 @@ class RunAlgs: cmpRawResults, heapCmp = parseResults(cmpLogFiles) # make sure they got identical results - cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts) + cmpDiffs = compareHits(baseRawResults, cmpRawResults, False, False) baseResults = collateResults(baseRawResults) cmpResults = collateResults(cmpRawResults){code} I then got the following benchmark 
results from multiple runs {code:java} TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value OrHighMed 186.44 (2.8%) 160.50 (4.5%) -13.9% ( -20% - -6%) 0.000 OrHighLow 735.70 (7.5%) 696.89 (4.3%) -5.3% ( -15% - 6%) 0.006 Fuzzy1 75.85 (11.5%) 72.81 (14.0%) -4.0% ( -26% - 24%) 0.323 TermDTSort 237.49 (10.4%) 228.02 (10.6%) -4.0% ( -22% - 18%) 0.230 HighTermMonthSort 280.82 (9.8%) 274.90 (10.8%) -2.1% ( -20% - 20%) 0.518 Fuzzy2 54.08 (12.5%) 53.04 (14.2%) -1.9% ( -25% - 28%) 0.648 OrNotHighMed 672.83 (2.7%) 661.16 (4.7%) -1.7% ( -8% - 5%) 0.153 HighTermTitleBDVSort 438.56 (14.4%) 431.81 (16.6%) -1.5% ( -28% - 34%) 0.754 AndHighLow 969.43 (5.2%) 957.49 (4.7%) -1.2% ( -10% - 9%) 0.432 OrNotHighHigh 704.98 (3.4%) 700.72 (3.9%) -0.6% ( -7% - 7%) 0.605 AndHighHigh 109.77 (4.2%) 109.31 (4.7%) -0.4% ( -9% - 8%) 0.767 BrowseMonthSSDVFacets 32.52 (2.1%) 32.40 (4.6%) -0.4% ( -6% - 6%) 0.755 PKLookup 219.90 (3.1%) 219.16 (3.2%) -0.3% ( -6% - 6%) 0.734 Wildcard 284.84 (1.9%) 284.18 (1.8%) -0.2% ( -3% - 3%) 0.690 Prefix3 361.00 (2.1%) 360.24 (2.0%) -0.2% ( -4% - 4%) 0.750 HighIntervalsOrdered 28.68 (2.2%) 28.64 (1.7%) -0.1% ( -3% - 3%) 0.819 BrowseMonthTaxoFacets 13.60 (2.9%) 13.59 (2.7%) -0.1% ( -5% - 5%) 0.947 BrowseDayOfYearSSDVFacets 28.67 (4.8%) 28.66 (4.8%) -0.0% ( -9% - 10%) 0.979 HighSpanNear 79.29 (2.4%) 79.29 (2.2%) 0.0% ( -4% - 4%) 0.997 OrHighNotHigh 695.37 (5.5%) 696.65 (3.8%) 0.2% ( -8% - 10%) 0.903 MedTerm 1478.47 (3.6%) 1481.54 (3.0%) 0.2% ( -6% - 7%) 0.843 HighTermDayOfYearSort 372.12 (14.1%) 373.08 (14.8%) 0.3% ( -25% - 33%) 0.955 IntNRQ 125.36 (1.3%) 125.72 (0.7%) 0.3% ( -1% - 2%) 0.391 LowSpanNear 52.82 (1.7%) 52.98 (2.0%) 0.3% ( -3% - 4%) 0.611 BrowseDayOfYearTaxoFacets 11.28 (3.1%) 11.31 (3.1%) 0.3% ( -5% - 6%) 0.756 LowSloppyPhrase 154.42 (2.9%) 154.91 (2.9%) 0.3% ( -5% - 6%) 0.731
[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning