[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570417#comment-17570417 ]

Zach Chen commented on LUCENE-10480:

From the latest nightly benchmark results, the negative impact on nested boolean queries has been resolved, and the performance boost to top-level disjunction queries has been maintained. Thanks for all the guidance [~jpountz]!

> Specialize 2-clauses disjunctions
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Assignee: Zach Chen
> Priority: Minor
> Time Spent: 11h 40m
> Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its invariants: one linked list for the current candidates, one priority queue of scorers that are behind, another one for scorers that are ahead. All this could be simplified in the 2-clauses case, which feels worth specializing for as it's very common that end users enter queries that only have two terms?

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568798#comment-17568798 ]

ASF subversion and git services commented on LUCENE-10480:

Commit 8ebb3305648aea8f551c2dd144d5a527b8982638 in lucene's branch refs/heads/branch_9x from Zach Chen
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8ebb3305648 ]

LUCENE-10480: (Backporting) Use BulkScorer to limit BMMScorer to only top-level disjunctions (#1037)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568781#comment-17568781 ]

ASF subversion and git services commented on LUCENE-10480:

Commit 28ce8abb5105dba5bc08b7f800f86f3741268bc9 in lucene's branch refs/heads/main from Zach Chen
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ce8abb510 ]

LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions (#1018)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566149#comment-17566149 ]

Zach Chen commented on LUCENE-10480:

{quote}I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level disjunctions in *BooleanWeight* or *Boolean2ScorerSupplier*, but they didn't work because of the recursive logic in the weight / query. So I ended up wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], pending tests update), like your other PR. Please let me know if this approach looks good to you, or if there's a better one.
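A toy sketch of why placing the specialized scorer behind the bulk-scoring path can restrict it to top-level disjunctions. It assumes that the searcher pulls a bulk scorer only for the top-level query, while parent queries pull regular scorers for their nested clauses. All names here (`ToyWeight`, `Term`, `Disjunction`, `Conjunction`, and the "BMM" / "WAND" labels) are hypothetical; this is neither Lucene's API nor the code from the PR.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelOnlySketch {
    /** Hypothetical, heavily simplified stand-in for a query weight; not Lucene's Weight API. */
    public interface ToyWeight {
        /** Label of the scorer that a parent query pulls for a nested clause. */
        String scorer();

        /** Label of the scorer that the searcher pulls for the top-level query only. */
        default String bulkScorer() {
            return scorer();
        }
    }

    public static final class Term implements ToyWeight {
        @Override
        public String scorer() {
            return "TERM";
        }
    }

    public static final class Disjunction implements ToyWeight {
        private final List<ToyWeight> clauses;

        public Disjunction(ToyWeight... clauses) {
            this.clauses = List.of(clauses);
        }

        @Override
        public String scorer() {
            // Nested path: unchanged, generic scorer.
            return "WAND";
        }

        @Override
        public String bulkScorer() {
            // Top-level path: the 2-clause specialization is only reachable here.
            return clauses.size() == 2 ? "BMM" : scorer();
        }
    }

    public static final class Conjunction implements ToyWeight {
        private final List<ToyWeight> clauses;

        public Conjunction(ToyWeight... clauses) {
            this.clauses = List.of(clauses);
        }

        @Override
        public String scorer() {
            List<String> parts = new ArrayList<>();
            for (ToyWeight clause : clauses) {
                // Nested clauses are asked for scorer(), never for bulkScorer().
                parts.add(clause.scorer());
            }
            return "AND(" + String.join(",", parts) + ")";
        }
    }

    public static void main(String[] args) {
        ToyWeight or = new Disjunction(new Term(), new Term());
        ToyWeight and = new Conjunction(new Term(), or);
        System.out.println(or.bulkScorer());
        System.out.println(and.bulkScorer());
    }
}
```

In this model the recursion through `scorer()` never reaches the specialized branch, so no change is needed in the recursive weight logic itself.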
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565375#comment-17565375 ]

Adrien Grand commented on LUCENE-10480:

+1 to explore this in a separate issue.

bq. Do you think this slowdown to AndHighOrMedMed may be considered as blocker to 9.3 release?

I wouldn't say blocker, but maybe we could give us time indeed by only using this new scorer on top-level disjunctions for now so that we have more time to figure out whether we should stick to BMW or switch to BMM for inner disjunctions.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565261#comment-17565261 ]

Zach Chen commented on LUCENE-10480:

{quote}Another thing that changes performance sometimes is the doc ID order, were you using multiple indexing threads maybe?
{quote}
Ok, this is actually the case for me. I was previously using 10 threads to index (INDEX_NUM_THREADS = 10), and after I commented that out and reindexed with the default setting, I was able to reproduce the slowdown:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed  91.27  (4.3%)  85.52  (4.3%)  -6.3% ( -14% - 2%)  0.000
PKLookup  333.25  (4.3%)  329.48  (3.8%)  -1.1% ( -8% - 7%)  0.380
AndHighHigh  104.25  (2.9%)  103.11  (3.0%)  -1.1% ( -6% - 5%)  0.247
SpanNear  16.52  (3.8%)  16.36  (3.1%)  -0.9% ( -7% - 6%)  0.396
TermGroup10K  23.99  (3.3%)  23.78  (3.0%)  -0.9% ( -6% - 5%)  0.384
Phrase  234.74  (2.7%)  232.71  (1.8%)  -0.9% ( -5% - 3%)  0.235
AndHighMed  163.80  (3.5%)  162.42  (4.3%)  -0.8% ( -8% - 7%)  0.496
TermBGroup1M  48.02  (3.5%)  47.65  (3.7%)  -0.8% ( -7% - 6%)  0.496
SloppyPhrase  4.82  (3.4%)  4.78  (2.7%)  -0.7% ( -6% - 5%)  0.460
TermGroup100  41.90  (3.9%)  41.63  (3.3%)  -0.7% ( -7% - 6%)  0.569
Term  2680.42  (4.7%)  2664.05  (3.3%)  -0.6% ( -8% - 7%)  0.632
TermGroup1M  39.95  (2.9%)  39.71  (3.2%)  -0.6% ( -6% - 5%)  0.531
TermBGroup1M1P  84.21  (6.1%)  83.82  (5.7%)  -0.5% ( -11% - 12%)  0.801
Respell  113.78  (1.9%)  113.44  (1.7%)  -0.3% ( -3% - 3%)  0.603
BrowseRandomLabelSSDVFacets  20.75  (8.2%)  20.74  (10.3%)  -0.0% ( -17% - 20%)  0.989
Fuzzy2  83.12  (1.8%)  83.11  (1.1%)  -0.0% ( -2% - 2%)  0.976
BrowseDayOfYearSSDVFacets  26.69  (12.0%)  26.70  (11.6%)  0.0% ( -21% - 26%)  0.995
Wildcard  115.84  (5.1%)  115.96  (5.8%)  0.1% ( -10% - 11%)  0.951
TermDayOfYearSort  260.70  (5.4%)  260.99  (2.8%)  0.1% ( -7% - 8%)  0.937
AndHighMedDayTaxoFacets  136.32  (2.6%)  136.63  (2.3%)  0.2% ( -4% - 5%)  0.773
IntervalsOrdered  128.13  (7.5%)  128.45  (7.7%)  0.3% ( -13% - 16%)  0.916
AndHighHighDayTaxoFacets  13.82  (2.8%)  13.87  (2.6%)  0.4% ( -4% - 5%)  0.657
Fuzzy1  79.16  (2.7%)  79.60  (1.8%)  0.6% ( -3% - 5%)  0.433
TermMonthSort  360.17  (6.4%)  362.83  (7.1%)  0.7% ( -11% - 15%)  0.728
TermTitleSort  191.21  (6.8%)  192.70  (7.1%)  0.8% ( -12% - 15%)  0.723
TermDTSort  208.40  (2.9%)  210.39  (2.9%)  1.0% ( -4% - 7%)  0.301
MedTermDayTaxoFacets  78.66  (5.2%)  79.59  (4.4%)  1.2% ( -7% - 11%)  0.436
TermDateFacets  41.04  (5.4%)  41.61  (4.7%)  1.4% ( -8% - 12%)  0.385
IntNRQ  122.00  (8.1%)  124.08  (8.3%)  1.7% ( -13% - 19%)  0.513
OrHighMedDayTaxoFacets  23.16  (8.4%)  23.71  (4.9%)  2.4% ( -10% - 17%)  0.272
BrowseMonthSSDVFacets  28.68  (13.8%)  29.55  (16.8%)  3.0% ( -24% - 39%)  0.531
BrowseDayOfYearTaxoFacets  30.40  (32.2%)  31.67  (34.2%)  4.2% ( -47% - 103%)  0.690
BrowseDateTaxoFacets  30.26  (32.2%)  31.57  (34.4%)  4.3% ( -47% - 104%)  0.680
Prefix3  402.14  (8.6%)  419.96  (8.9%)  4.4% ( -12% - 23%)  0.109
AndMedOrHighHigh  94.79  (4.0%)  99.03  (4.5%)  4.5% ( -3% - 13%)  0.001
BrowseRandomLabelTaxoFacets  32.45  (49.2%)  35.05  (53.4%)  8.0% ( -63% - 217%)  0.622
BrowseMonthTaxoFacets  28.68  (35.3%)  31.37  (39.1%)  9.4% ( -48% - 129%)  0.425
BrowseDateSSDVFacets  3.96  (28.1%)  4.54  (26.3%)  14.7% ( -31% - 96%)  0.089
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564885#comment-17564885 ]

Adrien Grand commented on LUCENE-10480:

I haven't tried to reproduce it, but the steps you took by running on wikibigall with the nightly tasks file sound good to me. Another thing that sometimes changes performance is the doc ID order; were you using multiple indexing threads, maybe?

Ignoring the fact that we cannot reproduce the slowdown, if I try to think of the main differences between WANDScorer and BlockMaxMaxscoreScorer for AndHighOrMedMed, I think the main one is the way that {{advanceShallow}} is computed. Conjunctions use the block boundaries of the clause that has the lowest cost, so this could explain why we are seeing a slowdown with AndHighOrMedMed (since the conjunction uses the block boundaries of OrMedMed) and not with AndMedOrHighHigh (since the conjunction uses the block boundaries of Med). Maybe we could explore other approaches for {{advanceShallow}}, such as taking the minimum block boundary across essential clauses only instead of across all clauses.
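The last suggestion can be illustrated with a minimal sketch that compares the two ways of picking a block boundary. The `Clause` type, its `blockEnd` field (the last doc ID of the clause's current block) and the `essential` flag are hypothetical stand-ins for illustration, not Lucene classes, and the sketch says nothing about which variant is faster in practice.

```java
import java.util.List;

public class BlockBoundarySketch {
    /** Hypothetical stand-in for one disjunction clause; not a Lucene class. */
    public static final class Clause {
        public final int blockEnd;      // last doc ID covered by the clause's current block
        public final boolean essential; // whether the clause is currently essential

        public Clause(int blockEnd, boolean essential) {
            this.blockEnd = blockEnd;
            this.essential = essential;
        }
    }

    /** Boundary taken as the minimum block end across all clauses. */
    public static int boundaryAllClauses(List<Clause> clauses) {
        int min = Integer.MAX_VALUE;
        for (Clause clause : clauses) {
            min = Math.min(min, clause.blockEnd);
        }
        return min;
    }

    /** Alternative floated above: minimum block end across essential clauses only. */
    public static int boundaryEssentialOnly(List<Clause> clauses) {
        int min = Integer.MAX_VALUE; // stays at MAX_VALUE if no clause is essential
        for (Clause clause : clauses) {
            if (clause.essential) {
                min = Math.min(min, clause.blockEnd);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        List<Clause> clauses = List.of(new Clause(127, false), new Clause(511, true));
        System.out.println(boundaryAllClauses(clauses));
        System.out.println(boundaryEssentialOnly(clauses));
    }
}
```

With a non-essential clause whose block ends early, the essential-only variant reports a later boundary, so a conjunction that relies on it would see larger blocks.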
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564747#comment-17564747 ]

Zach Chen commented on LUCENE-10480:

{quote}I'll see if I can run the original nightly benchmark code / tests from my machine to see if there's any difference.
{quote}
I tried to run *nightlyBench.py* locally on my machine over the weekend, but that turns out to require some changes to the script itself, and I haven't been able to run it fully so far. On the other hand, I tried a few more run configurations with *localrun.py*, including running it in a virtual Ubuntu box (as the nightly benchmark runs on a Linux box), but have had no luck so far reproducing the [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] slowdown.

[~jpountz], just curious, are you able to reproduce the slowdown locally on your end?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564611#comment-17564611 ]

Zach Chen commented on LUCENE-10480:

{quote}[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html] recovered fully but [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] only a bit. I'm unsure what explains there is still a slowdown compared to BMW.
{quote}
Hmm, this is quite strange. It looks like [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] still shows an impact of about -13% (5 / 38). I just ran the full suite of wikinightly tasks a few times (by copying *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, running *localrun.py* with source *wikimedium10m*, and removing the *VectorSearch* queries, as they were failing with an NPE for me) but couldn't reproduce the slowdown (the baseline is head before all BMM changes):
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseRandomLabelSSDVFacets  20.83  (3.8%)  20.09  (6.5%)  -3.6% ( -13% - 6%)  0.034
BrowseMonthSSDVFacets  30.36  (10.6%)  29.56  (12.7%)  -2.7% ( -23% - 23%)  0.473
Prefix3  402.70  (9.3%)  397.59  (9.9%)  -1.3% ( -18% - 19%)  0.674
TermDayOfYearSort  183.55  (6.5%)  181.61  (6.9%)  -1.1% ( -13% - 13%)  0.617
TermTitleSort  195.99  (7.2%)  194.25  (8.1%)  -0.9% ( -15% - 15%)  0.713
PKLookup  293.80  (3.7%)  291.47  (4.8%)  -0.8% ( -8% - 7%)  0.555
TermMonthSort  283.86  (7.1%)  281.74  (8.0%)  -0.7% ( -14% - 15%)  0.755
Wildcard  227.26  (6.2%)  225.87  (6.4%)  -0.6% ( -12% - 12%)  0.759
Term  2227.50  (3.7%)  2219.57  (3.3%)  -0.4% ( -7% - 6%)  0.748
Fuzzy1  134.77  (2.8%)  134.37  (2.3%)  -0.3% ( -5% - 4%)  0.712
TermGroup100  53.61  (3.7%)  53.47  (4.6%)  -0.3% ( -8% - 8%)  0.846
TermDTSort  143.16  (3.2%)  142.89  (3.3%)  -0.2% ( -6% - 6%)  0.857
TermBGroup1M1P  79.44  (5.5%)  79.29  (5.5%)  -0.2% ( -10% - 11%)  0.917
AndHighHighDayTaxoFacets  45.01  (2.3%)  44.94  (2.1%)  -0.1% ( -4% - 4%)  0.833
BrowseRandomLabelTaxoFacets  30.94  (50.0%)  30.92  (46.8%)  -0.0% ( -64% - 193%)  0.998
AndHighMedDayTaxoFacets  78.11  (3.2%)  78.11  (3.0%)  -0.0% ( -6% - 6%)  0.998
Phrase  202.17  (2.7%)  202.18  (2.0%)  0.0% ( -4% - 4%)  0.996
Fuzzy2  76.10  (2.6%)  76.15  (2.0%)  0.1% ( -4% - 4%)  0.933
TermGroup1M  22.65  (3.8%)  22.67  (3.2%)  0.1% ( -6% - 7%)  0.919
TermDateFacets  32.50  (5.3%)  32.60  (5.5%)  0.3% ( -9% - 11%)  0.861
BrowseDayOfYearSSDVFacets  26.31  (5.9%)  26.39  (8.5%)  0.3% ( -13% - 15%)  0.897
Respell  88.21  (2.2%)  88.49  (2.1%)  0.3% ( -3% - 4%)  0.642
SpanNear  16.14  (4.0%)  16.22  (4.2%)  0.5% ( -7% - 9%)  0.706
MedTermDayTaxoFacets  73.42  (4.8%)  73.85  (4.9%)  0.6% ( -8% - 10%)  0.708
TermBGroup1M  48.92  (4.2%)  49.23  (2.8%)  0.6% ( -6% - 8%)  0.581
IntervalsOrdered  22.42  (5.8%)  22.59  (4.2%)  0.7% ( -8% - 11%)  0.651
OrHighMedDayTaxoFacets  25.27  (6.1%)  25.46  (6.6%)  0.7% ( -11% - 14%)  0.711
TermGroup10K  30.26  (4.2%)  30.50  (2.9%)  0.8% ( -6% - 8%)  0.494
SloppyPhrase  91.40  (5.6%)  92.16  (6.3%)  0.8% ( -10% - 13%)  0.662
IntNRQ  152.74  (20.3%)  154.86  (17.1%)  1.4% ( -29% - 48%)  0.815
AndHighMed  88.55  (2.6%)  89.98  (3.1%)  1.6% ( -3% - 7%)  0.073
AndHighHigh  29.10  (2.7%)  29.68  (3.1%)  2.0% ( -3% - 8%)  0.032
BrowseDayOfYearTaxoFacets  31.29  (40.0%)  31.93  (38.0%)  2.0% ( -54% - 133%)  0.869
BrowseDateTaxoFacets  31.18  (40.3%)  31.87  (38.5%)  2.2
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564565#comment-17564565 ]

Adrien Grand commented on LUCENE-10480:

[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html] recovered fully but [AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] only a bit. I'm unsure what explains why there is still a slowdown compared to BMW.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564326#comment-17564326 ]

ASF subversion and git services commented on LUCENE-10480:

Commit 090cbc50dd7e5659494149f470378ab7f6a90cf1 in lucene's branch refs/heads/branch_9x from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=090cbc50dd7 ]

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006) (#1008)

(cherry picked from commit da8143bfa38cd5fadae4b4712b9e639e79016021)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563627#comment-17563627 ]

ASF subversion and git services commented on LUCENE-10480:

Commit da8143bfa38cd5fadae4b4712b9e639e79016021 in lucene's branch refs/heads/main from zacharymorn
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=da8143bfa38 ]

LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563536#comment-17563536 ]

Zach Chen commented on LUCENE-10480:

Ok I see. Maybe I can also try to run some benchmark experiments with different JVM compilation / code cache parameters to further test things out. Will report back if I find something interesting!
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563022#comment-17563022 ]

Adrien Grand commented on LUCENE-10480:

I still suspect that one issue when only running queries that are very good at dynamic pruning is that the JVM doesn't have time to warm up. These queries can figure out the top 10 hits by only evaluating a few thousand hits, so parts of the logic probably still run in interpreted mode. The fact that queries run slower when you run them in isolation further suggests that this is the problematic scenario, not the case when the benchmark includes multiple types of queries?
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562944#comment-17562944 ]

Zach Chen commented on LUCENE-10480:

{quote}maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores.
{quote}
This approach does help stabilize performance for disjunction-within-conjunction queries (and also provides some small gains)! I have opened a PR for it: [https://github.com/apache/lucene/pull/1006].
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562919#comment-17562919 ]

Zach Chen commented on LUCENE-10480:

{quote}Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However disjunctions within conjunctions got a slowdown, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}
The results look encouraging and interesting! I copied the boolean queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, ran the benchmark, and was able to reproduce the slowdown:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed  108.16  (6.5%)  100.44  (5.4%)  -7.1% ( -17% - 5%)  0.000
AndMedOrHighHigh  68.37  (4.5%)  63.92  (5.0%)  -6.5% ( -15% - 3%)  0.000
AndHighHigh  122.90  (5.5%)  122.77  (5.5%)  -0.1% ( -10% - 11%)  0.952
AndHighMed  113.27  (6.4%)  114.63  (6.2%)  1.2% ( -10% - 14%)  0.546
PKLookup  228.08  (14.4%)  232.90  (14.7%)  2.1% ( -23% - 36%)  0.646
OrHighHigh  26.89  (5.7%)  48.62  (12.2%)  80.8% ( 59% - 104%)  0.000
OrHighMed  81.18  (5.9%)  187.05  (12.2%)  130.4% ( 105% - 157%)  0.000
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndMedOrHighHigh  85.67  (5.3%)  73.23  (5.7%)  -14.5% ( -24% - -3%)  0.000
PKLookup  260.08  (13.4%)  253.74  (14.9%)  -2.4% ( -27% - 29%)  0.586
AndHighHigh  73.68  (4.7%)  72.70  (4.1%)  -1.3% ( -9% - 7%)  0.339
AndHighMed  89.52  (5.1%)  88.55  (4.4%)  -1.1% ( -10% - 8%)  0.470
AndHighOrMedMed  63.27  (6.5%)  70.48  (5.7%)  11.4% ( 0% - 25%)  0.000
OrHighHigh  19.60  (5.3%)  25.62  (7.6%)  30.8% ( 16% - 46%)  0.000
OrHighMed  121.08  (5.7%)  236.34  (10.2%)  95.2% ( 74% - 117%)  0.000
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndMedOrHighHigh  86.88  (3.4%)  76.60  (3.1%)  -11.8% ( -17% - -5%)  0.000
AndHighHigh  30.49  (3.5%)  30.36  (3.5%)  -0.4% ( -7% - 6%)  0.697
AndHighMed  192.76  (3.4%)  193.72  (3.9%)  0.5% ( -6% - 8%)  0.671
PKLookup  262.59  (5.5%)  264.52  (7.9%)  0.7% ( -11% - 14%)  0.731
AndHighOrMedMed  65.47  (3.8%)  73.43  (3.0%)  12.2% ( 5% - 19%)  0.000
OrHighHigh  21.47  (4.1%)  36.94  (8.3%)  72.1% ( 57% - 88%)  0.000
OrHighMed  99.91  (4.3%)  292.05  (12.9%)  192.3% ( 167% - 218%)  0.000
{code}
However, when I reduced the task types further to just conjunction + disjunction (with the default number of search threads), the results actually turned positive and were similar to what I saw earlier in [https://github.com/apache/lucene/pull/972#issuecomment-1166188875]:
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
AndHighOrMedMed  58.65  (37.3%)  71.63  (28.9%)  22.1% ( -32% - 140%)  0.036
AndMedOrHighHigh  36.43  (39.3%)  44.61  (30.7%)  22.4% ( -34% - 152%)  0.044
PKLookup  163.58  (34.4%)  211.88  (32.7%)  29.5% ( -27% - 147%)  0.005
{code}
{code:java}
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
PKLookup  146.51  (22.0%)  188.92  (30.1%)  28.9% ( -18% - 103%)  0.001
AndMedOrHighHigh  35.59  (27.1%)  49.99  (37.5%)  40.4% ( -18% - 144%)  0.000
AndHighOrMedMed  44.47  (26.6%)  63.
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562730#comment-17562730 ]

Adrien Grand commented on LUCENE-10480:

Looking at this new scorer from the perspective of disjunctions within conjunctions, maybe there are bits from advance() that we could move to matches() so that we would hand it over to the other clause before we start doing expensive operations like computing scores. What do you think [~zacharymorn]?
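A rough sketch of the effect of deferring the expensive work, under simplifying assumptions: doc IDs are plain sorted arrays, the "expensive check" stands in for score computation, and the `intersect` helper with its `deferToMatches` flag is hypothetical. It only counts how often the expensive check runs in each style and does not reproduce Lucene's TwoPhaseIterator or the actual scorer.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

public class TwoPhaseSketch {
    /** Number of hits found and number of expensive checks performed. */
    public static final class Result {
        public final int hits;
        public final int expensiveChecks;

        Result(int hits, int expensiveChecks) {
            this.hits = hits;
            this.expensiveChecks = expensiveChecks;
        }
    }

    /**
     * Intersects the disjunction's candidate docs with another clause's docs (both sorted).
     * With deferToMatches=false the expensive check runs for every candidate (eager style);
     * with deferToMatches=true it runs only after the other clause has accepted the doc.
     */
    public static Result intersect(
            int[] disjunctionDocs, int[] otherClauseDocs,
            IntPredicate expensiveCheck, boolean deferToMatches) {
        int hits = 0;
        int checks = 0;
        for (int doc : disjunctionDocs) {
            boolean inOther = Arrays.binarySearch(otherClauseDocs, doc) >= 0; // cheap
            if (deferToMatches) {
                if (!inOther) {
                    continue; // rejected by the other clause, no expensive work done
                }
                checks++;
                if (expensiveCheck.test(doc)) {
                    hits++;
                }
            } else {
                checks++; // pay for the check while advancing
                if (expensiveCheck.test(doc) && inOther) {
                    hits++;
                }
            }
        }
        return new Result(hits, checks);
    }

    public static void main(String[] args) {
        int[] disjunction = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] other = {5, 10};
        IntPredicate competitive = doc -> doc % 2 == 0; // placeholder for a score test
        Result eager = intersect(disjunction, other, competitive, false);
        Result deferred = intersect(disjunction, other, competitive, true);
        System.out.println("eager: hits=" + eager.hits + " checks=" + eager.expensiveChecks);
        System.out.println("deferred: hits=" + deferred.hits + " checks=" + deferred.expensiveChecks);
    }
}
```

Both styles find the same hits; the deferred one simply skips the expensive check for candidates that the other clause would have rejected anyway.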
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562711#comment-17562711 ]

Adrien Grand commented on LUCENE-10480:

Nightly benchmarks picked up the change and top-level disjunctions are seeing massive speedups, see [OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. However, disjunctions within conjunctions got slower, see [AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html] or [AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562266#comment-17562266 ] ASF subversion and git services commented on LUCENE-10480: Commit a5c99aca1abc9b73a0c68d4f23533311382b718c in lucene's branch refs/heads/branch_9x from zacharymorn [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a5c99aca1ab ] LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972) (#1002) (cherry picked from commit 503ec5597331454bf8b6af79b9701cfdccf5)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561787#comment-17561787 ] ASF subversion and git services commented on LUCENE-10480: Commit 503ec5597331454bf8b6af79b9701cfdccf5 in lucene's branch refs/heads/main from zacharymorn [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=503ec559733 ] LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972)
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553494#comment-17553494 ] Adrien Grand commented on LUCENE-10480: Good question. Looking at your BlockMaxMaxScoreScorer, it looks like it also has potential for being specialized in the 2-clauses case by having two sub scorers and tracking during document collection whether the scorer that produces lower scores is optional or required. I didn't have concrete plans in mind when opening the issue; I was just observing that we pay significant overhead for supporting arbitrary numbers of clauses when disjunctions often have only two clauses.
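One way to read "tracking whether the scorer that produces lower scores is optional or required" is the usual MaxScore partitioning, reduced to two clauses. The sketch below is an assumption about how that bookkeeping could look, not the code that was committed: the class, enum, and parameter names are hypothetical, and the per-clause maximum scores and the minimum competitive score are simply passed in as numbers.

```java
// Hypothetical illustration of MaxScore-style bookkeeping for exactly two clauses.
final class TwoClauseMaxScoreSketch {
  enum Mode {
    BOTH_LEAD,            // either clause alone can still produce a competitive hit
    LOW_OPTIONAL,         // only the higher-scoring clause needs to drive iteration
    LOW_REQUIRED,         // a hit must match both clauses to be competitive
    NO_COMPETITIVE_MATCH  // even a doc matching both clauses cannot be competitive
  }

  /**
   * Classifies the lower-scoring clause given the maximum score each clause can
   * contribute and the minimum score a hit needs to be competitive.
   */
  static Mode classify(double maxScoreLow, double maxScoreHigh, double minCompetitiveScore) {
    if (maxScoreLow > maxScoreHigh) {
      throw new IllegalArgumentException("maxScoreLow must not exceed maxScoreHigh");
    }
    if (minCompetitiveScore <= maxScoreLow) {
      return Mode.BOTH_LEAD;
    }
    if (minCompetitiveScore <= maxScoreHigh) {
      // A doc matching only the low clause cannot be competitive, so the low
      // clause is consulted only for docs the high clause already matched.
      return Mode.LOW_OPTIONAL;
    }
    if (minCompetitiveScore <= maxScoreLow + maxScoreHigh) {
      // Neither clause is enough alone; the disjunction behaves like a conjunction.
      return Mode.LOW_REQUIRED;
    }
    return Mode.NO_COMPETITIVE_MATCH;
  }

  public static void main(String[] args) {
    // As collection proceeds, the minimum competitive score rises and the mode tightens.
    for (double min : new double[] {0.5, 2.0, 3.5, 4.5}) {
      System.out.println("min=" + min + " -> " + classify(1.0, 3.0, min));
    }
  }
}
```

With only two clauses, this state fits in a single enum value that is re-evaluated whenever the minimum competitive score or the block maximum scores change, which is the kind of simplification over a general N-clause structure that the comment describes.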
[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552556#comment-17552556 ] Zach Chen commented on LUCENE-10480: Hi [~jpountz], this issue reminded me of our experiments last year implementing a BMM scorer for pure disjunctions, which [showed about 20% ~ 40% improvement for OrHighHigh and OrHighMed queries|https://github.com/apache/lucene/pull/101#issuecomment-840255508]. Do you think we should continue to explore in that direction, or might there be better / simpler algorithms we could look into?