[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570417#comment-17570417
 ] 

Zach Chen commented on LUCENE-10480:


From the latest nightly benchmark results, the negative impact on nested 
boolean queries has been resolved, and the performance boost to top-level 
disjunction queries has been maintained. Thanks for all the guidance, 
[~jpountz]!

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Zach Chen
>Priority: Minor
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?
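
To make the description above more concrete, here is a minimal sketch of what a two-clause specialization could look like. It is not Lucene's implementation: the `Clause` class, the `competitiveDocs` method, and the fixed `minScore` threshold are all hypothetical, and it uses a single global maximum score per clause rather than per-block maxima. The point is only that with two clauses the bookkeeping reduces to a comparison of two maximum scores, with no linked list or priority queues.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoClauseDisjunctionSketch {

  /** Toy postings list (hypothetical): sorted doc IDs with one precomputed score per doc. */
  static final class Clause {
    final int[] docs;
    final float[] scores;
    final float maxScore;
    int pos = 0;

    Clause(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
      float max = 0f;
      for (float s : scores) {
        max = Math.max(max, s);
      }
      this.maxScore = max;
    }

    /** Current doc ID, or Integer.MAX_VALUE once the list is exhausted. */
    int doc() {
      return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
    }

    float score() {
      return scores[pos];
    }

    void next() {
      pos++;
    }

    /** Moves to the first doc ID that is greater than or equal to target. */
    void advance(int target) {
      while (doc() < target) {
        pos++;
      }
    }
  }

  /** Returns the doc IDs whose summed score reaches minScore. */
  static List<Integer> competitiveDocs(Clause a, Clause b, float minScore) {
    List<Integer> hits = new ArrayList<>();
    Clause lead = a.maxScore >= b.maxScore ? a : b;
    Clause other = lead == a ? b : a;

    if (other.maxScore < minScore) {
      // `other` can never be competitive on its own, so only `lead` drives iteration.
      for (; lead.doc() != Integer.MAX_VALUE; lead.next()) {
        int doc = lead.doc();
        float score = lead.score();
        if (score + other.maxScore < minScore) {
          continue; // not competitive even if `other` also matched this doc
        }
        other.advance(doc);
        if (other.doc() == doc) {
          score += other.score();
        }
        if (score >= minScore) {
          hits.add(doc);
        }
      }
      return hits;
    }

    // Either clause can be competitive alone: plain two-way merge.
    while (true) {
      int doc = Math.min(a.doc(), b.doc());
      if (doc == Integer.MAX_VALUE) {
        break;
      }
      float score = 0f;
      if (a.doc() == doc) {
        score += a.score();
        a.next();
      }
      if (b.doc() == doc) {
        score += b.score();
        b.next();
      }
      if (score >= minScore) {
        hits.add(doc);
      }
    }
    return hits;
  }
}
```

For example, with clause A on docs {1, 3, 5, 8} scoring {2.0, 0.5, 2.0, 1.0}, clause B on docs {3, 4, 8} scoring 0.4 each, and a threshold of 1.2, the sketch returns docs 1, 5 and 8 and never visits doc 4, which matches only the low-scoring clause.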



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-19 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen resolved LUCENE-10480.

  Assignee: Zach Chen
Resolution: Done




[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-12 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566149#comment-17566149
 ] 

Zach Chen edited comment on LUCENE-10480 at 7/13/22 5:09 AM:
-

{quote}I wouldn't say blocker, but maybe we could give us time indeed by only 
using this new scorer on top-level disjunctions for now so that we have more 
time to figure out whether we should stick to BMW or switch to BMM for inner 
disjunctions.
{quote}
Sounds good. I tried a few quick approaches to limit the BMM scorer to top-level 
disjunctions in *BooleanWeight* or *Boolean2ScorerSupplier*, but they 
didn't work because of the recursive logic of the weight / query. So I ended up 
wrapping the scorer inside a bulk scorer ([https://github.com/apache/lucene/pull/1018], 
pending a tests update), like your other PR. Please let me know if this approach 
looks good to you, or if there's a better one. 
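
The reasoning above can be illustrated with a toy model. This is a sketch under stated assumptions, not the code from the PR: `ToyWeight`, `ToyTerm`, `ToyDisjunction` and `ToyConjunction` are hypothetical stand-ins, and scorers are represented as plain strings. It relies on one property of the real design: parent queries ask their clauses for a scorer recursively, while the bulk scorer is requested only for the top-level query. That makes the bulk-scorer entry point a natural place for a top-level-only specialization.

```java
import java.util.List;

final class TopLevelOnlySketch {

  /** Hypothetical stand-in for a per-query weight; not Lucene's Weight class. */
  interface ToyWeight {
    /** Called recursively by parent queries for each of their clauses. */
    String scorer();

    /** Called once by the searcher, and only on the top-level query. */
    default String bulkScorer() {
      return "default(" + scorer() + ")";
    }
  }

  /** Joins the scorers of all clauses with commas. */
  static String join(List<ToyWeight> clauses) {
    StringBuilder sb = new StringBuilder();
    for (ToyWeight w : clauses) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(w.scorer());
    }
    return sb.toString();
  }

  static final class ToyTerm implements ToyWeight {
    final String term;

    ToyTerm(String term) {
      this.term = term;
    }

    @Override
    public String scorer() {
      return term;
    }
  }

  static final class ToyConjunction implements ToyWeight {
    final List<ToyWeight> clauses;

    ToyConjunction(List<ToyWeight> clauses) {
      this.clauses = clauses;
    }

    @Override
    public String scorer() {
      return "and(" + join(clauses) + ")";
    }
  }

  static final class ToyDisjunction implements ToyWeight {
    final List<ToyWeight> clauses;

    ToyDisjunction(List<ToyWeight> clauses) {
      this.clauses = clauses;
    }

    @Override
    public String scorer() {
      // Reached for nested disjunctions: keep the general-purpose scorer.
      return "wand(" + join(clauses) + ")";
    }

    @Override
    public String bulkScorer() {
      // Reached only when this disjunction is the top-level query.
      return clauses.size() == 2
          ? "twoClause(" + join(clauses) + ")"
          : ToyWeight.super.bulkScorer();
    }
  }
}
```

In this model a top-level disjunction of terms `a` and `b` yields `twoClause(a,b)`, while the same disjunction nested under a conjunction with `x` yields `default(and(x,wand(a,b)))`, so the nested case is left untouched.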



 







[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-11 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565261#comment-17565261
 ] 

Zach Chen commented on LUCENE-10480:


{quote}Another thing that changes performance sometimes is the doc ID order, 
were you using multiple indexing threads maybe?
{quote}
OK, this is actually the case for me. I was previously using 10 threads to index 
(INDEX_NUM_THREADS = 10), and after I commented that out and reindexed with the 
default setting, I was able to reproduce the slowdown:

 
{code:java}
                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       91.27      (4.3%)       85.52      (4.3%)   -6.3% ( -14% -    2%) 0.000
                        PKLookup      333.25      (4.3%)      329.48      (3.8%)   -1.1% (  -8% -    7%) 0.380
                     AndHighHigh      104.25      (2.9%)      103.11      (3.0%)   -1.1% (  -6% -    5%) 0.247
                        SpanNear       16.52      (3.8%)       16.36      (3.1%)   -0.9% (  -7% -    6%) 0.396
                    TermGroup10K       23.99      (3.3%)       23.78      (3.0%)   -0.9% (  -6% -    5%) 0.384
                          Phrase      234.74      (2.7%)      232.71      (1.8%)   -0.9% (  -5% -    3%) 0.235
                      AndHighMed      163.80      (3.5%)      162.42      (4.3%)   -0.8% (  -8% -    7%) 0.496
                    TermBGroup1M       48.02      (3.5%)       47.65      (3.7%)   -0.8% (  -7% -    6%) 0.496
                    SloppyPhrase        4.82      (3.4%)        4.78      (2.7%)   -0.7% (  -6% -    5%) 0.460
                    TermGroup100       41.90      (3.9%)       41.63      (3.3%)   -0.7% (  -7% -    6%) 0.569
                            Term     2680.42      (4.7%)     2664.05      (3.3%)   -0.6% (  -8% -    7%) 0.632
                     TermGroup1M       39.95      (2.9%)       39.71      (3.2%)   -0.6% (  -6% -    5%) 0.531
                  TermBGroup1M1P       84.21      (6.1%)       83.82      (5.7%)   -0.5% ( -11% -   12%) 0.801
                         Respell      113.78      (1.9%)      113.44      (1.7%)   -0.3% (  -3% -    3%) 0.603
     BrowseRandomLabelSSDVFacets       20.75      (8.2%)       20.74     (10.3%)   -0.0% ( -17% -   20%) 0.989
                          Fuzzy2       83.12      (1.8%)       83.11      (1.1%)   -0.0% (  -2% -    2%) 0.976
       BrowseDayOfYearSSDVFacets       26.69     (12.0%)       26.70     (11.6%)    0.0% ( -21% -   26%) 0.995
                        Wildcard      115.84      (5.1%)      115.96      (5.8%)    0.1% ( -10% -   11%) 0.951
               TermDayOfYearSort      260.70      (5.4%)      260.99      (2.8%)    0.1% (  -7% -    8%) 0.937
         AndHighMedDayTaxoFacets      136.32      (2.6%)      136.63      (2.3%)    0.2% (  -4% -    5%) 0.773
                IntervalsOrdered      128.13      (7.5%)      128.45      (7.7%)    0.3% ( -13% -   16%) 0.916
        AndHighHighDayTaxoFacets       13.82      (2.8%)       13.87      (2.6%)    0.4% (  -4% -    5%) 0.657
                          Fuzzy1       79.16      (2.7%)       79.60      (1.8%)    0.6% (  -3% -    5%) 0.433
                   TermMonthSort      360.17      (6.4%)      362.83      (7.1%)    0.7% ( -11% -   15%) 0.728
                   TermTitleSort      191.21      (6.8%)      192.70      (7.1%)    0.8% ( -12% -   15%) 0.723
                      TermDTSort      208.40      (2.9%)      210.39      (2.9%)    1.0% (  -4% -    7%) 0.301
            MedTermDayTaxoFacets       78.66      (5.2%)       79.59      (4.4%)    1.2% (  -7% -   11%) 0.436
                  TermDateFacets       41.04      (5.4%)       41.61      (4.7%)    1.4% (  -8% -   12%) 0.385
                          IntNRQ      122.00      (8.1%)      124.08      (8.3%)    1.7% ( -13% -   19%) 0.513
          OrHighMedDayTaxoFacets       23.16      (8.4%)       23.71      (4.9%)    2.4% ( -10% -   17%) 0.272
           BrowseMonthSSDVFacets       28.68     (13.8%)       29.55     (16.8%)    3.0% ( -24% -   39%) 0.531
       BrowseDayOfYearTaxoFacets       30.40     (32.2%)       31.67     (34.2%)    4.2% ( -47% -  103%) 0.690
            BrowseDateTaxoFacets       30.26     (32.2%)       31.57     (34.4%)    4.3% ( -47% -  104%) 0.680
                         Prefix3      402.14      (8.6%)      419.96      (8.9%)    4.4% ( -12% -   23%) 0.109
                AndMedOrHighHigh       94.79      (4.0%)       99.03      (4.5%)    4.5% (  -3% -   13%) 0.001
     BrowseRandomLabelTaxoFacets       32.45     (49.2%)       35.05     (53.4%)    8.0% ( -63% -  217%) 0.622
           BrowseMonthTaxoFacets       28.68     (35.3%)       31.37     (39.1%)    9.4% ( -48% -  129%) 0.425
            BrowseDateSSDVFacets        3.96     (28.1%)        4.54     (26.3%)   14.7% ( -31% -   96%) 0.089
                   

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-10 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564747#comment-17564747
 ] 

Zach Chen commented on LUCENE-10480:


{quote}I'll see if I can run the original nightly benchmark code / tests from 
my machine to see if there's any difference.
{quote}
I tried to run *nightlyBench.py* locally on my machine over the weekend, but 
that turns out to require some changes to the script itself, and I haven't 
been able to run it fully so far.

On the other hand, I tried a few more run configurations with *localrun.py*, 
including running it in an Ubuntu virtual machine (as the nightly benchmark runs on 
a Linux box), but I still have had no luck reproducing the 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 slowdown. 

[~jpountz], just curious, are you able to reproduce the slowdown locally on 
your end as well?




[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564611#comment-17564611
 ] 

Zach Chen commented on LUCENE-10480:


{quote}[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 recovered fully but 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 only a bit. I'm unsure what explains there is still a slowdown compared to BMW.
{quote}
Hmm, this is quite strange. It looks like 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 still shows an impact of about -13% (5 / 38). I just ran the full suite of 
wikinightly tasks a few times (by copying *wikinightly.tasks* into 
*wikimedium.10M.nostopwords.tasks*, running *localrun.py* with source 
*wikimedium10m*, and removing the *VectorSearch* queries, as they were failing 
with an NPE for me) but couldn't reproduce the slowdown (the baseline is 
head before all BMM changes):

 
{code:java}
                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
     BrowseRandomLabelSSDVFacets       20.83      (3.8%)       20.09      (6.5%)   -3.6% ( -13% -    6%) 0.034
           BrowseMonthSSDVFacets       30.36     (10.6%)       29.56     (12.7%)   -2.7% ( -23% -   23%) 0.473
                         Prefix3      402.70      (9.3%)      397.59      (9.9%)   -1.3% ( -18% -   19%) 0.674
               TermDayOfYearSort      183.55      (6.5%)      181.61      (6.9%)   -1.1% ( -13% -   13%) 0.617
                   TermTitleSort      195.99      (7.2%)      194.25      (8.1%)   -0.9% ( -15% -   15%) 0.713
                        PKLookup      293.80      (3.7%)      291.47      (4.8%)   -0.8% (  -8% -    7%) 0.555
                   TermMonthSort      283.86      (7.1%)      281.74      (8.0%)   -0.7% ( -14% -   15%) 0.755
                        Wildcard      227.26      (6.2%)      225.87      (6.4%)   -0.6% ( -12% -   12%) 0.759
                            Term     2227.50      (3.7%)     2219.57      (3.3%)   -0.4% (  -7% -    6%) 0.748
                          Fuzzy1      134.77      (2.8%)      134.37      (2.3%)   -0.3% (  -5% -    4%) 0.712
                    TermGroup100       53.61      (3.7%)       53.47      (4.6%)   -0.3% (  -8% -    8%) 0.846
                      TermDTSort      143.16      (3.2%)      142.89      (3.3%)   -0.2% (  -6% -    6%) 0.857
                  TermBGroup1M1P       79.44      (5.5%)       79.29      (5.5%)   -0.2% ( -10% -   11%) 0.917
        AndHighHighDayTaxoFacets       45.01      (2.3%)       44.94      (2.1%)   -0.1% (  -4% -    4%) 0.833
     BrowseRandomLabelTaxoFacets       30.94     (50.0%)       30.92     (46.8%)   -0.0% ( -64% -  193%) 0.998
         AndHighMedDayTaxoFacets       78.11      (3.2%)       78.11      (3.0%)   -0.0% (  -6% -    6%) 0.998
                          Phrase      202.17      (2.7%)      202.18      (2.0%)    0.0% (  -4% -    4%) 0.996
                          Fuzzy2       76.10      (2.6%)       76.15      (2.0%)    0.1% (  -4% -    4%) 0.933
                     TermGroup1M       22.65      (3.8%)       22.67      (3.2%)    0.1% (  -6% -    7%) 0.919
                  TermDateFacets       32.50      (5.3%)       32.60      (5.5%)    0.3% (  -9% -   11%) 0.861
       BrowseDayOfYearSSDVFacets       26.31      (5.9%)       26.39      (8.5%)    0.3% ( -13% -   15%) 0.897
                         Respell       88.21      (2.2%)       88.49      (2.1%)    0.3% (  -3% -    4%) 0.642
                        SpanNear       16.14      (4.0%)       16.22      (4.2%)    0.5% (  -7% -    9%) 0.706
            MedTermDayTaxoFacets       73.42      (4.8%)       73.85      (4.9%)    0.6% (  -8% -   10%) 0.708
                    TermBGroup1M       48.92      (4.2%)       49.23      (2.8%)    0.6% (  -6% -    8%) 0.581
                IntervalsOrdered       22.42      (5.8%)       22.59      (4.2%)    0.7% (  -8% -   11%) 0.651
          OrHighMedDayTaxoFacets       25.27      (6.1%)       25.46      (6.6%)    0.7% ( -11% -   14%) 0.711
                    TermGroup10K       30.26      (4.2%)       30.50      (2.9%)    0.8% (  -6% -    8%) 0.494
                    SloppyPhrase       91.40      (5.6%)       92.16      (6.3%)    0.8% ( -10% -   13%) 0.662
                          IntNRQ      152.74     (20.3%)      154.86     (17.1%)    1.4% ( -29% -   48%) 0.815
                      AndHighMed       88.55      (2.6%)       89.98      (3.1%)    1.6% (  -3% -    7%) 0.073
                     AndHighHigh       29.10      (2.7%)       29.68      (3.1%)    2.0% (  -3% -    8%) 0.032
       BrowseDayOfYearTaxoFacets       31.29     (40.0%)       31.93     (38.0%)    2.0% ( -54% -  133%) 0.869
            BrowseDateTaxoFacets       31.18     (40.3%)       31.87     (38.5%)    2.2% ( -54% -  

[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564611#comment-17564611
 ] 

Zach Chen edited comment on LUCENE-10480 at 7/9/22 7:25 PM:


{quote}[AndMedOrHighHigh|https://home.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html]
 recovered fully but 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 only a bit. I'm unsure what explains there is still a slowdown compared to BMW.
{quote}
Hmm, this is quite strange. It looks like 
[AndHighOrMedMed|https://home.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 still shows an impact of about -13% (5 / 38). I just ran the full suite of 
wikinightly tasks a few times (by copying *wikinightly.tasks* into 
*wikimedium.10M.nostopwords.tasks*, running *localrun.py* with source 
*wikimedium10m*, and removing the *VectorSearch* queries, as they were failing 
with an NPE for me) but couldn't reproduce the slowdown (the baseline is 
head before all BMM changes):

 
{code:java}
                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
     BrowseRandomLabelSSDVFacets       20.83      (3.8%)       20.09      (6.5%)   -3.6% ( -13% -    6%) 0.034
           BrowseMonthSSDVFacets       30.36     (10.6%)       29.56     (12.7%)   -2.7% ( -23% -   23%) 0.473
                         Prefix3      402.70      (9.3%)      397.59      (9.9%)   -1.3% ( -18% -   19%) 0.674
               TermDayOfYearSort      183.55      (6.5%)      181.61      (6.9%)   -1.1% ( -13% -   13%) 0.617
                   TermTitleSort      195.99      (7.2%)      194.25      (8.1%)   -0.9% ( -15% -   15%) 0.713
                        PKLookup      293.80      (3.7%)      291.47      (4.8%)   -0.8% (  -8% -    7%) 0.555
                   TermMonthSort      283.86      (7.1%)      281.74      (8.0%)   -0.7% ( -14% -   15%) 0.755
                        Wildcard      227.26      (6.2%)      225.87      (6.4%)   -0.6% ( -12% -   12%) 0.759
                            Term     2227.50      (3.7%)     2219.57      (3.3%)   -0.4% (  -7% -    6%) 0.748
                          Fuzzy1      134.77      (2.8%)      134.37      (2.3%)   -0.3% (  -5% -    4%) 0.712
                    TermGroup100       53.61      (3.7%)       53.47      (4.6%)   -0.3% (  -8% -    8%) 0.846
                      TermDTSort      143.16      (3.2%)      142.89      (3.3%)   -0.2% (  -6% -    6%) 0.857
                  TermBGroup1M1P       79.44      (5.5%)       79.29      (5.5%)   -0.2% ( -10% -   11%) 0.917
        AndHighHighDayTaxoFacets       45.01      (2.3%)       44.94      (2.1%)   -0.1% (  -4% -    4%) 0.833
     BrowseRandomLabelTaxoFacets       30.94     (50.0%)       30.92     (46.8%)   -0.0% ( -64% -  193%) 0.998
         AndHighMedDayTaxoFacets       78.11      (3.2%)       78.11      (3.0%)   -0.0% (  -6% -    6%) 0.998
                          Phrase      202.17      (2.7%)      202.18      (2.0%)    0.0% (  -4% -    4%) 0.996
                          Fuzzy2       76.10      (2.6%)       76.15      (2.0%)    0.1% (  -4% -    4%) 0.933
                     TermGroup1M       22.65      (3.8%)       22.67      (3.2%)    0.1% (  -6% -    7%) 0.919
                  TermDateFacets       32.50      (5.3%)       32.60      (5.5%)    0.3% (  -9% -   11%) 0.861
       BrowseDayOfYearSSDVFacets       26.31      (5.9%)       26.39      (8.5%)    0.3% ( -13% -   15%) 0.897
                         Respell       88.21      (2.2%)       88.49      (2.1%)    0.3% (  -3% -    4%) 0.642
                        SpanNear       16.14      (4.0%)       16.22      (4.2%)    0.5% (  -7% -    9%) 0.706
            MedTermDayTaxoFacets       73.42      (4.8%)       73.85      (4.9%)    0.6% (  -8% -   10%) 0.708
                    TermBGroup1M       48.92      (4.2%)       49.23      (2.8%)    0.6% (  -6% -    8%) 0.581
                IntervalsOrdered       22.42      (5.8%)       22.59      (4.2%)    0.7% (  -8% -   11%) 0.651
          OrHighMedDayTaxoFacets       25.27      (6.1%)       25.46      (6.6%)    0.7% ( -11% -   14%) 0.711
                    TermGroup10K       30.26      (4.2%)       30.50      (2.9%)    0.8% (  -6% -    8%) 0.494
                    SloppyPhrase       91.40      (5.6%)       92.16      (6.3%)    0.8% ( -10% -   13%) 0.662
                          IntNRQ      152.74     (20.3%)      154.86     (17.1%)    1.4% ( -29% -   48%) 0.815
                      AndHighMed       88.55      (2.6%)       89.98      (3.1%)    1.6% (  -3% -    7%) 0.073
                     AndHighHigh       29.10      (2.7%)       29.68      (3.1%)    2.0% (  -3% -    8%) 0.032
       BrowseDayOfYearTaxoFacets       31.29     (40.0%)       31.93     
(38.0%)    2.0% ( -54% -  133%) 0.869
            BrowseDateTaxoFacets       31.18     (40.3%)  

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-06 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563536#comment-17563536
 ] 

Zach Chen commented on LUCENE-10480:


Ok I see. Maybe I can also try to run some benchmark experiments with different 
JVM compilation / code cache parameters to further test things out. Will report 
back if I find something interesting!

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562944#comment-17562944
 ] 

Zach Chen commented on LUCENE-10480:


{quote}maybe there are bits from advance() that we could move to matches() so 
that we would hand it over to the other clause before we start doing expensive 
operations like computing scores.
{quote}
This approach does help stabilize performance for disjunctions within 
conjunctions (and also provides some small gains)! I have opened a PR for 
it: [https://github.com/apache/lucene/pull/1006].
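
To make the quoted idea concrete, here is a minimal sketch in Python (a toy model, not Lucene's actual code) of two-phase iteration: advance() only moves a cursor over candidate doc IDs, while the costly check lives in matches(), so a conjunction can consult the other clause before any expensive work happens. The names TwoPhaseClause, confirm and conjunction are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

NO_MORE_DOCS = 2**31 - 1  # sentinel doc ID for an exhausted clause


@dataclass
class TwoPhaseClause:
    """Toy clause: a cheap approximation plus a costly confirmation step."""
    docs: List[int]                 # sorted candidate doc IDs (the approximation)
    confirm: Callable[[int], bool]  # stands in for expensive work such as scoring
    pos: int = 0
    confirm_calls: int = 0

    def doc_id(self) -> int:
        return self.docs[self.pos] if self.pos < len(self.docs) else NO_MORE_DOCS

    def advance(self, target: int) -> int:
        # Cheap: only moves the cursor, never calls confirm().
        while self.pos < len(self.docs) and self.docs[self.pos] < target:
            self.pos += 1
        return self.doc_id()

    def matches(self) -> bool:
        # Expensive part, deferred until the approximations agree on a doc.
        self.confirm_calls += 1
        return self.confirm(self.doc_id())


def conjunction(lead: TwoPhaseClause, other: TwoPhaseClause) -> List[int]:
    """Intersect two clauses, calling matches() only on shared candidates."""
    hits = []
    doc = lead.advance(0)
    while doc != NO_MORE_DOCS:
        other_doc = other.advance(doc)
        if other_doc == doc:
            if lead.matches() and other.matches():
                hits.append(doc)
            doc = lead.advance(doc + 1)
        else:
            doc = lead.advance(other_doc)
    return hits


always = TwoPhaseClause([3, 5, 8, 9], confirm=lambda doc: True)
costly = TwoPhaseClause([1, 3, 5, 7, 9], confirm=lambda doc: doc % 3 != 0)
hits = conjunction(always, costly)
# The costly check runs only for docs 3, 5 and 9, never for 1 and 7.
print(hits, costly.confirm_calls)
```

In this toy setup the second clause has five candidates but its expensive check is invoked only for the three that the first clause also proposes, which is the effect the quote is after.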

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562919#comment-17562919
 ] 

Zach Chen edited comment on LUCENE-10480 at 7/6/22 2:15 AM:


{quote}Nightly benchmarks picked up the change and top-level disjunctions are 
seeing massive speedups, see 
[OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] 
or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. 
However disjunctions within conjunctions got a slowdown, see 
[AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 or 
[AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}
The results look encouraging and interesting! I copied and pasted the boolean 
queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, ran 
the benchmark, and was able to reproduce the slowdown: 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed      108.16      (6.5%)      100.44      (5.4%)   -7.1% ( -17% -    5%) 0.000
                AndMedOrHighHigh       68.37      (4.5%)       63.92      (5.0%)   -6.5% ( -15% -    3%) 0.000
                     AndHighHigh      122.90      (5.5%)      122.77      (5.5%)   -0.1% ( -10% -   11%) 0.952
                      AndHighMed      113.27      (6.4%)      114.63      (6.2%)    1.2% ( -10% -   14%) 0.546
                        PKLookup      228.08     (14.4%)      232.90     (14.7%)    2.1% ( -23% -   36%) 0.646
                      OrHighHigh       26.89      (5.7%)       48.62     (12.2%)   80.8% (  59% -  104%) 0.000
                       OrHighMed       81.18      (5.9%)      187.05     (12.2%)  130.4% ( 105% -  157%) 0.000
{code}
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       85.67      (5.3%)       73.23      (5.7%)  -14.5% ( -24% -   -3%) 0.000
                        PKLookup      260.08     (13.4%)      253.74     (14.9%)   -2.4% ( -27% -   29%) 0.586
                     AndHighHigh       73.68      (4.7%)       72.70      (4.1%)   -1.3% (  -9% -    7%) 0.339
                      AndHighMed       89.52      (5.1%)       88.55      (4.4%)   -1.1% ( -10% -    8%) 0.470
                 AndHighOrMedMed       63.27      (6.5%)       70.48      (5.7%)   11.4% (   0% -   25%) 0.000
                      OrHighHigh       19.60      (5.3%)       25.62      (7.6%)   30.8% (  16% -   46%) 0.000
                       OrHighMed      121.08      (5.7%)      236.34     (10.2%)   95.2% (  74% -  117%) 0.000
{code}
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       86.88      (3.4%)       76.60      (3.1%)  -11.8% ( -17% -   -5%) 0.000
                     AndHighHigh       30.49      (3.5%)       30.36      (3.5%)   -0.4% (  -7% -    6%) 0.697
                      AndHighMed      192.76      (3.4%)      193.72      (3.9%)    0.5% (  -6% -    8%) 0.671
                        PKLookup      262.59      (5.5%)      264.52      (7.9%)    0.7% ( -11% -   14%) 0.731
                 AndHighOrMedMed       65.47      (3.8%)       73.43      (3.0%)   12.2% (   5% -   19%) 0.000
                      OrHighHigh       21.47      (4.1%)       36.94      (8.3%)   72.1% (  57% -   88%) 0.000
                       OrHighMed       99.91      (4.3%)      292.05     (12.9%)  192.3% ( 167% -  218%) 0.000
{code}
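
For readers less familiar with this output, the Pct diff column is consistent with the relative change in mean QPS between the baseline and the modified version. The short Python check below recomputes it for three rows of the first table above; the helper name pct_diff is my own, and how the bracketed range and the p-value are derived is not covered here.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change of the candidate's mean QPS versus the baseline, in percent."""
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# (baseline QPS, my_modified_version QPS) copied from the first table above.
rows = {
    "AndHighOrMedMed": (108.16, 100.44),
    "OrHighHigh": (26.89, 48.62),
    "OrHighMed": (81.18, 187.05),
}

for task, (base, cand) in rows.items():
    print(f"{task:>16} {pct_diff(base, cand):+.1f}%")
```

A negative value means the modified version served fewer queries per second than the baseline for that task.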
 

However, when I narrowed the tasks down to just conjunction + disjunction 
queries (with the default number of search threads), the results actually 
turned positive and were similar to what I saw earlier in 
[https://github.com/apache/lucene/pull/972#issuecomment-1166188875]: 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       58.65     (37.3%)       71.63     (28.9%)   22.1% ( -32% -  140%) 0.036
                AndMedOrHighHigh       36.43     (39.3%)       44.61     (30.7%)   22.4% ( -34% -  152%) 0.044
                        PKLookup      163.58     (34.4%)      211.88     (32.7%)   29.5% ( -27% -  147%) 0.005
{code}
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      146.51     (22.0%)      188.92     (30.1%)   28.9% ( -18% -  103%) 0.001
                AndMedOrHighHigh       35.59     (27.1%)       49.99     (37.5%)   40.4% ( -18% -  144%) 0.000
                 AndHighOrMedMed       44.47     (26.6%)    

[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-07-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562919#comment-17562919
 ] 

Zach Chen commented on LUCENE-10480:


{quote}Nightly benchmarks picked up the change and top-level disjunctions are 
seeing massive speedups, see 
[OrHighHigh|http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html] 
or [OrHighMed|http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html]. 
However disjunctions within conjunctions got a slowdown, see 
[AndHighOrMedMed|http://people.apache.org/~mikemccand/lucenebench/AndHighOrMedMed.html]
 or 
[AndMedOrHighHigh|http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html].
{quote}
The results look encouraging and interesting! I copied and pasted the boolean 
queries from *wikinightly.tasks* into *wikimedium.10M.nostopwords.tasks*, ran 
the benchmark, and was able to reproduce the slowdown: 

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed      108.16      (6.5%)      100.44      (5.4%)   -7.1% ( -17% -    5%) 0.000
                AndMedOrHighHigh       68.37      (4.5%)       63.92      (5.0%)   -6.5% ( -15% -    3%) 0.000
                     AndHighHigh      122.90      (5.5%)      122.77      (5.5%)   -0.1% ( -10% -   11%) 0.952
                      AndHighMed      113.27      (6.4%)      114.63      (6.2%)    1.2% ( -10% -   14%) 0.546
                        PKLookup      228.08     (14.4%)      232.90     (14.7%)    2.1% ( -23% -   36%) 0.646
                      OrHighHigh       26.89      (5.7%)       48.62     (12.2%)   80.8% (  59% -  104%) 0.000
                       OrHighMed       81.18      (5.9%)      187.05     (12.2%)  130.4% ( 105% -  157%) 0.000
{code}
 

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       85.67      (5.3%)       73.23      (5.7%)  -14.5% ( -24% -   -3%) 0.000
                        PKLookup      260.08     (13.4%)      253.74     (14.9%)   -2.4% ( -27% -   29%) 0.586
                     AndHighHigh       73.68      (4.7%)       72.70      (4.1%)   -1.3% (  -9% -    7%) 0.339
                      AndHighMed       89.52      (5.1%)       88.55      (4.4%)   -1.1% ( -10% -    8%) 0.470
                 AndHighOrMedMed       63.27      (6.5%)       70.48      (5.7%)   11.4% (   0% -   25%) 0.000
                      OrHighHigh       19.60      (5.3%)       25.62      (7.6%)   30.8% (  16% -   46%) 0.000
                       OrHighMed      121.08      (5.7%)      236.34     (10.2%)   95.2% (  74% -  117%) 0.000
{code}
 

 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                AndMedOrHighHigh       86.88      (3.4%)       76.60      (3.1%)  -11.8% ( -17% -   -5%) 0.000
                     AndHighHigh       30.49      (3.5%)       30.36      (3.5%)   -0.4% (  -7% -    6%) 0.697
                      AndHighMed      192.76      (3.4%)      193.72      (3.9%)    0.5% (  -6% -    8%) 0.671
                        PKLookup      262.59      (5.5%)      264.52      (7.9%)    0.7% ( -11% -   14%) 0.731
                 AndHighOrMedMed       65.47      (3.8%)       73.43      (3.0%)   12.2% (   5% -   19%) 0.000
                      OrHighHigh       21.47      (4.1%)       36.94      (8.3%)   72.1% (  57% -   88%) 0.000
                       OrHighMed       99.91      (4.3%)      292.05     (12.9%)  192.3% ( 167% -  218%) 0.000
{code}
 

 

However, when I narrowed the tasks down to just conjunction + disjunction 
queries (with the default number of search threads), the results actually 
turned positive and were similar to what I saw earlier in 
[https://github.com/apache/lucene/pull/972#issuecomment-1166188875]: 
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 AndHighOrMedMed       58.65     (37.3%)       71.63     (28.9%)   22.1% ( -32% -  140%) 0.036
                AndMedOrHighHigh       36.43     (39.3%)       44.61     (30.7%)   22.4% ( -34% -  152%) 0.044
                        PKLookup      163.58     (34.4%)      211.88     (32.7%)   29.5% ( -27% -  147%) 0.005
{code}
{code:java}
                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      146.51     (22.0%)      188.92     (30.1%)   28.9% ( -18% -  103%) 0.001
                AndMedOrHighHigh       35.59     (27.1%)       49.99     (37.5%)   40.4% ( -18% -  144%) 0.000
                 AndHighOrMedMed       44.47     (26.6%)       63.37     (35.8%)   

[jira] [Commented] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added

2022-07-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561789#comment-17561789
 ] 

Zach Chen commented on LUCENE-10635:


I like this idea! This approach should also be able to preserve most of the 
assertions in the test utilities. I can give it a try and see how things might 
look.

> Ensure test coverage for WANDScorer after additional scorers get added
> --
>
> Key: LUCENE-10635
> URL: https://issues.apache.org/jira/browse/LUCENE-10635
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Zach Chen
>Priority: Major
>
> This is a follow-up issue from discussions 
> [https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & 
> [https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] .
>  
> As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in 
> TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer 
> instead, reducing test coverage for WANDScorer. We would like to see how we 
> can ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating 
> the scorer directly inside the tests?






[jira] [Created] (LUCENE-10636) Could the partial score sum from essential list scores be cached?

2022-06-30 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10636:
--

 Summary: Could the partial score sum from essential list scores be 
cached?
 Key: LUCENE-10636
 URL: https://issues.apache.org/jira/browse/LUCENE-10636
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Zach Chen


This is a follow-up issue from discussion 
[https://github.com/apache/lucene/pull/972#discussion_r909300200] . Currently 
in the implementation of BlockMaxMaxscoreScorer, there's duplicated computation 
of summing up scores from essential list scorers. We would like to see if this 
duplicated computation can be cached without introducing much overhead or data 
structure that might outweigh the benefit of caching.
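
A minimal sketch of the caching idea, assuming the essential-list sum only needs to be remembered for the document currently being scored. The class `EssentialScoreCache`, its fields, and the use of `IntToDoubleFunction` as a stand-in for a per-clause scorer are all hypothetical and are not taken from BlockMaxMaxscoreScorer:

```java
import java.util.List;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch: remember the essential-list score sum for the current doc. */
final class EssentialScoreCache {
  // Stand-ins for the essential-list scorers; each maps a doc ID to a score.
  private final List<IntToDoubleFunction> essentialScorers;
  private int cachedDoc = -1; // no doc cached yet
  private double cachedSum;
  int computations; // counts real summations, for demonstration only

  EssentialScoreCache(List<IntToDoubleFunction> essentialScorers) {
    this.essentialScorers = essentialScorers;
  }

  /** Returns the sum of essential scores for doc, recomputing only when doc changes. */
  double essentialScore(int doc) {
    if (doc != cachedDoc) {
      double sum = 0;
      for (IntToDoubleFunction scorer : essentialScorers) {
        sum += scorer.applyAsDouble(doc);
      }
      cachedSum = sum;
      cachedDoc = doc;
      computations++;
    }
    return cachedSum;
  }
}
```

The extra state is one int and one double, so the bookkeeping cost is a single comparison per call; whether that actually pays off in the real scorer would still need benchmarking.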






[jira] [Created] (LUCENE-10635) Ensure test coverage for WANDScorer after additional scorers get added

2022-06-30 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10635:
--

 Summary: Ensure test coverage for WANDScorer after additional 
scorers get added
 Key: LUCENE-10635
 URL: https://issues.apache.org/jira/browse/LUCENE-10635
 Project: Lucene - Core
  Issue Type: Test
Reporter: Zach Chen


This is a follow-up issue from discussions 
[https://github.com/apache/lucene/pull/972#issuecomment-1170684358] & 
[https://github.com/apache/lucene/pull/972#pullrequestreview-1024377641] .

 

As additional scorers such as BlockMaxMaxscoreScorer get added, some tests in 
TestWANDScorer that used to test WANDScorer now test BlockMaxMaxscoreScorer 
instead, reducing test coverage for WANDScorer. We would like to see how we can 
ensure TestWANDScorer reliably tests WANDScorer, perhaps by instantiating the 
scorer directly inside the tests?






[jira] [Resolved] (LUCENE-10411) Add NN vectors support to ExitableDirectoryReader

2022-06-09 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen resolved LUCENE-10411.

  Assignee: Zach Chen
Resolution: Implemented

> Add NN vectors support to ExitableDirectoryReader
> -
>
> Key: LUCENE-10411
> URL: https://issues.apache.org/jira/browse/LUCENE-10411
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Zach Chen
>Priority: Minor
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This is currently unsupported.






[jira] [Comment Edited] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552556#comment-17552556
 ] 

Zach Chen edited comment on LUCENE-10480 at 6/10/22 5:15 AM:
-

Hi [~jpountz] , this issue reminded me of our experiments last year 
implementing BMM scorer for pure disjunction, which [showed about 20% ~ 40% 
improvement for OrHighHigh and OrHighMed 
queries|https://github.com/apache/lucene/pull/101#issuecomment-840255508] . Do 
you think we should continue to explore in that direction, or might there be 
better / simpler algorithms we could look into?


was (Author: zacharymorn):
Hi [~jpountz] , this issue reminded me of our experiments last year 
implementing BMM scorer for pure disjunction, which [showed about 20% ~ 40% 
improvement for OrHighHigh and OrHighMed 
queries|[https://github.com/apache/lucene/pull/101#issuecomment-840255508].] Do 
you think we should continue to explore in that direction, or might there be 
better / simpler algorithms we could look into?

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?
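
To illustrate why the two-clause case needs none of the bookkeeping listed in the description, here is a minimal sketch of a two-clause disjunction over sorted doc-ID arrays. It is not the scorer that was eventually committed: the class `TwoClauseDisjunction` is hypothetical, and scoring and dynamic pruning are left out entirely.

```java
/** Hypothetical sketch of a two-clause disjunction over sorted doc-ID arrays. */
final class TwoClauseDisjunction {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] a; // sorted doc IDs matching the first clause
  private final int[] b; // sorted doc IDs matching the second clause
  private int i = 0;
  private int j = 0;

  TwoClauseDisjunction(int[] a, int[] b) {
    this.a = a;
    this.b = b;
  }

  /** Returns the next doc matching either clause, or NO_MORE_DOCS when exhausted. */
  int nextDoc() {
    int da = i < a.length ? a[i] : NO_MORE_DOCS;
    int db = j < b.length ? b[j] : NO_MORE_DOCS;
    int doc = Math.min(da, db);
    if (doc == NO_MORE_DOCS) {
      return doc;
    }
    // Advance whichever clause(s) sit on the returned doc; no heap or list is needed.
    if (da == doc) i++;
    if (db == doc) j++;
    return doc;
  }
}
```

With only two clauses, "which scorer is behind" reduces to comparing two integers, which is the simplification the issue is after.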






[jira] [Commented] (LUCENE-10480) Specialize 2-clauses disjunctions

2022-06-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552556#comment-17552556
 ] 

Zach Chen commented on LUCENE-10480:


Hi [~jpountz] , this issue reminded me of our experiments last year 
implementing BMM scorer for pure disjunction, which [showed about 20% ~ 40% 
improvement for OrHighHigh and OrHighMed 
queries|[https://github.com/apache/lucene/pull/101#issuecomment-840255508].] Do 
you think we should continue to explore in that direction, or might there be 
better / simpler algorithms we could look into?

> Specialize 2-clauses disjunctions
> -
>
> Key: LUCENE-10480
> URL: https://issues.apache.org/jira/browse/LUCENE-10480
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> WANDScorer is nice, but it also has lots of overhead to maintain its 
> invariants: one linked list for the current candidates, one priority queue of 
> scorers that are behind, another one for scorers that are ahead. All this 
> could be simplified in the 2-clauses case, which feels worth specializing for 
> as it's very common that end users enter queries that only have two terms?






[jira] [Resolved] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?

2022-04-24 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen resolved LUCENE-10436.

Resolution: Done

> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> --
>
> Key: LUCENE-10436
> URL: https://issues.apache.org/jira/browse/LUCENE-10436
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge 
> DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require 
> that the field indexes either norms, doc values or vectors?






[jira] [Assigned] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?

2022-04-24 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen reassigned LUCENE-10436:
--

Assignee: Zach Chen

> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> --
>
> Key: LUCENE-10436
> URL: https://issues.apache.org/jira/browse/LUCENE-10436
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Zach Chen
>Priority: Minor
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge 
> DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require 
> that the field indexes either norms, doc values or vectors?






[jira] [Commented] (LUCENE-10411) Add NN vectors support to ExitableDirectoryReader

2022-04-24 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527308#comment-17527308
 ] 

Zach Chen commented on LUCENE-10411:


Hi [~jpountz] , I have created a PR for this. Could you please take a look and 
let me know your thoughts?

> Add NN vectors support to ExitableDirectoryReader
> -
>
> Key: LUCENE-10411
> URL: https://issues.apache.org/jira/browse/LUCENE-10411
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is currently unsupported.






[jira] [Commented] (LUCENE-10436) Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery into a single FieldExistsQuery?

2022-03-25 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512660#comment-17512660
 ] 

Zach Chen commented on LUCENE-10436:


Hi [~jpountz] , I took a look and created a PR for this 
[https://github.com/apache/lucene/pull/767] . Could you please let me know if 
it looks good to you?

> Combine DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery into a single FieldExistsQuery?
> --
>
> Key: LUCENE-10436
> URL: https://issues.apache.org/jira/browse/LUCENE-10436
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now that we require consistency across data structures, we could merge 
> DocValuesFieldExistsQuery, NormsFieldExistsQuery and 
> KnnVectorFieldExistsQuery together into a FieldExistsQuery that would require 
> that the field indexes either norms, doc values or vectors?






[jira] [Resolved] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2022-02-01 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen resolved LUCENE-10236.

Resolution: Fixed

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing 
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> This is a spin-off issue from discussion in 
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a 
> quick fix in CombinedFieldsQuery scoring.
> Currently CombinedFieldsQuery would use a constructed 
> [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
>  object to create a MultiNormsLeafSimScorer for scoring, but the fields 
> object may contain duplicated field-weight pairs as it is [built from looping 
> over 
> fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
>  resulting in duplicated norms being added during scoring calculation in 
> MultiNormsLeafSimScorer. 
> E.g. for CombinedFieldsQuery with two fields and two values matching a 
> particular doc:
> {code:java}
> CombinedFieldQuery query =
>     new CombinedFieldQuery.Builder()
>         .addField("field1", (float) 1.0)
>         .addField("field2", (float) 1.0)
>         .addTerm(new BytesRef("foo"))
>         .addTerm(new BytesRef("zoo"))
>         .build();
> {code}
> I would imagine the scoring to be based on the following:
>  # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + 
> freq(field1:zoo) + freq(field2:zoo)
>  # Sum of norms on doc = norm(field1) + norm(field2)
> but the current logic would use the following for scoring:
>  # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + 
> freq(field1:zoo) + freq(field2:zoo)
>  # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
> norm(field2)
>  
> In addition, this differs from how MultiNormsLeafSimScorer is constructed 
> from CombinedFieldsQuery explain function, which [uses 
> fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
>  and does not contain duplicated field-weight pairs. 






[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2022-02-01 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485556#comment-17485556
 ] 

Zach Chen commented on LUCENE-9662:
---

I've approved the null check PR. Thanks [~mdrob] !

For resolving this issue, I think so? So far the implementation has 
parallelized checking across segments, but within each segment it's still 
sequential. We initially started by parallelizing within each segment, but 
found the speed-up to be limited, as it is dominated by checking the biggest 
parts within a segment (typically the postings file checked by `testPostings`). We 
could potentially look into breaking that up to smaller pieces to increase 
parallelization, but not sure if it's worth the effort / complexity in code. 
What do you think [~mikemccand] ? 
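
For readers unfamiliar with the approach, the "parallel across segments, sequential within a segment" structure can be sketched roughly as follows. This is not CheckIndex code: `PerSegmentCheck`, `checkSegment`, and the string results are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: one task per segment, each task checking its parts in sequence. */
final class PerSegmentCheck {
  /** Placeholder for the sequential checks (postings, doc values, norms, ...) of one segment. */
  static String checkSegment(String segmentName) {
    return segmentName + ": OK";
  }

  /** Checks all segments on a fixed-size pool and returns results in segment order. */
  static List<String> checkAll(List<String> segments, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String segment : segments) {
        Callable<String> task = () -> checkSegment(segment);
        futures.add(pool.submit(task));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        results.add(future.get()); // waiting in submission order keeps output deterministic
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("segment check failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

In this shape the wall-clock time is bounded below by the slowest single segment, which is why one very large segment limits the benefit.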

> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 20h 50m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If we wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".






[jira] [Commented] (LUCENE-10183) KnnVectorsWriter#writeField should take a KnnVectorsReader, not a VectorValues instance

2021-12-10 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457534#comment-17457534
 ] 

Zach Chen commented on LUCENE-10183:


Hi [~jpountz] , I've opened a PR for this issue. Please let me know if it looks 
good to you.

> KnnVectorsWriter#writeField should take a KnnVectorsReader, not a 
> VectorValues instance
> ---
>
> Key: LUCENE-10183
> URL: https://issues.apache.org/jira/browse/LUCENE-10183
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> By taking a VectorValues instance, KnnVectorsWriter#write doesn't let 
> implementations iterate over vectors multiple times if needed. It should take 
> a KnnVectorsReader, similarly to doc values, where the writer takes a 
> DocValuesProducer.
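
The motivation in the description can be illustrated with a small sketch: a writer that receives a source of fresh iterators can make two passes, while one that receives a single iterator cannot. The types below are stand-ins (`Supplier<Iterator<float[]>>` plays the role of a reader); none of them are Lucene APIs, and `TwoPassVectorSketch` is a hypothetical name.

```java
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical sketch: a two-pass computation that needs a re-openable vector source. */
final class TwoPassVectorSketch {
  /** Pass 1 computes the mean of the first dimension; pass 2 counts vectors above it. */
  static int countAboveMean(Supplier<Iterator<float[]>> vectors) {
    double sum = 0;
    int count = 0;
    for (Iterator<float[]> it = vectors.get(); it.hasNext(); ) {
      sum += it.next()[0];
      count++;
    }
    if (count == 0) {
      return 0;
    }
    double mean = sum / count;
    int above = 0;
    // A second, independent iteration is only possible because the source can be re-opened.
    for (Iterator<float[]> it = vectors.get(); it.hasNext(); ) {
      if (it.next()[0] > mean) {
        above++;
      }
    }
    return above;
  }
}
```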






[jira] [Updated] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2021-11-15 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-10236:
---
Description: 
This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery would use a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs as it is [built from looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 resulting in duplicated norms being added during scoring calculation in 
MultiNormsLeafSimScorer. 

E.g. for CombinedFieldsQuery with two fields and two values matching a 
particular doc:
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}
I would imagine the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic would use the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)

 

In addition, this differs from how MultiNormsLeafSimScorer is constructed from 
CombinedFieldsQuery explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and does not contain duplicated field-weight pairs. 

  was:
This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery would use a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs as it is [built from looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 resulting in duplicated norms being added during scoring calculation in 
MultiNormsLeafSimScorer. 

E.g. for CombinedFieldsQuery with two fields and two values matching a 
particular doc:

 
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}
 

I would imagine the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic would use the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed from 
CombinedFieldsQuery explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and does not contain duplicated field-weight pairs. 


> CombinedFieldsQuery to use fieldAndWeights.values() when constructing 
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>
> This is a spin-off issue from discussion in 
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a 
> quick fix in CombinedFieldsQuery scoring.
> Currently 

[jira] [Created] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2021-11-15 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10236:
--

 Summary: CombinedFieldsQuery to use fieldAndWeights.values() when 
constructing MultiNormsLeafSimScorer for scoring
 Key: LUCENE-10236
 URL: https://issues.apache.org/jira/browse/LUCENE-10236
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/sandbox
Reporter: Zach Chen
Assignee: Zach Chen


This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery would use a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs as it is [built from looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 resulting in duplicated norms being added during scoring calculation in 
MultiNormsLeafSimScorer. 

E.g. for CombinedFieldsQuery with two fields and two values matching a 
particular doc:

 
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}
 

I would imagine the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic would use the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed from 
CombinedFieldsQuery explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and does not contain duplicated field-weight pairs. 
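
The double counting described above can be reproduced in isolation. The sketch below does not use Lucene classes: `NormSumSketch` and both methods are hypothetical, field weights are ignored for brevity, and the norm values are made up for the example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of summing norms per (field, term) pair versus per distinct field. */
final class NormSumSketch {
  /** Adds a field's norm once per term, mirroring a list built by looping over field terms. */
  static double sumOverFieldTerms(
      Map<String, Double> norms, List<String> fields, List<String> terms) {
    double sum = 0;
    for (String field : fields) {
      for (String term : terms) {
        sum += norms.get(field); // the same field norm is added again for every term
      }
    }
    return sum;
  }

  /** Adds each field's norm exactly once, mirroring iteration over a field-to-weight map. */
  static double sumOverDistinctFields(Map<String, Double> norms, List<String> fields) {
    double sum = 0;
    for (String field : new LinkedHashSet<>(fields)) {
      sum += norms.get(field);
    }
    return sum;
  }
}
```

With two fields and two terms, the first method counts every norm twice, which matches the "norm(field1) + norm(field2) + norm(field1) + norm(field2)" case listed in the description.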






[jira] [Commented] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery

2021-11-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444254#comment-17444254
 ] 

Zach Chen commented on LUCENE-10212:


No problem [~julietibs] ! Glad to be able to contribute! 

> Add luceneutil benchmark task for CombinedFieldsQuery
> -
>
> Key: LUCENE-10212
> URL: https://issues.apache.org/jira/browse/LUCENE-10212
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>
> This is a spin-off task from 
> https://issues.apache.org/jira/browse/LUCENE-10061 . In order to objectively 
> evaluate performance changes for CombinedFieldsQuery, we would like to add a 
> benchmark task and parsing for CombinedFieldsQuery.
> One proposal for the query syntax to enable CombinedFieldsQuery benchmarking 
> would be the following:
> {code:java}
> taskName: term1 term2 term3 term4 
> +combinedFields=field1^boost1,field2^boost2,field3^boost3
> {code}
>  
>  
>  
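
As a rough illustration of the proposed syntax, a task line could be parsed along these lines. This is a sketch, not the luceneutil parser: the class `CombinedFieldsTaskLine` is hypothetical, and it assumes the task is written on a single line (the line break in the quoted example is taken to be mail wrapping).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for "taskName: term1 term2 +combinedFields=field1^boost1,field2^boost2". */
final class CombinedFieldsTaskLine {
  private static final String MARKER = "+combinedFields=";

  final String taskName;
  final String[] terms;
  final Map<String, Float> fieldBoosts = new LinkedHashMap<>(); // keeps declaration order

  CombinedFieldsTaskLine(String line) {
    int colon = line.indexOf(':');
    int marker = line.indexOf(MARKER);
    if (colon < 0 || marker < colon) {
      throw new IllegalArgumentException("bad task line: " + line);
    }
    taskName = line.substring(0, colon).trim();
    // Everything between the colon and the marker is the whitespace-separated term list.
    terms = line.substring(colon + 1, marker).trim().split("\\s+");
    String spec = line.substring(marker + MARKER.length()).trim();
    for (String part : spec.split(",")) {
      int caret = part.indexOf('^');
      if (caret < 0) {
        throw new IllegalArgumentException("missing boost in: " + part);
      }
      fieldBoosts.put(part.substring(0, caret), Float.parseFloat(part.substring(caret + 1)));
    }
  }
}
```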






[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
I think in my current understanding and testing of CombinedFieldQuery, 
WANDScorer is actually not used there ([it doesn't get written to BooleanQuery 
for most of the 
time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
 In addition, the PR is already doing Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good will give that a try!


was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
I think in my current understanding and testing of CombinedFieldQuery, 
WANDScorer is not used there. In addition, the PR is already doing 
Maxscore-like calculation based on competitive impacts to skip docs. Am I 
missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good will give that a try!

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.
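
To show what score upper bounds buy during top-k collection, here is a simplified sketch of block-level skipping. It is not Lucene code and not the approach in the PR: `BlockMaxTopK` is hypothetical, scores are given as plain arrays, and `blockMax[b]` is assumed to be a precomputed upper bound for block `b`.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch of block-level dynamic pruning for top-k collection. */
final class BlockMaxTopK {
  /**
   * Collects the top k scores and returns how many blocks actually had to be scored.
   * blockMax[b] must be an upper bound on every score in blockScores[b].
   */
  static int scoredBlocks(float[][] blockScores, float[] blockMax, int k) {
    if (k <= 0) {
      return 0;
    }
    PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap: head is the k-th best score
    int scored = 0;
    for (int b = 0; b < blockScores.length; b++) {
      // Once k hits are collected, a block whose upper bound cannot beat the
      // k-th best score is skipped without scoring any of its documents.
      if (topK.size() == k && blockMax[b] <= topK.peek()) {
        continue;
      }
      scored++;
      for (float score : blockScores[b]) {
        if (topK.size() < k) {
          topK.add(score);
        } else if (score > topK.peek()) {
          topK.poll();
          topK.add(score);
        }
      }
    }
    return scored;
  }
}
```

Without any upper bound available, the skip condition can never fire and every block has to be scored, which is the situation the issue describes.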






[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
I think in my current understanding and testing of CombinedFieldQuery, 
WANDScorer is actually not used there ([it very much doesn't get re-written to 
BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
 In addition, the PR is already doing Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good will give that a try!


was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
I think in my current understanding and testing of CombinedFieldQuery, 
WANDScorer is actually not used there ([it doesn't get written to BooleanQuery 
for most of the 
time|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
 In addition, the PR is already doing Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good will give that a try!

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873
 ] 

Zach Chen commented on LUCENE-10061:


{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
From my current understanding and testing of CombinedFieldQuery, WANDScorer is 
not used there. In addition, the PR already does a Maxscore-like calculation 
based on competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.






[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-04 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/5/21, 4:50 AM:
--

Hi [~jpountz], I've implemented a quick optimization to replace combinatorial 
calculation with an upper-bound approximation 
([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]).
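The linked commit is not reproduced here. As a rough illustration only, one way such an upper bound could be formed is to take, per field, the highest freq and the lowest norm among its competitive impacts and sum those, which yields a single merged impact that no real combination can beat. The class name, the array layout, and the assumption that a higher freq and a lower norm never decrease the score are all hypothetical and may differ from what the PR actually does.

```java
/** Hypothetical sketch of an upper-bound impact; not taken from the Lucene PR. */
final class ImpactUpperBoundSketch {

  /**
   * impactsPerField[f][i] is a {freq, norm} pair for field f.
   * Returns {summed max freq, summed min norm}, assuming higher freq and
   * lower norm never lower the score.
   */
  static long[] upperBound(long[][][] impactsPerField) {
    long freqSum = 0;
    long normSum = 0;
    for (long[][] fieldImpacts : impactsPerField) {
      if (fieldImpacts.length == 0) {
        continue; // a field without impacts contributes nothing
      }
      long maxFreq = 0;
      long minNorm = Long.MAX_VALUE;
      for (long[] impact : fieldImpacts) {
        maxFreq = Math.max(maxFreq, impact[0]);
        minNorm = Math.min(minNorm, impact[1]);
      }
      freqSum += maxFreq;
      normSum += minNorm;
    }
    return new long[] {freqSum, normSum};
  }

  public static void main(String[] args) {
    long[][][] impacts = {
      {{1, 10}, {3, 30}}, // field 1
      {{2, 5}, {4, 50}}   // field 2
    };
    long[] bound = upperBound(impacts);
    System.out.println("freq=" + bound[0] + ", norm=" + bound[1]);
  }
}
```

The trade-off in a sketch like this is that the work is linear in the number of impacts, at the price of a looser bound that may skip fewer documents.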

With this and other bug fixes / optimizations based on CPU profiler, I was able 
to get the following performance test results (perf test index rebuilt to 
enable norm for title field, task file attached, and luceneutil integration 
available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
# Run 1
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         4.64  (6.5%)                    3.30  (4.7%)  -29.0% ( -37% -  -19%)  0.000
CFQHighHigh            11.09  (6.0%)                    9.61  (6.0%)  -13.3% ( -23% -   -1%)  0.000
PKLookup              103.38  (4.4%)                  108.04  (4.3%)    4.5% (  -4% -   13%)  0.001
CFQHighMedLow          10.58  (6.1%)                   12.30  (8.7%)   16.2% (   1% -   33%)  0.000
CFQHighMed             10.70  (7.4%)                   15.51 (11.2%)   44.9% (  24% -   68%)  0.000
CFQHighLowLow           8.18  (8.2%)                   12.87 (11.6%)   57.3% (  34% -   84%)  0.000
CFQHighLow             14.57  (7.5%)                   30.81 (15.1%)  111.4% (  82% -  144%)  0.000

# Run 2
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         5.33  (5.7%)                    4.02  (7.7%)  -24.4% ( -35% -  -11%)  0.000
CFQHighLowLow          17.14  (6.2%)                   13.06  (5.4%)  -23.8% ( -33% -  -13%)  0.000
CFQHighMed             17.37  (5.8%)                   14.38  (7.7%)  -17.2% ( -29% -   -3%)  0.000
PKLookup              103.57  (5.5%)                  108.84  (5.9%)    5.1% (  -6% -   17%)  0.005
CFQHighMedLow          11.25  (7.2%)                   12.70  (9.0%)   12.9% (  -3% -   31%)  0.000
CFQHighHigh             5.00  (6.2%)                    7.54 (12.1%)   51.0% (  30% -   73%)  0.000
CFQHighLow             21.60  (5.2%)                   34.57 (14.1%)   60.0% (  38% -   83%)  0.000

# Run 3
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         5.40  (6.9%)                    4.06  (5.1%)  -24.8% ( -34% -  -13%)  0.000
CFQHighMedLow           7.64  (7.4%)                    5.79  (6.3%)  -24.2% ( -35% -  -11%)  0.000
CFQHighHigh            11.11  (7.0%)                    9.60  (5.9%)  -13.6% ( -24% -    0%)  0.000
CFQHighLowLow          21.21  (7.6%)                   21.22  (6.6%)    0.0% ( -13% -   15%)  0.993
PKLookup              103.15  (5.9%)                  107.60  (6.9%)    4.3% (  -8% -   18%)  0.034
CFQHighLow             21.85  (8.1%)                   34.18 (13.5%)   56.4% (  32% -   84%)  0.000
CFQHighMed             12.07  (8.4%)                   19.98 (16.7%)   65.5% (  37% -   98%)  0.000

# Run 4
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHigh             8.50  (5.8%)                    6.85  (5.2%)  -19.5% ( -28% -   -8%)  0.000
CFQHighMedLow          10.89  (5.7%)                    8.96  (5.4%)  -17.8% ( -27% -   -7%)  0.000
CFQHighMed              8.41  (5.8%)                    7.74  (5.6%)   -7.9% ( -18% -    3%)  0.000
CFQHighHighHigh         3.45  (6.7%)                    3.38  (5.3%)   -2.0% ( -13% -   10%)  0.287
CFQHighLowLow           7.82  (6.4%)                    8.20  (7.5%)    4.8% (  -8% -   20%)  0.030
PKLookup              103.50  (5.0%)                  110.69  (5.4%)    6.9% (  -3% -   18%)  0.000
CFQHighLow             11.46  (6.0%)                   13.16  (6.7%)   14.8% (   1% -   29%)  0.000
{code}
Overall, I think this shows that pruning is most effective when there's a 
significant difference between the terms' frequencies, but slows things down 
when they are close, as the cost of pruning outweighs the benefit of skipping. 
I'm wondering if we should then gate the pruning by checking the frequencies as 
well, but from some quick trials that check itself seems to be expensive. Do 
you have any recommendation for this scenario?



[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-04 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028
 ] 

Zach Chen commented on LUCENE-10061:


Hi [~jpountz], I've implemented a quick optimization to replace combinatorial 
calculation with an upper-bound approximation 
([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]).

With this and other bug fixes / optimizations based on CPU profiler, I was able 
to get the following performance test results (perf test index rebuilt to 
enable norm for title field, task file attached, and luceneutil integration 
available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
Run 1
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         4.64  (6.5%)                    3.30  (4.7%)  -29.0% ( -37% -  -19%)  0.000
CFQHighHigh            11.09  (6.0%)                    9.61  (6.0%)  -13.3% ( -23% -   -1%)  0.000
PKLookup              103.38  (4.4%)                  108.04  (4.3%)    4.5% (  -4% -   13%)  0.001
CFQHighMedLow          10.58  (6.1%)                   12.30  (8.7%)   16.2% (   1% -   33%)  0.000
CFQHighMed             10.70  (7.4%)                   15.51 (11.2%)   44.9% (  24% -   68%)  0.000
CFQHighLowLow           8.18  (8.2%)                   12.87 (11.6%)   57.3% (  34% -   84%)  0.000
CFQHighLow             14.57  (7.5%)                   30.81 (15.1%)  111.4% (  82% -  144%)  0.000

Run 2
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         5.33  (5.7%)                    4.02  (7.7%)  -24.4% ( -35% -  -11%)  0.000
CFQHighLowLow          17.14  (6.2%)                   13.06  (5.4%)  -23.8% ( -33% -  -13%)  0.000
CFQHighMed             17.37  (5.8%)                   14.38  (7.7%)  -17.2% ( -29% -   -3%)  0.000
PKLookup              103.57  (5.5%)                  108.84  (5.9%)    5.1% (  -6% -   17%)  0.005
CFQHighMedLow          11.25  (7.2%)                   12.70  (9.0%)   12.9% (  -3% -   31%)  0.000
CFQHighHigh             5.00  (6.2%)                    7.54 (12.1%)   51.0% (  30% -   73%)  0.000
CFQHighLow             21.60  (5.2%)                   34.57 (14.1%)   60.0% (  38% -   83%)  0.000

Run 3
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHighHigh         5.40  (6.9%)                    4.06  (5.1%)  -24.8% ( -34% -  -13%)  0.000
CFQHighMedLow           7.64  (7.4%)                    5.79  (6.3%)  -24.2% ( -35% -  -11%)  0.000
CFQHighHigh            11.11  (7.0%)                    9.60  (5.9%)  -13.6% ( -24% -    0%)  0.000
CFQHighLowLow          21.21  (7.6%)                   21.22  (6.6%)    0.0% ( -13% -   15%)  0.993
PKLookup              103.15  (5.9%)                  107.60  (6.9%)    4.3% (  -8% -   18%)  0.034
CFQHighLow             21.85  (8.1%)                   34.18 (13.5%)   56.4% (  32% -   84%)  0.000
CFQHighMed             12.07  (8.4%)                   19.98 (16.7%)   65.5% (  37% -   98%)  0.000

Run 4
Task            QPS baseline  StdDev QPS my_modified_version  StdDev  Pct diff (range)         p-value
CFQHighHigh             8.50  (5.8%)                    6.85  (5.2%)  -19.5% ( -28% -   -8%)  0.000
CFQHighMedLow          10.89  (5.7%)                    8.96  (5.4%)  -17.8% ( -27% -   -7%)  0.000
CFQHighMed              8.41  (5.8%)                    7.74  (5.6%)   -7.9% ( -18% -    3%)  0.000
CFQHighHighHigh         3.45  (6.7%)                    3.38  (5.3%)   -2.0% ( -13% -   10%)  0.287
CFQHighLowLow           7.82  (6.4%)                    8.20  (7.5%)    4.8% (  -8% -   20%)  0.030
PKLookup              103.50  (5.0%)                  110.69  (5.4%)    6.9% (  -3% -   18%)  0.000
CFQHighLow             11.46  (6.0%)                   13.16  (6.7%)   14.8% (   1% -   29%)  0.000
{code}
Overall, I think this shows that pruning is most effective when there's a 
significant difference between the terms' frequencies, but slows things down 
when they are close, as the cost of pruning outweighs the benefit of skipping. 
I'm wondering if we should then gate the pruning by checking the frequencies as 
well, but from some quick trials that check itself seems to be expensive. Do 
you have any recommendation for this scenario?

> CombinedFieldsQuery needs dynamic pruning 

[jira] [Updated] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-04 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-10061:
---
Attachment: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.






[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-10-29 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436233#comment-17436233
 ] 

Zach Chen commented on LUCENE-10061:


Thanks [~jpountz] for the pointer! I have created a spin-off task for 
luceneutil integration https://issues.apache.org/jira/browse/LUCENE-10212, and 
will actually work on it first and circle back to this task afterward. 

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.






[jira] [Updated] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery

2021-10-29 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-10212:
---
Description: 
This is a spin-off task from https://issues.apache.org/jira/browse/LUCENE-10061. 
In order to objectively evaluate performance changes for CombinedFieldsQuery, 
we would like to add a benchmark task, and the parsing to support it, for 
CombinedFieldsQuery.

One proposed query syntax to enable CombinedFieldsQuery benchmarking is the 
following:
{code:java}
taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3
{code}
 

 

 

 

 

 


> Add luceneutil benchmark task for CombinedFieldsQuery
> -
>
> Key: LUCENE-10212
> URL: https://issues.apache.org/jira/browse/LUCENE-10212
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>
> This is a spin-off task from 
> https://issues.apache.org/jira/browse/LUCENE-10061. In order to objectively 
> evaluate performance changes for CombinedFieldsQuery, we would like to add a 
> benchmark task, and the parsing to support it, for CombinedFieldsQuery.
> One proposed query syntax to enable CombinedFieldsQuery benchmarking is the 
> following:
> {code:java}
> taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3
> {code}
>  
>  
>  






[jira] [Created] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery

2021-10-29 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10212:
--

 Summary: Add luceneutil benchmark task for CombinedFieldsQuery
 Key: LUCENE-10212
 URL: https://issues.apache.org/jira/browse/LUCENE-10212
 Project: Lucene - Core
  Issue Type: Task
Reporter: Zach Chen
Assignee: Zach Chen


This is a spin-off task from https://issues.apache.org/jira/browse/LUCENE-10061. 
In order to objectively evaluate performance changes for CombinedFieldsQuery, 
we would like to add a benchmark task, and the parsing to support it, for 
CombinedFieldsQuery.

One proposed query syntax to enable CombinedFieldsQuery benchmarking is the 
following:

 
{code:java}
taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3
{code}
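To make the proposed syntax concrete, here is a minimal sketch of how one task line could be split into a task name, query terms, and per-field boosts. It is written in Java for illustration only; the class name, the default boost of 1 when no `^boost` is given, and the example field names are assumptions, and this is not the parser used by luceneutil.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the proposed task-line syntax; not luceneutil code. */
final class CombinedFieldsTaskLineSketch {
  final String taskName;
  final List<String> terms = new ArrayList<>();
  final Map<String, Float> fieldBoosts = new LinkedHashMap<>();

  private CombinedFieldsTaskLineSketch(String taskName) {
    this.taskName = taskName;
  }

  static CombinedFieldsTaskLineSketch parse(String line) {
    int colon = line.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("missing task name: " + line);
    }
    CombinedFieldsTaskLineSketch task =
        new CombinedFieldsTaskLineSketch(line.substring(0, colon).trim());
    String prefix = "+combinedFields=";
    for (String token : line.substring(colon + 1).trim().split("\\s+")) {
      if (token.startsWith(prefix)) {
        // field1^boost1,field2^boost2,...
        for (String fieldSpec : token.substring(prefix.length()).split(",")) {
          int caret = fieldSpec.indexOf('^');
          String field = caret < 0 ? fieldSpec : fieldSpec.substring(0, caret);
          // Assumption: a field without an explicit boost gets boost 1.
          float boost = caret < 0 ? 1f : Float.parseFloat(fieldSpec.substring(caret + 1));
          task.fieldBoosts.put(field, boost);
        }
      } else if (!token.isEmpty()) {
        task.terms.add(token);
      }
    }
    return task;
  }

  public static void main(String[] args) {
    CombinedFieldsTaskLineSketch task =
        parse("CFQHighLow: apache lucene +combinedFields=title^20,body^1");
    System.out.println(task.taskName + " " + task.terms + " " + task.fieldBoosts);
  }
}
```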
 

 

 






[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-10-29 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435789#comment-17435789
 ] 

Zach Chen commented on LUCENE-10061:


Thanks for the confirmation [~jpountz]! I've actually given it a try in the 
last few days and just opened a WIP PR 
[https://github.com/apache/lucene/pull/418] for it, before seeing your comment 
above.

From the results of a few samples (documented in the PR), and assuming there's 
no bug in the implementation, basic pruning does seem to help overall 
performance most when there's a significant difference in the terms' doc 
frequencies (HighLow), but it slows queries down when doc frequencies are close 
(HighHigh / HighMed), very likely because the overhead of the combinatorial 
calculation / pruning logic outweighs the benefit of skipping. I will try to 
implement your optimization idea above as well and see how it performs.

In addition, I have been searching around to see if I can leverage luceneutil 
for benchmarking, but I can't seem to find a way to express a combined fields 
query in the style of the tasks in 
[https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.tasks].
I'm wondering if you may have any pointers for that as well?

 

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.






[jira] [Commented] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-10-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429522#comment-17429522
 ] 

Zach Chen commented on LUCENE-10061:


Hi [~jpountz], I'm interested in working on this one, but have a question about 
its potential implementation and would like to get some advice on it.

I found https://issues.apache.org/jira/browse/LUCENE-8312 during research for 
this, and thought the solution should be very similar here (using merged 
impacts to prune docs that are not competitive), except for maybe how impacts 
get merged. However, while I understand for SynonymQuery, impacts can be merged 
effectively by summing term frequencies for each unique norm value as the 
impacts all come from the same field, I'm not sure how that could be done 
efficiently in the case of CombinedFieldsQuery. If I understand it correctly, 
in order to merge impacts from multiple fields for CombinedFieldsQuery, we may 
need to compute all the possible summation combinations of competitive \{freq, 
norm} across all fields, and find again the competitive ones among them. So for 
the case of 4 fields with a list of 4 competitive impacts each during impacts 
merge, in the worst case we may need to compute 4 * 4 * 4 * 4 = 256 
combinations of merged impacts (\{field1FreqA + field2FreqB + field3FreqC + 
field4FreqD, field1NormA + field2NormB + field3NormC + field4NormD}), and then 
filter out the ones that are not competitive. This seems to be inefficient.

I'm wondering if you may have any suggestion on this, or if using impacts for 
CombinedFieldsQuery pruning support is the right approach to begin with?
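The combinatorial merge described above can be sketched as follows. This is an illustration under stated assumptions, not Lucene code: the `Impact` class here is a stand-in with plain `freq`/`norm` fields, merged impacts are formed by simple summation as in the description, and an impact is treated as non-competitive when another one has a freq at least as high and a norm at least as low.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the combinatorial impact merge; not Lucene's implementation. */
final class ImpactMergeSketch {

  /** Stand-in (freq, norm) pair; names mirror the description, not Lucene's Impact class. */
  static final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }
  }

  /** Sums one impact from each field, for every possible choice (cartesian product). */
  static List<Impact> allCombinations(List<List<Impact>> perField) {
    List<Impact> merged = new ArrayList<>();
    merged.add(new Impact(0, 0L));
    for (List<Impact> fieldImpacts : perField) {
      List<Impact> next = new ArrayList<>(merged.size() * fieldImpacts.size());
      for (Impact partial : merged) {
        for (Impact impact : fieldImpacts) {
          next.add(new Impact(partial.freq + impact.freq, partial.norm + impact.norm));
        }
      }
      merged = next;
    }
    return merged;
  }

  /** Keeps impacts that no other impact dominates; exact duplicates are kept once. */
  static List<Impact> competitive(List<Impact> impacts) {
    List<Impact> kept = new ArrayList<>();
    for (Impact candidate : impacts) {
      boolean dominated = false;
      for (Impact other : impacts) {
        boolean atLeastAsGood = other.freq >= candidate.freq && other.norm <= candidate.norm;
        boolean strictlyBetter = other.freq > candidate.freq || other.norm < candidate.norm;
        if (atLeastAsGood && strictlyBetter) {
          dominated = true;
          break;
        }
      }
      if (!dominated && !containsSame(kept, candidate)) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  private static boolean containsSame(List<Impact> impacts, Impact target) {
    for (Impact impact : impacts) {
      if (impact.freq == target.freq && impact.norm == target.norm) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // 4 fields with 4 impacts each, as in the worst case described above.
    List<List<Impact>> perField = new ArrayList<>();
    for (int field = 0; field < 4; field++) {
      List<Impact> impacts = new ArrayList<>();
      for (int i = 1; i <= 4; i++) {
        impacts.add(new Impact(i, (long) i * i));
      }
      perField.add(impacts);
    }
    List<Impact> all = allCombinations(perField);
    System.out.println("combinations: " + all.size()); // 4 * 4 * 4 * 4 = 256
    System.out.println("competitive: " + competitive(all).size());
  }
}
```

The number of combinations grows as the product of the per-field list sizes, and the naive filter is quadratic in that number, which is the inefficiency the question above is concerned with.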

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.






[jira] [Commented] (LUCENE-10092) TestCheckIndex failure

2021-09-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412111#comment-17412111
 ] 

Zach Chen commented on LUCENE-10092:


Thanks Michael! I appreciate it!

> TestCheckIndex failure
> --
>
> Key: LUCENE-10092
> URL: https://issues.apache.org/jira/browse/LUCENE-10092
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>
> We got the below test failure on Elastic's CI:
> {noformat}
> 10:07:08 org.apache.lucene.index.TestCheckIndex > test suite's output saved to /var/lib/jenkins/workspace/apache+lucene-solr+main/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestCheckIndex.txt, copied below:
> 10:07:08> java.lang.AssertionError: expected:<1> but was:<3>
> 10:07:08> at __randomizedtesting.SeedInfo.seed([60A890FDD81D376A:CD05D1B3AF48278E]:0)
> 10:07:08> at org.junit.Assert.fail(Assert.java:89)
> 10:07:08> at org.junit.Assert.failNotEquals(Assert.java:835)
> 10:07:08> at org.junit.Assert.assertEquals(Assert.java:647)
> 10:07:08> at org.junit.Assert.assertEquals(Assert.java:633)
> 10:07:08> at org.apache.lucene.index.TestCheckIndex.testCheckIndexAllValid(TestCheckIndex.java:132)
> 10:07:08> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 10:07:08> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 10:07:08> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 10:07:08> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
> 10:07:08> at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
> 10:07:08> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> 10:07:08> at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> 10:07:08> at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> 10:07:08> at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> 10:07:08> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 10:07:08> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> 10:07:08> at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
> 10:07:08> at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
> 10:07:08> at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
> 10:07:08> at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
> 10:07:08> at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> 10:07:08> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> 10:07:08> at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> 10:07:08> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> 10:07:08> at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> 10:07:08> at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> 10:07:08> at 
> 

[jira] [Commented] (LUCENE-10092) TestCheckIndex failure

2021-09-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412105#comment-17412105
 ] 

Zach Chen commented on LUCENE-10092:


Sorry for (another) TestCheckIndex failure! The fix above looks good to me.

> TestCheckIndex failure
> --
>
> Key: LUCENE-10092
> URL: https://issues.apache.org/jira/browse/LUCENE-10092
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>

[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2021-09-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411723#comment-17411723
 ] 

Zach Chen commented on LUCENE-9662:
---

{quote}I think we should backport these changes, in general.  They are not 
breaking – the switch to {{CheckIndexException}} still subclasses 
{{RuntimeException}}.  There will be some Lucene users who are nervous about 
upgrading to 9.0 too soon, but would be maybe eager to upgrade to last 8.x 
release (if that's 8.10 or 8.11 or beyond).  I think it's bad if we slow down 
our rate of backporting because a major release is coming ... let's try to 
review your backport commit carefully to see if it looks OK?
{quote}
Makes sense. I think my nervousness was also partly because this change, if 
backported, might land a bit too close to the 8.10 branch cut window, but it 
seems like it's OK for us to just backport and release these changes via 8.11?
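The compatibility point in the quoted remark can be illustrated with a small sketch: code written against the old behavior, which catches {{RuntimeException}}, keeps working when a subclass is thrown instead. The exception class below is a stand-in declared only to keep the example self-contained, and the method names are hypothetical.

```java
/** Hypothetical sketch of why switching to a RuntimeException subclass is non-breaking. */
final class CheckIndexExceptionCompatSketch {

  /** Stand-in for the exception type named above; not the real Lucene class. */
  static class CheckIndexException extends RuntimeException {
    CheckIndexException(String message) {
      super(message);
    }
  }

  /** Hypothetical check that now throws the more specific exception. */
  static void checkSegment(boolean corrupt) {
    if (corrupt) {
      throw new CheckIndexException("segment failed verification");
    }
  }

  /** Caller written before the change: it only knows about RuntimeException. */
  static String legacyCaller(boolean corrupt) {
    try {
      checkSegment(corrupt);
      return "ok";
    } catch (RuntimeException e) {
      // Still reached, because CheckIndexException is a RuntimeException.
      return "caught: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(legacyCaller(false));
    System.out.println(legacyCaller(true));
  }
}
```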

For now I've created a PR to backport them against 8x here: 
https://github.com/apache/lucene-solr/pull/2567. The merge conflict resolution 
turned out to be less involved than I expected, but there is a failing test 
and I suspect some unintended code was introduced during the merge. I will dig 
in a bit more to confirm the cause.

> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 19h
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".
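As a rough illustration of the "thread per segment" idea in the description above, independent per-segment checks can be submitted to a fixed-size pool and their results collected in segment order. This is only a sketch: `checkSegment` is a hypothetical placeholder that returns a status string, and nothing here reflects how CheckIndex itself is structured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of per-segment concurrent checking; not Lucene's CheckIndex. */
final class PerSegmentCheckSketch {

  /** Placeholder for the real per-segment verification work. */
  static String checkSegment(String segmentName) {
    return segmentName + ": OK";
  }

  /** Checks every segment on its own task and returns results in segment order. */
  static List<String> checkAll(List<String> segmentNames, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String name : segmentNames) {
        Callable<String> task = () -> checkSegment(name);
        futures.add(executor.submit(task));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        try {
          results.add(future.get()); // blocks until that segment is done
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        }
      }
      return results;
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(checkAll(Arrays.asList("_0", "_1", "_2"), 2));
  }
}
```

Collecting the futures in submission order keeps the report deterministic even though the segments finish in arbitrary order.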






[jira] [Comment Edited] (LUCENE-9662) CheckIndex should be concurrent

2021-09-04 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410031#comment-17410031
 ] 

Zach Chen edited comment on LUCENE-9662 at 9/4/21, 7:28 PM:


Hi [~mikemccand], I tried to backport these changes to 8x earlier, but 
noticed that since the changes in this PR touched many places in CheckIndex (the 
replacement of *RuntimeException* with *CheckIndexException* in particular), 
and some earlier commits that also touched CheckIndex were not backported to 
8x since they were intended for the 9.0 release, the backport attempt 
resulted in many merge conflicts. Although some of the conflicts were easy to 
resolve, I'm a bit concerned that I may introduce subtle bugs when resolving 
the others, since I may not be familiar with that code.

What do you think? Would you recommend we still try to backport these changes 
to 8x?



> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 18h 10m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If we wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".






[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2021-09-04 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410031#comment-17410031
 ] 

Zach Chen commented on LUCENE-9662:
---

Hi [~mikemccand], I tried to backport these changes to 8x earlier, but 
noticed that since the changes in this PR touched many places in CheckIndex (the 
replacement of *RuntimeException* with *CheckIndexException* in particular), 
and some earlier commits that also touched CheckIndex were not backported to 
8x since they were intended for the 9.0 release, the backport attempt 
resulted in many merge conflicts. Although some of the conflicts were easy to 
resolve, I'm a bit concerned that I may introduce subtle bugs when resolving 
the others, since I may not be familiar with that code.

 

What do you think? Would you recommend we still try to backport these changes 
to 8x?

> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 18h 10m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If we wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".






[jira] [Comment Edited] (LUCENE-9662) CheckIndex should be concurrent

2021-09-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409182#comment-17409182
 ] 

Zach Chen edited comment on LUCENE-9662 at 9/3/21, 1:06 AM:


{quote}Of course, this is on [ridiculously concurrent (256 cores with 
hyperthreading) 
hardware|https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html],
 but still it is only using the default 4 concurrent threads right?  I'll add 
an annotation, and increase its concurrency some!
{quote}
Yes, it's indeed capped at 4 threads by default, and the result was indeed 
impressive with just a few more threads! On my not-so-fast 6-core MacBook Pro, 
I got about a 73% reduction in processing time when using '-threadCount 12' 
versus sequential. To increase its concurrency for the nightly benchmark, I 
assume a change can be made in 
[luceneutil|https://github.com/mikemccand/luceneutil/blob/0084387e001b426075eb828f43ad0c4e955e9280/src/python/nightlyBench.py#L695-L704]
 to pass in the flag? If so, I can open a PR for it as well!
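
For reference, here is how a time reduction translates into a speedup factor, assuming the 73% figure means the run took 27% of the sequential wall-clock time. The helper name is made up for illustration.

```java
final class SpeedupSketch {
    // Converts a fractional time reduction (0.73 for "73% less time") into a
    // speedup factor: new time = (1 - reduction) * old time.
    static double speedupFromReduction(double reduction) {
        if (reduction < 0.0 || reduction >= 1.0) {
            throw new IllegalArgumentException("reduction must be in [0, 1)");
        }
        return 1.0 / (1.0 - reduction);
    }
}
```

Under that assumption, {{speedupFromReduction(0.73)}} comes out at roughly 3.7x.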

 
{quote}Hmm, it looks like we didn't fix the {{Usage: ...}} output to advertise 
the new {{-threadCount}} option.  [~zacharymorn] could you open a quick 
followup PR?  Thanks!
{quote}
Ah yes, sorry for missing that. I've opened a PR to update it: 
[https://github.com/apache/lucene/pull/281]



> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 18h 10m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If we wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".






[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2021-09-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409182#comment-17409182
 ] 

Zach Chen commented on LUCENE-9662:
---

{quote}Of course, this is on [ridiculously concurrent (256 cores with 
hyperthreading) 
hardware|https://blog.mikemccandless.com/2021/01/apache-lucene-performance-on-128-core.html],
 but still it is only using the default 4 concurrent threads right?  I'll add 
an annotation, and increase its concurrency some!
{quote}
Yes, it's indeed capped at 4 threads by default, and the result was indeed 
impressive with just a few more threads! On my not-so-fast 6-core MacBook Pro, 
I got about a 73% reduction in processing time when using '-threadCount 12' 
versus sequential. To increase its concurrency for the nightly benchmark, I 
assume a change can be made in 
[luceneutil|https://github.com/mikemccand/luceneutil/blob/0084387e001b426075eb828f43ad0c4e955e9280/src/python/nightlyBench.py#L695-L704]
 to pass in the flag? If so, I can open a PR for it as well!
{quote}Hmm, it looks like we didn't fix the {{Usage: ...}} output to advertise 
the new {{-threadCount}} option.  [~zacharymorn] could you open a quick 
followup PR?  Thanks!
{quote}
Ah yes, sorry for missing that. I've opened a PR to update it: 
https://github.com/apache/lucene/pull/281

> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 18h 10m
>  Remaining Estimate: 0h
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If we wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".






[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors

2021-09-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409181#comment-17409181
 ] 

Zach Chen commented on LUCENE-9959:
---

Hi [~jpountz], sorry for the delay here; somehow I missed the update earlier. 
+1 for reverting the changes to unblock the 9.0 release. I've created a PR here: 
https://github.com/apache/lucene/pull/280

> Can we remove threadlocals of stored fields and term vectors
> 
>
> Key: LUCENE-9959
> URL: https://issues.apache.org/jira/browse/LUCENE-9959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> [~rmuir] suggested removing these threadlocals at 
> https://github.com/apache/lucene/pull/137#issuecomment-840111367.
> These threadlocals are trappy if you manage many segments and threads within 
> the same JVM, or worse: non-fixed threadpools. The challenge is to keep the 
> API easy to use.
> We could take advantage of 9.0 to change the stored fields API?






[jira] [Created] (LUCENE-10076) Luke test assertion failure from TestOverviewImpl

2021-08-28 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10076:
--

 Summary: Luke test assertion failure from TestOverviewImpl
 Key: LUCENE-10076
 URL: https://issues.apache.org/jira/browse/LUCENE-10076
 Project: Lucene - Core
  Issue Type: Task
  Components: luke
Reporter: Zach Chen


Found a test assertion error from main branch 
[head|https://github.com/apache/lucene/commit/e470535072edad13b994ded740bf60cd81f510ea]
   
{code:java}
org.apache.lucene.luke.models.overview.TestOverviewImpl > test suite's output 
saved to 
/Users/xichen/IdeaProjects/lucene/lucene/luke/build/test-results/test/outputs/OUTPUT-org.apache.lucene.luke.models.overview.TestOverviewImpl.txt,
 copied below:
  2> ERROR StatusLogger Could not reconfigure JMX
  2>  java.security.AccessControlException: access denied 
("javax.management.MBeanServerPermission" "createMBeanServer")
  2>    at 
java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
  2>    at 
java.base/java.security.AccessController.checkPermission(AccessController.java:897)
  2>    at 
java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:322)
  2>    at 
java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:479)
  2>    at 
org.apache.logging.log4j.core.jmx.Server.reregisterMBeansAfterReconfigure(Server.java:140)
  2>    at 
org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:629)
  2>    at 
org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:691)
  2>    at 
org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:708)
  2>    at 
org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:263)
  2>    at 
org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:153)
  2>    at 
org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:45)
  2>    at org.apache.logging.log4j.LogManager.getContext(LogManager.java:194)
  2>    at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:602)
  2>    at 
org.apache.lucene.luke.util.LoggerFactory.getLogger(LoggerFactory.java:68)
  2>    at 
org.apache.lucene.luke.models.util.IndexUtils.(IndexUtils.java:59)
  2>    at org.apache.lucene.luke.models.LukeModel.(LukeModel.java:60)
  2>    at 
org.apache.lucene.luke.models.overview.OverviewImpl.(OverviewImpl.java:49)
  2>    at 
org.apache.lucene.luke.models.overview.TestOverviewImpl.testIsOptimized(TestOverviewImpl.java:74)
  2>    at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  2>    at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  2>    at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  2>    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
  2>    at 
org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  2>    at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  2>    at 
org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  2>    at 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  2>    at 
org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  2>    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  2>    at 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  2>    at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
  2>    at 
com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
  2>    at 
com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
  2>    at 
com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
  2>    at 
org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  2>    at 

[jira] [Created] (LUCENE-10074) Remove unneeded default value assignment

2021-08-26 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10074:
--

 Summary: Remove unneeded default value assignment
 Key: LUCENE-10074
 URL: https://issues.apache.org/jira/browse/LUCENE-10074
 Project: Lucene - Core
  Issue Type: Task
Reporter: Zach Chen


This is a spin-off issue from the discussion at 
[https://github.com/apache/lucene/pull/128#discussion_r695669643], where we 
would like to see if there's an automatic checking mechanism (ecj?) that can 
be enabled to detect and warn about unneeded default value assignments in 
future changes, as well as in the existing code.
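
To illustrate what an "unneeded default value assignment" looks like, here is a small sketch. The class and field names are hypothetical and not taken from the PR; the point is only that Java already initializes fields to 0, false and null.

```java
final class DefaultValueSketch {
    // Redundant: the JVM already initializes these fields to 0, false and null.
    static class Explicit {
        int count = 0;
        boolean closed = false;
        String name = null;
    }

    // Equivalent initial state without the assignments.
    static class Implicit {
        int count;
        boolean closed;
        String name;
    }

    // Returns true when both classes start out with identical field values.
    static boolean sameInitialState() {
        Explicit e = new Explicit();
        Implicit i = new Implicit();
        return e.count == i.count && e.closed == i.closed && e.name == i.name;
    }
}
```

Note this applies to fields only; local variables have no default value and must be assigned before use.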






[jira] [Created] (LUCENE-10071) Review and refactor synchronization handling between MockDirectoryWrapper and CheckIndex

2021-08-25 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10071:
--

 Summary: Review and refactor synchronization handling between 
MockDirectoryWrapper and CheckIndex
 Key: LUCENE-10071
 URL: https://issues.apache.org/jira/browse/LUCENE-10071
 Project: Lucene - Core
  Issue Type: Task
  Components: core/index, modules/test-framework
Reporter: Zach Chen


This is a spin-off issue from the discussion in 
[https://github.com/apache/lucene/pull/128], as we noticed there's a subtle way 
to cause a deadlock in tests (or maybe even in production code if similar logic 
is implemented): [https://github.com/apache/lucene/pull/128#discussion_r642639399]. 

This issue is to review how synchronization can be improved between these 
classes to make it less deadlock-prone, or more explicit when a locking 
arrangement needs to be made.
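
As a generic illustration of this class of problem (not necessarily the exact scenario from the PR discussion), the sketch below has a caller hold a lock while waiting on a worker thread that needs the same lock. All names are hypothetical, and a timeout is used so the demo terminates instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LockHandoffSketch {
    private final Object lock = new Object();

    // Hypothetical stand-in for a synchronized method on a directory wrapper.
    int fileLength() {
        synchronized (lock) {
            return 42;
        }
    }

    // Holds the lock, hands work to another thread that needs the same lock,
    // and reports whether that worker was still blocked after the timeout.
    boolean workerBlockedWhileCallerHoldsLock(long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = this::fileLength;
            Future<Integer> result;
            boolean blocked;
            synchronized (lock) {
                result = pool.submit(task);
                try {
                    result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    blocked = false;
                } catch (TimeoutException e) {
                    blocked = true; // an untimed get() here would never return
                }
            }
            result.get(); // lock released, so the worker can now finish
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Monitors are reentrant only for the owning thread, so moving work that was previously done inline onto worker threads can turn a harmless nested call into a wait that never completes.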






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2021-08-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17399305#comment-17399305
 ] 

Zach Chen commented on LUCENE-10002:


{quote}Nice [~zacharymorn]! Quite a large change for sure! I took a look at the 
DrillSideways changes and they appear correct to me at first glance, but I'll 
see if I can spend more time going through the whole PR in the next couple of 
days.

In the meantime, I went ahead and spun off LUCENE-10050 to track making a 
similar API change to DrillSideways.
{quote}
Sounds great, thanks Greg!

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2021-08-11 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397827#comment-17397827
 ] 

Zach Chen commented on LUCENE-10002:


Hi [~jpountz] [~gsmiller], I have created a PR for this that deprecates the 
collector API in favor of the collector manager API, along with some initial 
refactoring of some tests and of the parts of DrillSideways that use 
TopScoreDocCollector & TopFieldCollector so that they use the latter API. I 
plan to submit more PRs afterward for other areas in the codebase.

Please note that I did first try to remove the collector API entirely, but that 
ended up requiring far more changes than I'm comfortable with in a 
single PR, and I also feel this API is so commonly used that users may 
prefer a more gradual deprecation / transition period. Hence I reverted my 
previous effort and adopted a phased approach.
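
For readers less familiar with the difference between the two APIs, here is a simplified, self-contained model of the collector manager pattern. The interfaces below are stand-ins written for this sketch, NOT the Lucene ones (only the {{newCollector}} / {{reduce}} shape mirrors {{CollectorManager}}), and the "slices" are plain int arrays rather than index segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CollectorManagerSketch {
    // Simplified stand-ins, not the Lucene interfaces.
    interface Collector {
        void collect(int doc);
    }

    interface CollectorManager<C extends Collector, T> {
        C newCollector();

        T reduce(Collection<C> collectors);
    }

    static final class CountingCollector implements Collector {
        int count;

        public void collect(int doc) {
            count++;
        }
    }

    static final class CountingManager implements CollectorManager<CountingCollector, Integer> {
        public CountingCollector newCollector() {
            return new CountingCollector();
        }

        public Integer reduce(Collection<CountingCollector> collectors) {
            int total = 0;
            for (CountingCollector c : collectors) {
                total += c.count;
            }
            return total;
        }
    }

    // Each slice gets its own collector, so no collector is shared across threads.
    static <C extends Collector, T> T search(
            List<int[]> slices, CollectorManager<C, T> manager, ExecutorService executor) {
        List<C> collectors = new ArrayList<>();
        List<Future<?>> pending = new ArrayList<>();
        for (int[] slice : slices) {
            C collector = manager.newCollector();
            collectors.add(collector);
            Runnable task = () -> {
                for (int doc : slice) {
                    collector.collect(doc);
                }
            };
            pending.add(executor.submit(task));
        }
        try {
            for (Future<?> f : pending) {
                f.get(); // wait for every slice before merging
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return manager.reduce(collectors);
    }

    // Counts five documents spread over two slices using two threads.
    static int countAcrossSlices() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            List<int[]> slices = Arrays.asList(new int[] {0, 1, 2}, new int[] {3, 4});
            return search(slices, new CountingManager(), executor);
        } finally {
            executor.shutdown();
        }
    }
}
```

A single shared collector cannot be driven from several threads safely, which is why a plain collector argument forces the search onto the caller thread, whereas a manager can hand out one collector per slice and merge afterwards.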

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors

2021-07-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386614#comment-17386614
 ] 

Zach Chen commented on LUCENE-9959:
---

{quote}I had put it on hold to see whether we should explore changing the API 
like you did rather than still caching stored fields readers per thread but 
removing as much state as possible like my PR does.
{quote}
I see. Thanks for the clarification!
{quote}If the new API proves controversial, I'd be open to an alternative that 
would consist of keeping the previous API and pulling a new TermVectorsReader 
(resp. StoredFieldsReader) internally every time that term vectors (resp. 
stored fields) are requested instead of the previous approach that consisted of 
caching instances in a threadlocal.
{quote}
+1. Do we want to try this different approach for stored fields, and see how it 
compares with the new API for term vectors (which may create an inconsistency 
between the APIs for the two, but hopefully only temporarily)?
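
To make the two options concrete, here is a rough sketch contrasting the threadlocal caching approach with pulling a new reader on every request. {{Reader}} is a hypothetical stand-in for a stateful reader such as a stored fields or term vectors reader; none of these classes are Lucene code, and the counters exist only to make the difference observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ReaderAccessSketch {
    // Hypothetical stand-in for a stateful, non-thread-safe reader.
    static final class Reader {
        String document(int docId) {
            return "doc-" + docId;
        }
    }

    // Previous approach as described in the thread: one cached instance per
    // thread, which stays reachable for as long as the thread does.
    static final class ThreadLocalAccess {
        final AtomicInteger readersCreated = new AtomicInteger();
        private final ThreadLocal<Reader> cached = ThreadLocal.withInitial(() -> {
            readersCreated.incrementAndGet();
            return new Reader();
        });

        String document(int docId) {
            return cached.get().document(docId);
        }
    }

    // Alternative: same method signature, but a fresh reader per request and
    // no per-thread state left behind.
    static final class PerCallAccess {
        final AtomicInteger readersCreated = new AtomicInteger();

        String document(int docId) {
            readersCreated.incrementAndGet();
            return new Reader().document(docId);
        }
    }
}
```

The per-call variant trades some allocation on each request for not pinning reader state to threads, which is the concern raised for many segments and non-fixed thread pools.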

> Can we remove threadlocals of stored fields and term vectors
> 
>
> Key: LUCENE-9959
> URL: https://issues.apache.org/jira/browse/LUCENE-9959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> [~rmuir] suggested removing these threadlocals at 
> https://github.com/apache/lucene/pull/137#issuecomment-840111367.
> These threadlocals are trappy if you manage many segments and threads within 
> the same JVM, or worse: non-fixed threadpools. The challenge is to keep the 
> API easy to use.
> We could take advantage of 9.0 to change the stored fields API?






[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors

2021-07-13 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379654#comment-17379654
 ] 

Zach Chen commented on LUCENE-9959:
---

Hi [~jpountz], I've merged the PR for term vectors thread local removal, and 
plan to take on the stored fields one next. I noticed that your original PR 
[https://github.com/apache/lucene/pull/137], which led to this Jira and also 
touched on stored fields, has not been merged yet. Do you plan to merge it any 
time soon, or will you have more changes for it?

> Can we remove threadlocals of stored fields and term vectors
> 
>
> Key: LUCENE-9959
> URL: https://issues.apache.org/jira/browse/LUCENE-9959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> [~rmuir] suggested removing these threadlocals at 
> https://github.com/apache/lucene/pull/137#issuecomment-840111367.
> These threadlocals are trappy if you manage many segments and threads within 
> the same JVM, or worse: non-fixed threadpools. The challenge is to keep the 
> API easy to use.
> We could take advantage of 9.0 to change the stored fields API?






[jira] [Commented] (LUCENE-10018) Remove Fields from TermVector reader related usage

2021-07-13 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379647#comment-17379647
 ] 

Zach Chen commented on LUCENE-10018:


Hi [~dsmiley], just to provide a quick update, I've merged the TermVectors PR 
for LUCENE-9959.  

> Remove Fields from TermVector reader related usage
> --
>
> Key: LUCENE-10018
> URL: https://issues.apache.org/jira/browse/LUCENE-10018
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/index
>Reporter: Zach Chen
>Assignee: David Smiley
>Priority: Minor
>
> This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for 
> Fields class deprecation / removal in TermVector reader usage. As the Fields 
> class is generally meant as an internal class reserved for the posting index, we 
> would like to have some dedicated TermVector abstractions and APIs instead. 
> The relevant discussions are available here:
>  * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863254651]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863262562]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863775298]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-864720190]
>  * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-871155896]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-871922823]
>  
> One potential API design for this can be found here 
> [https://github.com/apache/lucene/pull/180#issuecomment-871155896] 






[jira] [Commented] (LUCENE-10018) Remove Fields from TermVector reader related usage

2021-07-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373895#comment-17373895
 ] 

Zach Chen commented on LUCENE-10018:


Sounds good, thanks David!

> Remove Fields from TermVector reader related usage
> --
>
> Key: LUCENE-10018
> URL: https://issues.apache.org/jira/browse/LUCENE-10018
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs, core/index
>Reporter: Zach Chen
>Assignee: David Smiley
>Priority: Minor
>
> This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for 
> Fields class deprecation / removal in TermVector reader usage. As the Fields 
> class is generally meant as an internal class reserved for the posting index, we 
> would like to have some dedicated TermVector abstractions and APIs instead. 
> The relevant discussions are available here:
>  * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863254651]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863262562]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-863775298]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-864720190]
>  * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-871155896]
>  * [https://github.com/apache/lucene/pull/180#issuecomment-871922823]
>  
> One potential API design for this can be found here 
> [https://github.com/apache/lucene/pull/180#issuecomment-871155896] 






[jira] [Created] (LUCENE-10018) Remove Fields from TermVector reader related usage

2021-07-01 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10018:
--

 Summary: Remove Fields from TermVector reader related usage
 Key: LUCENE-10018
 URL: https://issues.apache.org/jira/browse/LUCENE-10018
 Project: Lucene - Core
  Issue Type: Task
  Components: core/codecs, core/index
Reporter: Zach Chen


This is a spin-off issue from [https://github.com/apache/lucene/pull/180] for 
Fields class deprecation / removal in TermVector reader usage. As the Fields class 
is generally meant as an internal class reserved for the posting index, we would like 
to have some dedicated TermVector abstractions and APIs instead. The relevant 
discussions are available here:
 * [https://github.com/apache/lucene/pull/180#pullrequestreview-686320076]
 * [https://github.com/apache/lucene/pull/180#issuecomment-863254651]
 * [https://github.com/apache/lucene/pull/180#issuecomment-863262562]
 * [https://github.com/apache/lucene/pull/180#issuecomment-863775298]
 * [https://github.com/apache/lucene/pull/180#issuecomment-864720190]
 * [https://github.com/apache/lucene/pull/180#pullrequestreview-688023901]
 * [https://github.com/apache/lucene/pull/180#issuecomment-871155896]
 * [https://github.com/apache/lucene/pull/180#issuecomment-871922823]

 

One potential API design for this can be found here 
[https://github.com/apache/lucene/pull/180#issuecomment-871155896] 






[jira] [Commented] (LUCENE-9959) Can we remove threadlocals of stored fields and term vectors

2021-06-12 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362392#comment-17362392
 ] 

Zach Chen commented on LUCENE-9959:
---

I took a look at this issue and the idea suggested by Robert (and at 
https://issues.apache.org/jira/browse/LUCENE-1195, which seems to have 
introduced the thread locals originally), and gave it a try with this WIP PR 
[https://github.com/apache/lucene/pull/180] (commit 
[https://github.com/apache/lucene/commit/5062e4d69938f104b461004022e19c10a65960a5]
 has the most meaningful changes). Is the implementation what you were 
expecting? Since `IndexReader` already has the APIs _getTermVectors_ and 
_getTermVector_, I feel it might not be too bad to add a new API alongside 
them, and gradually phase out the use of the existing two (at least for term 
vectors)?

In addition, I'm wondering why the other readers in SegmentReader don't 
need the same thread-local approach for concurrency / caching (namely 
PointsReader, NormsProducer, DocValuesProducer, VectorReader and 
FieldsProducer). I'm guessing these readers' operations might be much less 
costly than those of the term vectors reader and stored fields reader, 
so their operations are made thread-safe internally? I'll dig around to 
understand more about the context there...
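
For readers following along, here is a rough sketch of the two access styles being weighed in this issue. Everything in it is hypothetical: FakeReader, ThreadLocalStyle and ExplicitStyle are illustrative stand-ins and not Lucene classes, and the counting is only there to make the number of reader copies visible. It assumes the reader is stateful and must not be shared across threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TermVectorsAccessSketch {

  /** Hypothetical stand-in for a stateful reader that must not be shared across threads. */
  static final class FakeReader {
    private final AtomicInteger copies;

    FakeReader(AtomicInteger copies) {
      this.copies = copies;
    }

    /** Plays the role of clone(): every copy is counted. */
    FakeReader copy() {
      copies.incrementAndGet();
      return new FakeReader(copies);
    }

    String get(int docId) {
      return "term-vectors-of-doc-" + docId;
    }
  }

  /** Style 1: one hidden copy per calling thread, cached in a ThreadLocal. */
  static final class ThreadLocalStyle {
    final AtomicInteger copies = new AtomicInteger();
    private final FakeReader original = new FakeReader(copies);
    private final ThreadLocal<FakeReader> perThread = ThreadLocal.withInitial(original::copy);

    String getTermVectors(int docId) {
      return perThread.get().get(docId);
    }
  }

  /** Style 2: the caller asks for a reader explicitly and decides how long to keep it. */
  static final class ExplicitStyle {
    final AtomicInteger copies = new AtomicInteger();
    private final FakeReader original = new FakeReader(copies);

    FakeReader newReader() {
      return original.copy();
    }
  }

  /** Hits the ThreadLocal style from short-lived threads and returns the copies made. */
  static int copiesWithThreadLocal(int threads) {
    ThreadLocalStyle style = new ThreadLocalStyle();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        style.getTermVectors(1);
        style.getTermVectors(2); // reuses this thread's copy
      });
      workers[i].start();
    }
    try {
      for (Thread worker : workers) {
        worker.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return style.copies.get();
  }

  public static void main(String[] args) {
    System.out.println("ThreadLocal style, 8 threads: " + copiesWithThreadLocal(8) + " copies");
    ExplicitStyle explicit = new ExplicitStyle();
    FakeReader reader = explicit.newReader();
    reader.get(1);
    reader.get(2);
    System.out.println("Explicit style, 1 caller: " + explicit.copies.get() + " copy");
  }
}
```

In the first style the number of copies grows with the number of threads that ever touch the reader, which is the concern with non-fixed thread pools; in the second the caller pays for exactly the readers it requests, at the cost of a slightly less convenient API.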

> Can we remove threadlocals of stored fields and term vectors
> 
>
> Key: LUCENE-9959
> URL: https://issues.apache.org/jira/browse/LUCENE-9959
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> [~rmuir] suggested removing these threadlocals at 
> https://github.com/apache/lucene/pull/137#issuecomment-840111367.
> These threadlocals are trappy if you manage many segments and threads within 
> the same JVM, or worse: non-fixed threadpools. The challenge is to keep the 
> API easy to use.
> We could take advantage of 9.0 to change the stored fields API?






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-09 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360555#comment-17360555
 ] 

Zach Chen commented on LUCENE-9976:
---

{quote}[~zacharymorn] I believe that the same problem exists on branch_8x and 
branch_8_9, let's backport your fix?
{quote}
Ah yes! I've opened two new PRs for backporting:
 # branch_8x: [https://github.com/apache/lucene-solr/pull/2512] (with a small 
comment) 
 # branch_8_9: [https://github.com/apache/lucene-solr/pull/2511]

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Zach Chen
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Resolved] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-09 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen resolved LUCENE-9976.
---
Resolution: Fixed

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Zach Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Assigned] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-09 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen reassigned LUCENE-9976:
-

Assignee: Zach Chen

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Assignee: Zach Chen
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-07 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359001#comment-17359001
 ] 

Zach Chen commented on LUCENE-9976:
---

No worries [~jpountz], and I hope you had a great vacation! I'm looking forward 
to mine coming up in a few weeks! :D
{quote}It's a bit worrying that this bug only got caught by 
TestExpressionSorts, I wonder why the test cases we have in TestWANDScorer 
didn't catch it.
{quote}
That's a great call. I played around with the tests there a bit and came up 
with one new test that fails around 80% of the time without the fix (not sure 
whether clause ordering or some other randomness kicks in). From that, I 
think the _ConstantScoreQuery_ used heavily in those tests might have masked 
the issue a bit?

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-06 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358027#comment-17358027
 ] 

Zach Chen commented on LUCENE-9976:
---

For the time being, I've gone ahead and created a PR to update the assertion: 
https://github.com/apache/lucene/pull/171

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Comment Edited] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-03 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357046#comment-17357046
 ] 

Zach Chen edited comment on LUCENE-9976 at 6/4/21, 4:13 AM:


{quote}I'm using mac, and trying with main branch head commit a6cf46dad
{quote}
Okay, I should have pulled the latest main branch before running the 
tests; after doing that, I'm also able to consistently reproduce this failure. 
Sorry for the confusion earlier!

The failure happened at this line: 
{code:java}
assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore{code}
I reset to earlier commits a few times to see where it started to fail, and I 
believe it started with the performance regression fix commit 820e63d2ddf235c 
from https://issues.apache.org/jira/browse/LUCENE-9958 . The change was:
{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
index f33af6b8ee8..f5bab49fb71 100644
--- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
+++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
@@ -548,7 +548,7 @@ final class WANDScorer extends Scorer {
 
   /** Insert an entry in 'tail' and evict the least-costly scorer if full. */
   private DisiWrapper insertTailWithOverFlow(DisiWrapper s) {
-    if (tailMaxScore + s.maxScore < minCompetitiveScore) {
+    if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) {
       // we have free room for this new entry
       addTail(s);
       tailMaxScore += s.maxScore;
{code}
With this change, I think _tailMaxScore >= minCompetitiveScore_ can now 
legitimately happen, since the block may also be entered via the condition 
_tailSize + 1 < minShouldMatch_. So the assertion should be updated to the 
following (tested locally, and the test passes):
{code:java}
assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore || tailSize < minShouldMatch{code}
I can raise a quick PR if that looks good?  [~jpountz]
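
To make the reasoning concrete, here is a small standalone sketch of the tail bookkeeping. It is a simplified model and not the WANDScorer code: the class name, the plain long scores, and the omitted eviction path are all assumptions made for illustration. Only the "free room" condition and the two assertion variants are taken from the snippets above.

```java
public class WandTailInvariantSketch {
  // Simplified stand-ins for the scorer's state.
  final long minCompetitiveScore;
  final int minShouldMatch;
  long tailMaxScore;
  int tailSize;

  WandTailInvariantSketch(long minCompetitiveScore, int minShouldMatch) {
    this.minCompetitiveScore = minCompetitiveScore;
    this.minShouldMatch = minShouldMatch;
  }

  /** Mirrors the "free room" condition after the change; returns true if the entry was added. */
  boolean tryAddToTail(long maxScore) {
    if (tailMaxScore + maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) {
      tailMaxScore += maxScore;
      tailSize++;
      return true;
    }
    return false; // the real scorer would evict the least-costly entry here
  }

  /** The assertion as it was before the proposed update. */
  boolean oldInvariant() {
    return minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore;
  }

  /** The proposed assertion, which also accepts a tail that is still below minShouldMatch. */
  boolean newInvariant() {
    return oldInvariant() || tailSize < minShouldMatch;
  }

  public static void main(String[] args) {
    WandTailInvariantSketch state = new WandTailInvariantSketch(10, 3);
    state.tryAddToTail(8); // admitted by the score condition: 0 + 8 < 10
    state.tryAddToTail(7); // admitted only by the minShouldMatch condition: 1 + 1 < 3
    System.out.println("tailMaxScore = " + state.tailMaxScore); // 15
    System.out.println("old invariant holds: " + state.oldInvariant()); // false
    System.out.println("new invariant holds: " + state.newInvariant()); // true
  }
}
```

The second insertion is the interesting one: it is admitted purely because the tail holds fewer than minShouldMatch entries, which pushes tailMaxScore past minCompetitiveScore and trips the old assertion.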



> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-03 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357046#comment-17357046
 ] 

Zach Chen commented on LUCENE-9976:
---

{quote}I'm using mac, and trying with main branch head commit a6cf46dad
{quote}
Okay, I should have pulled the latest main branch before running the 
tests; after doing that, I'm also able to consistently reproduce this failure. 
Sorry for the confusion earlier!

The failure happened at this line: 
{code:java}
assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore{code}
I reset to earlier commits a few times to see where it started to fail, and I 
believe it started with the performance regression fix commit 820e63d2ddf235c 
from https://issues.apache.org/jira/browse/LUCENE-9958 . The change was:
{code:java}
diff --git a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
index f33af6b8ee8..f5bab49fb71 100644
--- a/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
+++ b/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java
@@ -548,7 +548,7 @@ final class WANDScorer extends Scorer {
 
   /** Insert an entry in 'tail' and evict the least-costly scorer if full. */
   private DisiWrapper insertTailWithOverFlow(DisiWrapper s) {
-    if (tailMaxScore + s.maxScore < minCompetitiveScore) {
+    if (tailMaxScore + s.maxScore < minCompetitiveScore || tailSize + 1 < minShouldMatch) {
       // we have free room for this new entry
       addTail(s);
       tailMaxScore += s.maxScore;
{code}
With this change, I think _tailMaxScore >= minCompetitiveScore_ can now 
legitimately happen, since the block may also be entered via the condition 
_tailSize + 1 < minShouldMatch_. So the assertion should be updated to the 
following (tested locally, and the test passes):
{code:java}
assert minCompetitiveScore == 0 || tailMaxScore < minCompetitiveScore || tailSize < minShouldMatch{code}
I can raise a quick PR if that looks good?  [~jpountz]

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-02 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356092#comment-17356092
 ] 

Zach Chen commented on LUCENE-9976:
---

Hi Dawid and Michael! I tried the command line above again with 1000 
iterations, but for some reason it still didn't reproduce for me.
{code:java}
xichen@Xis-MacBook-Pro lucene % ./gradlew test -Ptests.iters=1000 --tests 
TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.asserts=true -p lucene/expressions/
Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be 
reused, use --status for details


> Task :randomizationInfo
Running tests with randomization seed: tests.seed=FF571CE915A0955


> Task :lucene:expressions:test
:lucene:expressions:test (SUCCESS): 1000 test(s)
The slowest tests (exceeding 500 ms) during this run:
   6.62s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:159F353910AC3564]} (:lucene:expressions)
   6.56s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:993EFB36FB8A23F3]} (:lucene:expressions)
   6.22s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:C9E931CFB8A6C82E]} (:lucene:expressions)
   6.21s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:2854FA7396FAF62F]} (:lucene:expressions)
   5.84s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:5515E173B4FD16BA]} (:lucene:expressions)
   5.65s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:A8C1890BB457C90F]} (:lucene:expressions)
   5.62s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:A44F7F3F8B79B2DB]} (:lucene:expressions)
   5.57s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:328FA3364F99C839]} (:lucene:expressions)
   5.56s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:9D8BCE5B3371B6E2]} (:lucene:expressions)
   5.55s TestExpressionSorts.testQueries 
{seed=[FF571CE915A0955:2E635F6265446CED]} (:lucene:expressions)
The slowest suites (exceeding 1s) during this run:
  2662.21s TestExpressionSorts (:lucene:expressions)


BUILD SUCCESSFUL in 45m 1s{code}

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Commented] (LUCENE-9976) WANDScorer assertion error in ensureConsistent

2021-06-01 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355476#comment-17355476
 ] 

Zach Chen commented on LUCENE-9976:
---

Hmm this test actually passed for me:
{code:java}
xichen@Xis-MacBook-Pro lucene % ./gradlew test --tests 
TestExpressionSorts.testQueries -Dtests.seed=FF571CE915A0955 
-Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true 
-Dtests.asserts=true -p lucene/expressions/
Starting a Gradle Daemon, 7 busy and 18 incompatible Daemons could not be 
reused, use --status for details


> Task :randomizationInfo
Running tests with randomization seed: tests.seed=FF571CE915A0955


BUILD SUCCESSFUL in 37s
{code}
I'm using mac, and trying with main branch head commit a6cf46dad

> WANDScorer assertion error in ensureConsistent
> --
>
> Key: LUCENE-9976
> URL: https://issues.apache.org/jira/browse/LUCENE-9976
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Dawid Weiss
>Priority: Major
>
> Build fails and is reproducible:
> https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-main/283/console
> {code}
> ./gradlew test --tests TestExpressionSorts.testQueries 
> -Dtests.seed=FF571CE915A0955 -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true -Dtests.asserts=true -p lucene/expressions/
> {code}






[jira] [Created] (LUCENE-9984) Make CheckIndex doChecksumsOnly / -fast as default

2021-05-31 Thread Zach Chen (Jira)
Zach Chen created LUCENE-9984:
-

             Summary: Make CheckIndex doChecksumsOnly / -fast as default
                 Key: LUCENE-9984
                 URL: https://issues.apache.org/jira/browse/LUCENE-9984
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/index
    Affects Versions: 9.0
            Reporter: Zach Chen
            Assignee: Zach Chen


This issue is a spin-off from discussion in 
https://github.com/apache/lucene/pull/128

Currently _CheckIndex_ defaults to checking both the checksum and the content 
of each segment file for correctness, and requires the _-fast_ flag to be 
passed explicitly to verify checksums only. However, this default dates from a 
time when the checksum feature did not exist, and it is slow for most end 
users nowadays, who probably only care about their indices being intact 
(free from random bit flips, for example).

This issue is to change the default settings for CheckIndex so that they are 
more appropriate for end users. One proposal from @rmuir is the following:
 # Make {{-fast}} the new default.
 # Move the previous {{-slow}} to {{-slower}}.
 # Activate the current default behavior (checksum + segment file content, 
i.e. the slow check) with {{-slow}}.
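
As a rough illustration of the proposed flag mapping (not the CheckIndex implementation), the three levels could be resolved from the command line as sketched below. The class name, the Level enum and its constants are hypothetical names chosen for this sketch; only the flag names {{-fast}}, {{-slow}} and {{-slower}} come from the proposal.

```java
public class CheckLevelSketch {

  /** Hypothetical check levels; the names are illustrative only. */
  enum Level {
    CHECKSUMS_ONLY,        // what -fast does today, proposed as the new default
    CHECKSUMS_AND_CONTENT, // today's default behavior, proposed to require -slow
    EXTRA_SLOW_CHECKS      // whatever -slow enables today, proposed to move to -slower
  }

  /** Resolves the level from command-line arguments; the last level flag wins. */
  static Level parse(String... args) {
    Level level = Level.CHECKSUMS_ONLY; // proposed new default
    for (String arg : args) {
      switch (arg) {
        case "-fast":
          level = Level.CHECKSUMS_ONLY;
          break;
        case "-slow":
          level = Level.CHECKSUMS_AND_CONTENT;
          break;
        case "-slower":
          level = Level.EXTRA_SLOW_CHECKS;
          break;
        default:
          break; // other options (index path, etc.) are ignored in this sketch
      }
    }
    return level;
  }

  public static void main(String[] args) {
    System.out.println(parse("/path/to/index"));            // CHECKSUMS_ONLY
    System.out.println(parse("-slow", "/path/to/index"));   // CHECKSUMS_AND_CONTENT
    System.out.println(parse("-slower", "/path/to/index")); // EXTRA_SLOW_CHECKS
  }
}
```

Keeping {{-fast}} as an accepted flag in such a scheme would let existing scripts keep working even though it becomes a no-op.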






[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349941#comment-17349941
 ] 

Zach Chen edited comment on LUCENE-9335 at 5/23/21, 7:13 AM:
-

Hi [~jpountz], I've tried out a few ideas in the last few days and they gave 
some improvements (but also made OrMedMedMedMedMed worse). However, it still 
did not perform as well as BMW on the MSMARCO passages dataset. The ideas I 
tried include:
 # Move a scorer from the essential to the non-essential list when 
minCompetitiveScore increases (mentioned in the paper)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Use score.score instead of maxScore for candidate doc evaluation against 
minCompetitiveScore to prune more docs (reverting your previous optimization)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Reduce the maxScore contribution from the non-essential list during 
candidate doc evaluation for scorers that cannot match 
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d]
 # Use the maximum of each scorer's upTo for the maxScore boundary instead of 
the minimum (contrary to what the paper suggested) 
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188]
 ## This causes OrMedMedMedMedMed to degrade by 40%
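
For readers less familiar with the essential / non-essential split behind idea 1, here is a minimal sketch of the partitioning rule as it is usually described for MaxScore-style algorithms. It is not the code from the PR: the class and method names are hypothetical, and it assumes a doc is competitive only when its score reaches minCompetitiveScore.

```java
import java.util.Arrays;

public class EssentialPartitionSketch {

  /**
   * Returns how many scorers, taken in increasing order of max score, are non-essential:
   * their summed max scores stay below minCompetitiveScore, so a doc that matches only
   * these scorers cannot be competitive.
   */
  static int countNonEssential(double[] maxScores, double minCompetitiveScore) {
    double[] sorted = maxScores.clone();
    Arrays.sort(sorted);
    double sum = 0;
    int nonEssential = 0;
    for (double maxScore : sorted) {
      if (sum + maxScore >= minCompetitiveScore) {
        break; // this scorer and all remaining ones are essential
      }
      sum += maxScore;
      nonEssential++;
    }
    return nonEssential;
  }

  public static void main(String[] args) {
    double[] maxScores = {4.0, 1.0, 2.0};
    // As minCompetitiveScore rises, more scorers move to the non-essential list.
    System.out.println(countNonEssential(maxScores, 0.5)); // 0
    System.out.println(countNonEssential(maxScores, 3.5)); // 2
    System.out.println(countNonEssential(maxScores, 8.0)); // 3
  }
}
```

Only essential scorers need to drive iteration; the non-essential ones are consulted just to finish scoring a candidate, which is why re-running the partition whenever minCompetitiveScore increases can reduce the number of candidates.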

Collectively, these gave a 70~90% performance boost to OrHighHigh, 60~150% to 
OrHighMed, and a smaller improvement to AndHighOrMedMed, but at the expense of 
OrMedMedMedMedMed performance (-40% with the #4 changes).

For the MSMARCO passages dataset, they now give the following results (the 
benchmark is modified slightly from your version to show more percentiles, and 
to add commas as digit separators for readability):

*BMW Scorer*
{code:java}
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894{code}
*BMM Scorer*
{code:java}
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
{code}
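
For reference, a report in the shape shown above (average, several percentiles, comma-separated digits) can be produced along the following lines. This is only a sketch and not the attached MSMarcoPassages.java: the class and method names are hypothetical, and it assumes nearest-rank percentiles, which may differ from what the benchmark actually uses.

```java
import java.util.Arrays;
import java.util.Locale;

public class PercentileReportSketch {

  /** Nearest-rank percentile over an ascending-sorted, non-empty array. */
  static long percentile(long[] sorted, int p) {
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Builds an AVG / P25 ... P99 report; the label is prefixed to every line. */
  static String report(String label, long[] values) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    double avg = Arrays.stream(sorted).average().orElse(0);
    StringBuilder sb = new StringBuilder();
    // "%," asks the formatter to insert grouping separators (commas for Locale.US).
    sb.append(String.format(Locale.US, "%sAVG: %,.3f%n", label, avg));
    for (int p : new int[] {25, 50, 75, 90, 95, 99}) {
      sb.append(String.format(Locale.US, "%sP%d: %,d%n", label, p, percentile(sorted, p)));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    long[] sample = {4000, 1000, 3000, 2000};
    System.out.print(report("Collected ", sample));
  }
}
```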
I've also attached "JFR result for BMM scorer with optimizations May 22", the 
profiling result for the BMM scorer with the latest changes. Overall, it seems 
that the larger number of docs collected by BMM is becoming a performance 
bottleneck: around 50% of the computation was spent in 
SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing 
scores for candidate docs, and around 34% was spent finding the next doc in 
BlockMaxMaxscoreScorer#nextDoc. If there's a way to prune more docs faster, it 
should improve BMM further.



[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17349941#comment-17349941
 ] 

Zach Chen commented on LUCENE-9335:
---

Hi [~jpountz], I've tried out a few ideas in the last few days and they gave 
some improvements (but also made OrMedMedMedMedMed worse). However, it still 
did not perform as well as BMW on the MSMARCO passages dataset. The ideas I 
tried include:
 # Move a scorer from the essential to the non-essential list when 
minCompetitiveScore increases (mentioned in the paper)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Use score.score instead of maxScore for candidate doc evaluation against 
minCompetitiveScore to prune more docs (reverting your previous optimization)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Reduce the maxScore contribution from the non-essential list during 
candidate doc evaluation for scorers that cannot match 
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d]
 # Use the maximum of each scorer's upTo for the maxScore boundary instead of 
the minimum (contrary to what the paper suggested) 
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188]
 ## This causes OrMedMedMedMedMed to degrade by 40%

Collectively, these gave a 70~90% performance boost to OrHighHigh, 60~150% to 
OrHighMed, and a smaller improvement to AndHighOrMedMed, but at the expense of 
OrMedMedMedMedMed performance (-40% with the #4 changes).

For the MSMARCO passages dataset, they now give the following results (the 
benchmark is modified slightly from your version to show more percentiles, and 
to add commas as digit separators for readability):

*BMW Scorer*

 
{code:java}
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894{code}
 

*BMM Scorer*

 
{code:java}
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
{code}
 

I've also attached "JFR result for BMM scorer with optimizations May 22", the 
profiling result for the BMM scorer with the latest changes. Overall, it seems 
that the larger number of docs collected by BMM is becoming a performance 
bottleneck: around 50% of the computation was spent in 
SimpleTopScoreDocCollector#collect / BlockMaxMaxscoreScorer#score computing 
scores for candidate docs, and around 34% was spent finding the next doc in 
BlockMaxMaxscoreScorer#nextDoc. If there's a way to prune more docs faster, it 
should improve BMM further.

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: JFR result for BMM scorer with optimizations May 22.png, 
> MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-23 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-9335:
--
Attachment: JFR result for BMM scorer with optimizations May 22.png







[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17348046#comment-17348046
 ] 

Zach Chen commented on LUCENE-9335:
---

{quote}Actually this matches my expectation. BMM and BMW differ in that BMM 
only makes a decision about which scorers lead iteration once per block, while 
BMW needs to make decisions on every document. So BMM collects more documents 
than BMW but BMW takes the risk that trying to be too smart makes things slower 
than a simpler approach.
{quote}
OK, I also took a further look at the TopDocsCollector code and confirmed that 
my earlier understanding of "collect" and "hit count" was incorrect. This (and 
Michael's earlier response) makes total sense now!
{quote}Yes. You can download the "Collection" and "Queries" files from 
[https://microsoft.github.io/msmarco/#ranking] (make sure to accept terms at 
the top first so that download links are active).
{quote}
Thanks! I was able to download them. Will explore a bit more to see how they 
can be improved further.







[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347362#comment-17347362
 ] 

Zach Chen edited comment on LUCENE-9335 at 5/19/21, 7:25 AM:
-

{quote}The speedup for some of the slower queries looks great. I know Fuzzy1 
and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe 
your change makes them faster?
{quote}
Ah not sure why I didn't think of running them through BMM earlier! I just gave 
them a run, and got the following results:

*BMM Scorer*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           30.46  (24.7%)                    17.63  (11.6%)  -42.1% ( -62% -   -7%)  0.000
Fuzzy2           21.61  (16.4%)                    16.28  (12.0%)  -24.7% ( -45% -    4%)  0.000
PKLookup        216.72   (4.1%)                   215.63   (3.0%)   -0.5% (  -7% -    6%)  0.654
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           30.58   (9.1%)                    22.12   (6.4%)  -27.7% ( -39% -  -13%)  0.000
Fuzzy2           36.07  (12.7%)                    27.05  (10.8%)  -25.0% ( -42% -   -1%)  0.000
PKLookup        215.26   (3.4%)                   213.99   (2.5%)   -0.6% (  -6% -    5%)  0.530
{code}
*BMMBulkScorer without window (with the above scorer implementation)*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           16.32  (22.6%)                    15.68  (16.3%)   -3.9% ( -34% -   45%)  0.527
Fuzzy1           48.11  (17.6%)                    47.48  (13.6%)   -1.3% ( -27% -   36%)  0.791
PKLookup        213.67   (3.2%)                   212.52   (4.0%)   -0.5% (  -7% -    6%)  0.640
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           26.99  (23.2%)                    24.75  (13.6%)   -8.3% ( -36% -   37%)  0.169
PKLookup        216.27   (4.3%)                   216.43   (3.4%)    0.1% (  -7% -    8%)  0.951
Fuzzy1           19.01  (24.2%)                    20.01  (14.2%)    5.3% ( -26% -   57%)  0.400
{code}
*BMMBulkScorer with window size 1024*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           23.56  (26.0%)                    19.08  (13.9%)  -19.0% ( -46% -   28%)  0.004
Fuzzy1           30.97  (31.6%)                    25.82  (16.9%)  -16.6% ( -49% -   46%)  0.038
PKLookup        213.23   (2.5%)                   211.63   (1.8%)   -0.7% (  -5% -    3%)  0.289
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           20.59  (12.1%)                    20.59  (10.5%)   -0.0% ( -20% -   25%)  0.994
PKLookup        205.21   (3.1%)                   206.99   (3.7%)    0.9% (  -5% -    7%)  0.422
Fuzzy2           30.74  (22.7%)                    32.71  (17.0%)    6.4% ( -27% -   59%)  0.311
{code}
 

These results actually look strange to me, as I would expect the BulkScorer 
without a window to perform similarly to the scorer, since it just uses the 
scorer implementation under the hood. I'll need to dive into it more to 
understand what contributed to these differences (their JFR CPU recordings 
look similar too).

From the results I got now, it seems BMM may not be ideal for handling queries 
with many terms. My high-level guess is that with these queries, which can be 
rewritten into boolean queries with ~50 terms, BMM may spend lots of time 
computing upTo and updating maxScore, as the minimum of all the scorers' block 
boundaries is used to update upTo each time. This can explain why the 
bulkScorer implementation with a fixed window size performs better than the 
scorer one, but it doesn't explain the difference above.
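
To make the suspected cost concrete, here is a minimal sketch of how upTo and maxScore might be recomputed across clauses. BlockScorer, Window, refresh and fixedBlocks are hypothetical names defined only for this example; they are not Lucene's API and not the code in the PR. The sketch only shows that each refresh touches every clause, so with ~50 clauses and short blocks this bookkeeping could plausibly dominate.

```java
import java.util.Arrays;
import java.util.List;

public class UpToSketch {
  /** Hypothetical per-clause view; not Lucene's API. */
  interface BlockScorer {
    int blockEnd(int target);      // last doc ID of the block that contains target
    float maxScoreUpTo(int upTo);  // score upper bound for docs up to upTo
  }

  /** Shared boundary and summed score upper bound produced by one refresh. */
  static final class Window {
    final int upTo;
    final float maxScore;

    Window(int upTo, float maxScore) {
      this.upTo = upTo;
      this.maxScore = maxScore;
    }
  }

  /** Visits every clause twice, so the cost grows linearly with the clause count. */
  static Window refresh(List<BlockScorer> scorers, int target) {
    int upTo = Integer.MAX_VALUE;
    for (BlockScorer s : scorers) {
      upTo = Math.min(upTo, s.blockEnd(target)); // the smallest block boundary wins
    }
    float maxScore = 0f;
    for (BlockScorer s : scorers) {
      maxScore += s.maxScoreUpTo(upTo);
    }
    return new Window(upTo, maxScore);
  }

  /** Fixed-size blocks with a constant bound, for demonstration only. */
  static BlockScorer fixedBlocks(final int blockSize, final float bound) {
    return new BlockScorer() {
      public int blockEnd(int target) {
        return (target / blockSize + 1) * blockSize - 1;
      }

      public float maxScoreUpTo(int upTo) {
        return bound;
      }
    };
  }

  public static void main(String[] args) {
    List<BlockScorer> scorers = Arrays.asList(fixedBlocks(128, 1.5f), fixedBlocks(32, 0.5f));
    Window w = refresh(scorers, 40);
    System.out.println("upTo=" + w.upTo + " maxScore=" + w.maxScore);
  }
}
```

In this toy setup the clause with the shortest blocks decides upTo, so adding more clauses tends to shrink the window and trigger refreshes more often.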

 
{quote}I wanted to do some more tests so I played with the MSMARCO passages 
dataset, which has the interesting property of having queries that have several 
terms (often around 8-10). See the attached benchmark if you are interested, 
here are the outputs I'm getting for various scorers:

Contrary to my intuition, WAND seems to perform better despite the high number 
of terms. I wonder if there are some improvements we can still make to BMM?
{quote}
Thanks for running these additional tests! The results indeed look interesting. 
I took a look at the MSMarcoPassages.java code you attached, and wonder if it's 
also possible that, since the percentile numbers were computed after sort, for 
some 

[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347362#comment-17347362
 ] 

Zach Chen commented on LUCENE-9335:
---

{quote}The speedup for some of the slower queries looks great. I know Fuzzy1 
and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe 
your change makes them faster?
{quote}
Ah not sure why I didn't think of running them through BMM earlier! I just gave 
them a run, and got the following results:

*BMM Scorer*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           30.46  (24.7%)                    17.63  (11.6%)  -42.1% ( -62% -   -7%)  0.000
Fuzzy2           21.61  (16.4%)                    16.28  (12.0%)  -24.7% ( -45% -    4%)  0.000
PKLookup        216.72   (4.1%)                   215.63   (3.0%)   -0.5% (  -7% -    6%)  0.654
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           30.58   (9.1%)                    22.12   (6.4%)  -27.7% ( -39% -  -13%)  0.000
Fuzzy2           36.07  (12.7%)                    27.05  (10.8%)  -25.0% ( -42% -   -1%)  0.000
PKLookup        215.26   (3.4%)                   213.99   (2.5%)   -0.6% (  -6% -    5%)  0.530
{code}
*BMMBulkScorer without window (with the above scorer implementation)*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           16.32  (22.6%)                    15.68  (16.3%)   -3.9% ( -34% -   45%)  0.527
Fuzzy1           48.11  (17.6%)                    47.48  (13.6%)   -1.3% ( -27% -   36%)  0.791
PKLookup        213.67   (3.2%)                   212.52   (4.0%)   -0.5% (  -7% -    6%)  0.640
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           26.99  (23.2%)                    24.75  (13.6%)   -8.3% ( -36% -   37%)  0.169
PKLookup        216.27   (4.3%)                   216.43   (3.4%)    0.1% (  -7% -    8%)  0.951
Fuzzy1           19.01  (24.2%)                    20.01  (14.2%)    5.3% ( -26% -   57%)  0.400
{code}
*BMMBulkScorer with window size 1024*
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy2           23.56  (26.0%)                    19.08  (13.9%)  -19.0% ( -46% -   28%)  0.004
Fuzzy1           30.97  (31.6%)                    25.82  (16.9%)  -16.6% ( -49% -   46%)  0.038
PKLookup        213.23   (2.5%)                   211.63   (1.8%)   -0.7% (  -5% -    3%)  0.289
{code}
{code:java}
Task      QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff                p-value
Fuzzy1           20.59  (12.1%)                    20.59  (10.5%)   -0.0% ( -20% -   25%)  0.994
PKLookup        205.21   (3.1%)                   206.99   (3.7%)    0.9% (  -5% -    7%)  0.422
Fuzzy2           30.74  (22.7%)                    32.71  (17.0%)    6.4% ( -27% -   59%)  0.311
{code}
 

These results actually look strange to me, as I would expect the BulkScorer 
without a window to perform similarly to the scorer, since it just uses the 
scorer implementation under the hood. I'll need to dive into it more to 
understand what contributed to these differences (their JFR CPU recordings 
look similar too).

From the results I got now, it seems BMM may not be ideal for handling queries 
with many terms. My high-level guess is that with these queries, which can be 
rewritten into boolean queries with ~50 terms, BMM may spend lots of time 
computing upTo and updating maxScore, as the minimum of all the scorers' block 
boundaries is used to update upTo each time. This can explain why the 
bulkScorer implementation with a fixed window size performs better than the 
scorer one, but it doesn't explain the difference above.

 
{quote}I wanted to do some more tests so I played with the MSMARCO passages 
dataset, which has the interesting property of having queries that have several 
terms (often around 8-10). See the attached benchmark if you are interested, 
here are the outputs I'm getting for various scorers:

Contrary to my intuition, WAND seems to perform better despite the high number 
of terms. I wonder if there are some improvements we can still make to BMM?
{quote}
Thanks for running these additional tests! The results indeed look interesting. 
I took a look at the MSMarcoPassages.java code you attached, and wonder if it's 
also possible that, since the percentile numbers were computed after sort, for 
some low percentile (P10 for example) 

[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-17 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346597#comment-17346597
 ] 

Zach Chen commented on LUCENE-9335:
---

[~jpountz] what do you think about the results we got so far? If we are good 
with the trade-off and performance improvement BMM has for _OrHighHigh_ and 
_OrHighMed_ queries, I can work on productizing the changes next.

 







[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-16 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345793#comment-17345793
 ] 

Zach Chen commented on LUCENE-9335:
---

I made some changes to the BulkScorer implementations to return false for BMM 
eligibility as soon as a non-term query is identified, and they improved 
the benchmark results for Fuzzy1 & Fuzzy2 a bit 
([https://github.com/apache/lucene/pull/113/commits/f4115f78be0833b65694ad6a0f9f4f32565091e7]).
However, Fuzzy1 & Fuzzy2 benchmark results appear to vary more across runs and 
queries than those of other tasks.







[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345575#comment-17345575
 ] 

Zach Chen commented on LUCENE-9335:
---

I see now why Fuzzy1 & Fuzzy2 did not trigger the BMM scorer / bulkScorer. 
Those queries were rewritten into boolean queries with boosting (BoostQuery), 
but the BMM eligibility check tested for TermQuery directly 
([https://github.com/apache/lucene/pull/113/files#diff-d500c30048128831b0fe3c53d9bb74eed7d8063e81d33737b26dcd00bc7f1fd2R337]),
hence the BMM scorer / bulkScorer was not invoked for them.

The looping in that check also likely hurt performance for both 
implementations, as fuzzy queries can expand into ones with many subqueries 
(one instance I saw had 50 subqueries), and the current logic goes through 
all of them.
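
The mismatch described above can be sketched as follows. The small Query, TermQuery, BoostQuery and PhraseQuery classes are simplified stand-ins defined only for this example (they are not Lucene's classes), and isBmmEligible is a hypothetical name. The sketch unwraps a boost before testing for a term query and returns at the first non-term clause instead of inspecting every subquery. Whether boosted term clauses should really be eligible is a design decision that is only assumed here.

```java
import java.util.Arrays;
import java.util.List;

public class EligibilitySketch {
  // Simplified stand-ins for illustration only; these are not Lucene's classes.
  interface Query {}

  static final class TermQuery implements Query {
    final String term;

    TermQuery(String term) {
      this.term = term;
    }
  }

  static final class BoostQuery implements Query {
    final Query inner;
    final float boost;

    BoostQuery(Query inner, float boost) {
      this.inner = inner;
      this.boost = boost;
    }
  }

  static final class PhraseQuery implements Query {}

  /** Strips any number of boost wrappers so the underlying clause can be inspected. */
  static Query unwrap(Query q) {
    while (q instanceof BoostQuery) {
      q = ((BoostQuery) q).inner;
    }
    return q;
  }

  /** Hypothetical check: eligible only if every clause is a (possibly boosted) term query. */
  static boolean isBmmEligible(List<Query> clauses) {
    for (Query clause : clauses) {
      if (!(unwrap(clause) instanceof TermQuery)) {
        return false; // stop at the first non-term clause
      }
    }
    return !clauses.isEmpty();
  }

  public static void main(String[] args) {
    List<Query> fuzzyLike = Arrays.asList(
        new BoostQuery(new TermQuery("lucene"), 0.8f),
        new BoostQuery(new TermQuery("lucent"), 0.5f));
    List<Query> mixed = Arrays.asList(new TermQuery("lucene"), new PhraseQuery());
    System.out.println("boosted terms eligible: " + isBmmEligible(fuzzyLike));
    System.out.println("mixed clauses eligible: " + isBmmEligible(mixed));
  }
}
```

With an early return, a 50-clause fuzzy expansion that is not eligible costs one instanceof test in the worst placement-friendly case rather than a full pass over all clauses.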







[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345564#comment-17345564
 ] 

Zach Chen edited comment on LUCENE-9335 at 5/15/21, 6:57 PM:
-

{quote}Are you sure? I believe that fuzzy queries rewrite to boolean queries, 
so they would use your new block-max maxscore under the hood?
{quote}
Hmm, I verified that by throwing a runtime exception in the BMM BulkScorer's 
constructor and running only Fuzzy1 & Fuzzy2 queries in the benchmark, which 
completed successfully. I feel the slowdown may come from the checks to see if 
BMM is applicable. Let me take a further look there.


was (Author: zacharymorn):
> Are you sure? I believe that fuzzy queries rewrite to boolean queries, so 
>they would use your new block-max maxscore under the hood?

Hmm I verified that by throwing runtime exception in the BMM BulkScorer's 
constructor, and running only Fuzz1 & Fuzz2 queries in the benchmark, which 
completed successfully. I feel the slow down may come from the checks to see if 
BMM is applicable. Let me take a further look there.







[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345564#comment-17345564
 ] 

Zach Chen commented on LUCENE-9335:
---

> Are you sure? I believe that fuzzy queries rewrite to boolean queries, so 
>they would use your new block-max maxscore under the hood?

Hmm, I verified that by throwing a runtime exception in the BMM BulkScorer's 
constructor and running only Fuzzy1 & Fuzzy2 queries in the benchmark, which 
completed successfully. I feel the slowdown may come from the checks to see if 
BMM is applicable. Let me take a further look there.







[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344973#comment-17344973
 ] 

Zach Chen commented on LUCENE-9335:
---

Just want to provide a quick summary of the latest progress of this issue. 
Currently there are 3 different BMM implementations from 2 PRs:
 # Scorer based implementation
 ## PR: [https://github.com/apache/lucene/pull/101] 
 ## wikibigall benchmark results: 
[https://github.com/apache/lucene/pull/101#issuecomment-840255508]
 ### On average it improves _OrHighHigh_ by 40%+, and _OrHighMed_ by around 20%
 ### In 1 out of 3 runs it hurt _AndMedOrHighHigh_ and _OrHighMed_ performance by 
around 16%
 # BulkScorer based implementation with fixed window size
 ## PR: [https://github.com/apache/lucene/pull/113] 
 ## wikibigall benchmark with window size 1024 results: 
[https://github.com/apache/lucene/pull/113#issuecomment-840293637]
 ### On average it improves _OrHighHigh_ by 3-8%, and _OrHighMed_ by 23%+
 ### For some reason it hurt Fuzzy1 & Fuzzy2 performance by around 8%, even 
though it wasn't used for those queries
 # BulkScorer based implementation without window, and using the scorer 
implementation from #1
 ## Commit: 
[https://github.com/zacharymorn/lucene/commit/3bcdbb31a7d55b00cb53e4be40a4adc93b9f30db]
 ## wikibigall benchmark results: 
[https://github.com/apache/lucene/pull/113#discussion_r631568912]
 ### On average it improves _OrHighHigh_ by 52%, and _OrHighMed_ by 10%-18%
 ### For some reason it consistently hurt Fuzzy1 & Fuzzy2 performance by 
around 8%-13%, even though it wasn't used for those queries

[~jpountz] what do you think about the above results and the latest changes? 
Are there any other ideas we should try? From the current results, it appears 
option 1 might be the one to go with? I can start productizing the changes and 
adding tests once we have settled on the implementation approach here.

 







[jira] [Commented] (LUCENE-9662) CheckIndex should be concurrent

2021-05-06 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340017#comment-17340017
 ] 

Zach Chen commented on LUCENE-9662:
---

Hi [~mikemccand], I've taken a stab at this and created a WIP PR 
[https://github.com/apache/lucene/pull/128] with some nocommit comments. Could 
you please take a look and let me know your thoughts?

> CheckIndex should be concurrent
> ---
>
> Key: LUCENE-9662
> URL: https://issues.apache.org/jira/browse/LUCENE-9662
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Priority: Major
>
> I am watching a nightly benchmark run slowly run its {{CheckIndex}} step, 
> using a single core out of the 128 cores the box has.
> It seems like this is an embarrassingly parallel problem, if the index has 
> multiple segments, and would finish much more quickly on concurrent hardware 
> if we did "thread per segment".
> If wanted to get even further concurrency, each part of the Lucene index that 
> is checked is also independent, so it could be "thread per segment per part".
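
The "thread per segment" idea in the description could look roughly like the sketch below. It is not the CheckIndex implementation or the code in the WIP PR: checkSegment is a hypothetical placeholder for the real per-segment verification, segments are represented by plain names, and the pool size is an arbitrary choice for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentCheckSketch {
  /** Hypothetical per-segment check; returns a status line instead of doing real I/O. */
  static String checkSegment(String segmentName) {
    return segmentName + ": OK";
  }

  /** Submits one task per segment and collects the results in the original segment order. */
  static List<String> checkAll(List<String> segments, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String segment : segments) {
        futures.add(pool.submit(() -> checkSegment(segment)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> future : futures) {
        results.add(future.get()); // blocks until that segment's check finishes
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while checking segments", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("segment check failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    for (String line : checkAll(Arrays.asList("_0", "_1", "_2"), 4)) {
      System.out.println(line);
    }
  }
}
```

Collecting the futures in submission order keeps the report deterministic even though the checks themselves may finish in any order.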






[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-05 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340005#comment-17340005
 ] 

Zach Chen edited comment on LUCENE-9335 at 5/6/21, 5:40 AM:


No problem! Writing these scorers has actually been a great exercise for me to 
learn more about the scoring-related APIs and benchmark testing. I have 
enjoyed it a lot!

For the profiling, are you referring to JFR? It is currently enabled by default 
in luceneutil, and I've added below the result from queries with 5 "Med" terms 
(queries file _wikimedium.10M.nostopwords.tasks.5OrMeds_ attached):

*BMM Scorer Run 1*
{code:java}
                    TaskQPS baseline      StdDevQPS my_modified_version      
StdDev                Pct diff p-value
       OrMedMedMedMedMed       40.66      (9.7%)       32.77      (7.7%)  
-19.4% ( -33% -   -2%) 0.000
                PKLookup      215.12      (1.5%)      221.56      (1.8%)    
3.0% (   0% -    6%) 0.000
{code}
CPU merged search profile for my_modified_version: 
{code:java}
PROFILE SUMMARY from 12153 events (total: 12153)
  tests.profile.mode=cpu
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
4.24%         515           org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()
4.22%         513           org.apache.lucene.search.BlockMaxMaxscoreScorer$1#updateUpToAndMaxScore()
3.11%         378           java.util.LinkedList#listIterator()
2.53%         307           java.util.LinkedList$ListItr#next()
2.42%         294           java.util.zip.Inflater#inflateBytesBytes()
2.15%         261           org.apache.lucene.search.DisiPriorityQueue#pop()
1.60%         195           jdk.internal.misc.Unsafe#getByte()
1.47%         179           org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()
1.41%         171           java.util.AbstractList$SubList#listIterator()
1.36%         165           java.util.AbstractList#listIterator()
1.31%         159           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.22%         148           org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.21%         147           org.apache.lucene.search.DisiPriorityQueue#add()
1.20%         146           java.util.LinkedList$ListItr#<init>()
1.15%         140           org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.11%         135           java.util.LinkedList$ListItr#checkForComodification()
1.08%         131           java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.04%         126           java.nio.DirectByteBuffer#get()
1.00%         122           java.lang.Object#wait()
1.00%         122           org.apache.lucene.search.DisiPriorityQueue#downHeap()
0.95%         116           java.util.AbstractList$Itr#<init>()
0.82%         100           java.util.regex.Pattern$BmpCharPredicate$$Lambda$103.530539368#is()
0.81%         98            org.apache.lucene.store.ByteBufferGuard#getByte()
0.80%         97            org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.73%         89            sun.nio.fs.UnixNativeDispatcher#open0()
0.73%         89            java.lang.ClassLoader#defineClass1()
0.72%         87            java.lang.invoke.InvokerBytecodeGenerator#emitImplicitConversion()
0.69%         84            org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#loadBlock()
0.63%         77            jdk.internal.util.ArraysSupport#mismatch()
0.63%         76            org.apache.lucene.search.BlockMaxMaxscoreScorer$1#repartitionLists()
{code}
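One way to read the profile above is to add up the sample percentages per top-level class. The sketch below does that for a hand-copied subset of the frames listed above; the helper names are hypothetical, and the totals only cover the listed frames, not the whole profile:

```python
from collections import defaultdict

# (percent of CPU samples, frame) pairs copied from the my_modified_version profile.
samples = [
    (4.24, "org.apache.lucene.search.BlockMaxMaxscoreScorer$1#doAdvance()"),
    (4.22, "org.apache.lucene.search.BlockMaxMaxscoreScorer$1#updateUpToAndMaxScore()"),
    (3.11, "java.util.LinkedList#listIterator()"),
    (2.53, "java.util.LinkedList$ListItr#next()"),
    (1.47, "org.apache.lucene.search.BlockMaxMaxscoreScorer$2#matches()"),
    (1.41, "java.util.AbstractList$SubList#listIterator()"),
    (1.36, "java.util.AbstractList#listIterator()"),
    (1.11, "java.util.LinkedList$ListItr#checkForComodification()"),
    (0.63, "org.apache.lucene.search.BlockMaxMaxscoreScorer$1#repartitionLists()"),
]


def top_level_class(frame: str) -> str:
    """Drop the method name and any nested or anonymous class suffix."""
    class_name = frame.split("#", 1)[0]
    return class_name.split("$", 1)[0]


def totals_by_class(frames):
    """Sum sample percentages per top-level class, rounded to two decimals."""
    totals = defaultdict(float)
    for percent, frame in frames:
        totals[top_level_class(frame)] += percent
    return {name: round(total, 2) for name, total in totals.items()}


for name, total in sorted(totals_by_class(samples).items(), key=lambda kv: -kv[1]):
    print(f"{total:5.2f}%  {name}")
```

For this subset, the BlockMaxMaxscoreScorer frames sum to 10.56% of samples and the java.util.LinkedList frames to 6.75%.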
CPU merged search profile for baseline: 
{code:java}
PROFILE SUMMARY from 9671 events (total: 9671)
  tests.profile.mode=cpu
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT       CPU SAMPLES   STACK
2.96%         286           java.util.zip.Inflater#inflateBytesBytes()
1.78%         172           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.73%         167           org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#seekExact()
1.72%         166           org.apache.lucene.search.DisiPriorityQueue#upHeap()
1.57%         152           java.lang.Object#wait()
1.57%         152           org.apache.lucene.search.DisiPriorityQueue#add()
1.51%         146           org.apache.lucene.search.DisiPriorityQueue#downHeap()
1.43%         138           java.nio.DirectByteBuffer#get()
1.35%         131           java.lang.invoke.InvokerBytecodeGenerator#isStaticallyInvocable()
1.15%         111           java.io.RandomAccessFile#readBytes()
1.10%         106           java.lang.ClassLoader#defineClass1()
1.07%         103           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.01%         98            

[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-05 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-9335:
--
Attachment: wikimedium.10M.nostopwords.tasks.5OrMeds

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-01 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337948#comment-17337948
 ] 

Zach Chen edited comment on LUCENE-9335 at 5/2/21, 4:16 AM:


I tried to modify the _CreateQueries_ class in luceneutil to generate OR 
queries with 5 clauses, but ran into some issues running it. So I did a quick 
hack and combined the queries from OrHighHigh, OrHighMed and OrHighLow to create 
a new OrHighHighMedHighLow task. I've attached the resulting file 
_wikimedium.10M.nostopwords.tasks_ to this ticket. 

Here are the luceneutil results from 2 runs for each implementation:

Scorer [https://github.com/apache/lucene/pull/101]
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
     OrHighHighMedHighLow         30.97      (6.2%)                    24.92      (4.4%)  -19.5% ( -28% -   -9%)   0.000
                 PKLookup        223.53      (2.4%)                   228.10      (3.7%)    2.0% (  -3% -    8%)   0.037
{code}
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
     OrHighHighMedHighLow         32.83      (3.4%)                    34.00      (5.1%)    3.6% (  -4% -   12%)   0.009
                 PKLookup        217.86      (2.8%)                   228.14      (4.2%)    4.7% (  -2% -   12%)   0.000
{code}
BulkScorer 
[https://github.com/apache/lucene/pull/113]
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                 PKLookup        197.84      (4.1%)                   207.79      (4.2%)    5.0% (  -3% -   13%)   0.000
     OrHighHighMedHighLow         32.50     (16.7%)                    35.79      (9.9%)   10.1% ( -14% -   44%)   0.020
{code}
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
     OrHighHighMedHighLow         28.61      (5.4%)                    22.28      (4.2%)  -22.1% ( -30% -  -13%)   0.000
                 PKLookup        227.38      (2.6%)                   233.05      (2.7%)    2.5% (  -2% -    8%)   0.003
{code}
 


> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> 


[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-01 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-9335:
--
Attachment: wikimedium.10M.nostopwords.tasks

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: wikimedium.10M.nostopwords.tasks
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-01 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337742#comment-17337742
 ] 

Zach Chen commented on LUCENE-9335:
---

Hi [~jpountz], I've done another pass and fixed a few issues in 
[https://github.com/apache/lucene/pull/101]. I tried some other optimizations 
as well (such as moving a scorer from the essential to the non-essential list 
every time minCompetitiveScore gets updated), but they didn't seem to improve 
the benchmark results much for pure disjunction queries in either implementation. 
Assuming there's no major miss or bug in the two implementations so far, I also 
feel that, compared with BMW, the main bottleneck in BMM for the 2-clause OR 
queries run by the benchmark is indeed the additional, frequent work needed to 
check and align on the max score boundary.

 

What do you think? Do you have any suggestion where I should look next?
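To make "check and align on the max score boundary" concrete, here is a toy model of that step, assuming each clause exposes the last doc ID and the maximum score of each of its postings blocks. It is a sketch only, not the code from either PR; `next_window` and its parameters are hypothetical names:

```python
from bisect import bisect_left


def next_window(block_ends_per_scorer, block_max_per_scorer, target):
    """Return (up_to, max_score_sum) for the window that starts at `target`.

    block_ends_per_scorer[i] is the sorted list of last doc IDs of the blocks
    of clause i, and block_max_per_scorer[i][j] is the maximum score inside
    block j of clause i. The window ends at the nearest block boundary, so
    every clause stays inside a single block for the whole window.
    """
    up_to = None
    max_score_sum = 0.0
    for ends, maxes in zip(block_ends_per_scorer, block_max_per_scorer):
        j = bisect_left(ends, target)  # first block whose last doc >= target
        if j == len(ends):
            continue  # this clause has no postings at or after target
        up_to = ends[j] if up_to is None else min(up_to, ends[j])
        max_score_sum += maxes[j]
    return up_to, max_score_sum


ends = [[127, 255, 383], [99, 300]]
maxes = [[2.0, 1.5, 3.0], [1.0, 0.5]]
print(next_window(ends, maxes, 0))    # (99, 3.0)
print(next_window(ends, maxes, 100))  # (127, 2.5)
```

Each time the iteration crosses `up_to`, the boundary and the score upper bound have to be recomputed, which is the recurring cost described above.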

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-28 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335125#comment-17335125
 ] 

Zach Chen commented on LUCENE-9335:
---

I've implemented the above strategy and opened a new PR for it: 
[https://github.com/apache/lucene/pull/113]. I used a _BulkScorer_ on top 
of a collection of _Scorers_ though, instead of a _BulkScorer_ on top of a 
collection of _BulkScorers_ as _BooleanScorer_ does, and I hope the difference 
is due to the algorithm rather than to me misunderstanding the intended usage 
of the BulkScorer interface :D . The results from the benchmark util still show 
that it's slower than _WANDScorer_ for 2-clause queries, especially for the 
OrHighHigh task.

 

During the implementation of this BulkScorer I also realized there were some 
issues with the other PR I published earlier, so I'll fix them next and see if 
that will give us better result.
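As a rough illustration of the "bulk scorer over a collection of scorers" shape, the toy model below scores a disjunction one doc ID window at a time. It is plain Python over in-memory postings, not the Lucene BulkScorer API and not the code from the PR; the 2048 default is the window size mentioned elsewhere on this ticket, and no dynamic pruning is done here:

```python
def bulk_score(postings, window_size=2048):
    """Yield (doc_id, summed_score) for a disjunction, one window at a time.

    postings is a list with one entry per clause; each entry is a list of
    (doc_id, score) pairs sorted by doc_id.
    """
    positions = [0] * len(postings)
    max_doc = max((plist[-1][0] for plist in postings if plist), default=-1)
    window_start = 0
    while window_start <= max_doc:
        window_end = window_start + window_size  # exclusive upper bound
        totals = {}
        for i, plist in enumerate(postings):
            pos = positions[i]
            # Consume every posting of this clause that falls in the window.
            while pos < len(plist) and plist[pos][0] < window_end:
                doc, score = plist[pos]
                totals[doc] = totals.get(doc, 0.0) + score
                pos += 1
            positions[i] = pos
        for doc in sorted(totals):
            yield doc, totals[doc]
        window_start = window_end


clauses = [
    [(1, 1.0), (5, 2.0), (3000, 1.0)],
    [(5, 0.5), (2047, 1.0), (2048, 2.0)],
]
print(list(bulk_score(clauses)))
```

Working per window lets per-window state, such as score upper bounds, be computed once and reused for every document in that window.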

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-22 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327129#comment-17327129
 ] 

Zach Chen commented on LUCENE-9335:
---

Makes sense. I guess the general strategy then would be to implement BMM in the 
BulkScorer, doing the maxScore initialization and the essential / non-essential 
list partitioning once per window, so that they are valid only within that 
2048-document boundary. I'll give that a try!
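The essential / non-essential partition can be sketched as follows. This is a minimal illustration of the MaxScore-style split, assuming each clause has a known maximum score for the current window; `partition` is a hypothetical helper, not code from the PR:

```python
def partition(max_scores, min_competitive_score):
    """Split clause indices into (non_essential, essential) lists.

    Clauses are sorted by ascending maximum score. The non-essential list is
    the longest prefix whose summed maximum scores stay strictly below
    min_competitive_score, so a document that matches only non-essential
    clauses cannot be competitive.
    """
    order = sorted(range(len(max_scores)), key=lambda i: max_scores[i])
    non_essential = []
    running_sum = 0.0
    for i in order:
        if running_sum + max_scores[i] < min_competitive_score:
            running_sum += max_scores[i]
            non_essential.append(i)
        else:
            break
    essential = order[len(non_essential):]
    return non_essential, essential


print(partition([3.0, 0.5, 1.0, 2.0], min_competitive_score=2.0))
# ([1, 2], [3, 0])
```

Only essential clauses need to drive iteration; non-essential clauses are consulted just to complete the score of candidates found through the essential ones. As the minimum competitive score rises, more clauses move to the non-essential list.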

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-21 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326323#comment-17326323
 ] 

Zach Chen commented on LUCENE-9335:
---

Hi [~jpountz], I took a stab at implementing BMM and published a new PR here 
for further discussion: [https://github.com/apache/lucene/pull/101]. I'm pretty 
happy about being able to implement a new scorer, even though its performance 
is a bit poor (although it seems to be on par with the experimental results 
published in [http://engineering.nyu.edu/~suel/papers/bmm.pdf] for the BMM and 
BMW comparison on 2-clause OR queries). Shall we consider adding a benchmark 
query set with 5+ clauses to compare performance, as that seems to be where BMM 
may outperform BMW, as the paper suggests?

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-13 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320681#comment-17320681
 ] 

Zach Chen commented on LUCENE-9335:
---

bq. Actually you should be able to do it without modifying the benchmarking 
code, by configuring your Competition object to not verify counts like that in 
your localrun file: {{comp = competition.Competition(verifyCounts=False)}}

Ah I see. Thanks for the tip, will use that going forward!

bq. Indeed this indicates that the query returns different top hits with your 
change. If the change was in the order of one ulp, then this could be due to 
the fact that the sum might depend on the order in which clauses' scores are 
summed up, but given the significant score difference, there must be a bigger 
problem. Have you run tests with this change? This could help figure out where 
the bug is.

Yes, *./gradlew check* was passing before, but I saw your comment in the PR, and 
that calculation was indeed incorrect. Let me correct that and try again.
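The point about summation order can be checked with a small sketch. It uses Python's 64-bit floats rather than the 32-bit floats Lucene scores use, so it only illustrates the principle; `within_ulps` is a hypothetical helper, and `math.ulp` requires Python 3.9 or later. The last comparison uses the two scores from the luceneutil error reported on this ticket:

```python
import math


def within_ulps(a: float, b: float, max_ulps: int = 1) -> bool:
    """True if a and b differ by at most max_ulps units in the last place."""
    return abs(a - b) <= max_ulps * math.ulp(max(abs(a), abs(b)))


# The same three clause scores, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)             # False
print(within_ulps(left_to_right, right_to_left))  # True: a rounding-level difference
print(within_ulps(5.0718417, 4.224689))           # False: far more than rounding
```

A difference of the first kind is harmless reordering noise, while a difference of the second kind points to a real scoring bug.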

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-13 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319923#comment-17319923
 ] 

Zach Chen edited comment on LUCENE-9335 at 4/13/21, 6:08 AM:
-

I made some further changes to move some block-max-related logic from 
DisjunctionMaxScorer to DisjunctionScorer, so that DisjunctionSumScorer can 
inherit it. I've published a WIP PR [https://github.com/apache/lucene/pull/81] 
with those changes for ease of review. 

When I run luceneutil, I see further errors from the verifyScores section of the 
code, which may indicate bugs in my changes:
{code:java}
WARNING: cat=OrHighHigh: hit counts differ: 9870+ vs 2616+
Traceback (most recent call last):
  File "src/python/localrun.py", line 53, in <module>
    comp.benchmark("baseline_vs_patch")
  File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/competition.py", line 455, in benchmark
    randomSeed = self.randomSeed)
  File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/searchBench.py", line 196, in run
    raise RuntimeError('errors occurred: %s' % str(cmpDiffs))
RuntimeError: errors occurred: ([], ["query=body:second body:short filter=None sort=None groupField=None hitCount=9870+: hit 0 has wrong field/score value ([1444649], '5.0718417') vs ([5125], '4.224689')"], 1.0)
{code}
 

I then made further changes in benchUtil.py to skip over verifyScores, so that 
I can see what benchmark results it would generate: 
{code:java}
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..c2faffc 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
     cmpRawResults, heapCmp = parseResults(cmpLogFiles)
 
     # make sure they got identical results
-    cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+    cmpDiffs = compareHits(baseRawResults, cmpRawResults, False, False)
 
     baseResults = collateResults(baseRawResults)
     cmpResults = collateResults(cmpRawResults){code}
 

I then got the following benchmark results from multiple runs
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                OrHighMed        186.44      (2.8%)                   160.50      (4.5%)  -13.9% ( -20% -   -6%)   0.000
                OrHighLow        735.70      (7.5%)                   696.89      (4.3%)   -5.3% ( -15% -    6%)   0.006
                   Fuzzy1         75.85     (11.5%)                    72.81     (14.0%)   -4.0% ( -26% -   24%)   0.323
               TermDTSort        237.49     (10.4%)                   228.02     (10.6%)   -4.0% ( -22% -   18%)   0.230
        HighTermMonthSort        280.82      (9.8%)                   274.90     (10.8%)   -2.1% ( -20% -   20%)   0.518
                   Fuzzy2         54.08     (12.5%)                    53.04     (14.2%)   -1.9% ( -25% -   28%)   0.648
             OrNotHighMed        672.83      (2.7%)                   661.16      (4.7%)   -1.7% (  -8% -    5%)   0.153
     HighTermTitleBDVSort        438.56     (14.4%)                   431.81     (16.6%)   -1.5% ( -28% -   34%)   0.754
               AndHighLow        969.43      (5.2%)                   957.49      (4.7%)   -1.2% ( -10% -    9%)   0.432
            OrNotHighHigh        704.98      (3.4%)                   700.72      (3.9%)   -0.6% (  -7% -    7%)   0.605
              AndHighHigh        109.77      (4.2%)                   109.31      (4.7%)   -0.4% (  -9% -    8%)   0.767
    BrowseMonthSSDVFacets         32.52      (2.1%)                    32.40      (4.6%)   -0.4% (  -6% -    6%)   0.755
                 PKLookup        219.90      (3.1%)                   219.16      (3.2%)   -0.3% (  -6% -    6%)   0.734
                 Wildcard        284.84      (1.9%)                   284.18      (1.8%)   -0.2% (  -3% -    3%)   0.690
                  Prefix3        361.00      (2.1%)                   360.24      (2.0%)   -0.2% (  -4% -    4%)   0.750
     HighIntervalsOrdered         28.68      (2.2%)                    28.64      (1.7%)   -0.1% (  -3% -    3%)   0.819
    BrowseMonthTaxoFacets         13.60      (2.9%)                    13.59      (2.7%)   -0.1% (  -5% -    5%)   0.947
BrowseDayOfYearSSDVFacets         28.67      (4.8%)                    28.66      (4.8%)   -0.0% (  -9% -   10%)   0.979
             HighSpanNear         79.29      (2.4%)                    79.29      (2.2%)    0.0% (  -4% -    4%)   0.997
           OrHighNotHigh      695.37      (5.5%)      696.65      (3.8%)    
0.2% (  -8% -   10%) 0.903
                 MedTerm     1478.47      (3.6%)     1481.54      (3.0%)    
0.2% (  -6% -    7%) 0.843
   HighTermDayOfYearSort      372.12     (14.1%)      373.08     (14.8%)    
0.3% ( -25% -   33%) 0.955
                  IntNRQ      125.36      (1.3%)      125.72      (0.7%)    
0.3% (  -1% -    2%) 0.391
             LowSpanNear       52.82      (1.7%)       52.98      (2.0%)    
0.3% (  -3% -    4%) 0.611
BrowseDayOfYearTaxoFacets       11.28      (3.1%)       11.31      (3.1%)    
0.3% (  -5% -    6%) 0.756
         LowSloppyPhrase      154.42      (2.9%)      154.91      (2.9%)    
0.3% (  -5% -    6%) 0.731
    

[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-13 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319923#comment-17319923
 ] 

Zach Chen commented on LUCENE-9335:
---

I made some further changes to move some of the block-max-related logic from 
DisjunctionMaxScorer to DisjunctionScorer, so that DisjunctionSumScorer can 
inherit it. I've published a WIP PR [https://github.com/apache/lucene/pull/81] 
with those changes for ease of review.

When I ran luceneutil, I saw further errors from the verifyScores section of 
the code, which may indicate bugs in my changes:

 
{code:java}
WARNING: cat=OrHighHigh: hit counts differ: 9870+ vs 2616+
Traceback (most recent call last):
  File "src/python/localrun.py", line 53, in <module>
    comp.benchmark("baseline_vs_patch")
  File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/competition.py", line 455, in benchmark
    randomSeed = self.randomSeed)
  File "/Users/xichen/IdeaProjects/benchmarks/util/src/python/searchBench.py", line 196, in run
    raise RuntimeError('errors occurred: %s' % str(cmpDiffs))
RuntimeError: errors occurred: ([], ["query=body:second body:short filter=None sort=None groupField=None hitCount=9870+: hit 0 has wrong field/score value ([1444649], '5.0718417') vs ([5125], '4.224689')"], 1.0){code}
 

 

I then changed benchUtil.py to skip the verifyScores and verifyCounts checks, 
so that I could see what benchmark results it would generate:

 
{code:java}
diff --git a/src/python/benchUtil.py b/src/python/benchUtil.py
index fb50033..c2faffc 100644
--- a/src/python/benchUtil.py
+++ b/src/python/benchUtil.py
@@ -1203,7 +1203,7 @@ class RunAlgs:
     cmpRawResults, heapCmp = parseResults(cmpLogFiles)
 
     # make sure they got identical results
-    cmpDiffs = compareHits(baseRawResults, cmpRawResults, self.verifyScores, self.verifyCounts)
+    cmpDiffs = compareHits(baseRawResults, cmpRawResults, False, False)
 
     baseResults = collateResults(baseRawResults)
     cmpResults = collateResults(cmpRawResults){code}
 

 

I then got the following benchmark results from multiple runs:

 
{code:java}
                     Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                OrHighMed      186.44      (2.8%)      160.50      (4.5%)  -13.9% ( -20% -   -6%) 0.000
                OrHighLow      735.70      (7.5%)      696.89      (4.3%)   -5.3% ( -15% -    6%) 0.006
                   Fuzzy1       75.85     (11.5%)       72.81     (14.0%)   -4.0% ( -26% -   24%) 0.323
               TermDTSort      237.49     (10.4%)      228.02     (10.6%)   -4.0% ( -22% -   18%) 0.230
        HighTermMonthSort      280.82      (9.8%)      274.90     (10.8%)   -2.1% ( -20% -   20%) 0.518
                   Fuzzy2       54.08     (12.5%)       53.04     (14.2%)   -1.9% ( -25% -   28%) 0.648
             OrNotHighMed      672.83      (2.7%)      661.16      (4.7%)   -1.7% (  -8% -    5%) 0.153
     HighTermTitleBDVSort      438.56     (14.4%)      431.81     (16.6%)   -1.5% ( -28% -   34%) 0.754
               AndHighLow      969.43      (5.2%)      957.49      (4.7%)   -1.2% ( -10% -    9%) 0.432
            OrNotHighHigh      704.98      (3.4%)      700.72      (3.9%)   -0.6% (  -7% -    7%) 0.605
              AndHighHigh      109.77      (4.2%)      109.31      (4.7%)   -0.4% (  -9% -    8%) 0.767
    BrowseMonthSSDVFacets       32.52      (2.1%)       32.40      (4.6%)   -0.4% (  -6% -    6%) 0.755
                 PKLookup      219.90      (3.1%)      219.16      (3.2%)   -0.3% (  -6% -    6%) 0.734
                 Wildcard      284.84      (1.9%)      284.18      (1.8%)   -0.2% (  -3% -    3%) 0.690
                  Prefix3      361.00      (2.1%)      360.24      (2.0%)   -0.2% (  -4% -    4%) 0.750
     HighIntervalsOrdered       28.68      (2.2%)       28.64      (1.7%)   -0.1% (  -3% -    3%) 0.819
    BrowseMonthTaxoFacets       13.60      (2.9%)       13.59      (2.7%)   -0.1% (  -5% -    5%) 0.947
BrowseDayOfYearSSDVFacets       28.67      (4.8%)       28.66      (4.8%)   -0.0% (  -9% -   10%) 0.979
             HighSpanNear       79.29      (2.4%)       79.29      (2.2%)    0.0% (  -4% -    4%) 0.997
            OrHighNotHigh      695.37      (5.5%)      696.65      (3.8%)    0.2% (  -8% -   10%) 0.903
                  MedTerm     1478.47      (3.6%)     1481.54      (3.0%)    0.2% (  -6% -    7%) 0.843
    HighTermDayOfYearSort      372.12     (14.1%)      373.08     (14.8%)    0.3% ( -25% -   33%) 0.955
                   IntNRQ      125.36      (1.3%)      125.72      (0.7%)    0.3% (  -1% -    2%) 0.391
              LowSpanNear       52.82      (1.7%)       52.98      (2.0%)    0.3% (  -3% -    4%) 0.611
BrowseDayOfYearTaxoFacets       11.28      (3.1%)       11.31      (3.1%)    0.3% (  -5% -    6%) 0.756
          LowSloppyPhrase      154.42      (2.9%)      154.91      (2.9%)    0.3% (  -5% -    6%) 0.731
                MedPhrase      143.27
