[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, 
WANDScorer is not actually used there ([most of the time it doesn't get 
rewritten to 
BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
In addition, the PR already does a Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!
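
As a rough illustration of why the boost ratio could matter: in BM25F-style scoring, per-field term frequencies are scaled by the field boosts and summed before the BM25 saturation is applied. The sketch below is not Lucene's implementation; the class and method names, the sample term frequencies, and k1 = 1.2 are all assumptions, and length normalization is left out to keep it short.

```java
final class Bm25fSketch {
    // Combined term frequency: each field's term frequency scaled by that field's boost.
    static double combinedFreq(double[] boosts, int[] termFreqs) {
        if (boosts.length != termFreqs.length) {
            throw new IllegalArgumentException("one boost per field is required");
        }
        double sum = 0;
        for (int i = 0; i < boosts.length; i++) {
            sum += boosts[i] * termFreqs[i];
        }
        return sum;
    }

    // BM25 term-frequency saturation with length normalization omitted (assumption: b = 0).
    static double saturate(double freq, double k1) {
        return freq / (freq + k1);
    }

    public static void main(String[] args) {
        int[] termFreqs = {1, 3}; // hypothetical: 1 occurrence in title, 3 in body
        double k1 = 1.2;          // assumed value
        // Boosts 4/2 (as tested in the thread) versus 20/1 (as suggested).
        System.out.println(saturate(combinedFreq(new double[] {4, 2}, termFreqs), k1));
        System.out.println(saturate(combinedFreq(new double[] {20, 1}, termFreqs), k1));
    }
}
```

With a larger title/body ratio, a single title match dominates the combined frequency, which may change how much of the score range the pruning logic can rule out.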


was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, 
WANDScorer is not used there. In addition, the PR already does a 
Maxscore-like calculation based on competitive impacts to skip docs. Am I 
missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!

> CombinedFieldsQuery needs dynamic pruning support
> -
>
> Key: LUCENE-10061
> URL: https://issues.apache.org/jira/browse/LUCENE-10061
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: CombinedFieldQueryTasks.wikimedium.10M.nostopwords.tasks
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> CombinedFieldQuery's Scorer doesn't implement advanceShallow/getMaxScore, 
> forcing Lucene to collect all matches in order to figure the top-k hits.
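
To illustrate what such upper bounds buy in general, here is a toy sketch of block-level pruning. It does not use Lucene's Scorer API; the class name, method name, block layout, and scores are all made up for the example. The only idea it shows is that once k hits are collected, a block whose maximum possible score cannot beat the current k-th best score can be skipped without scoring its documents.

```java
import java.util.PriorityQueue;

final class BlockMaxSketch {
    // Returns how many documents were scored while collecting the top-k scores (k >= 1).
    // blocks[b] holds the scores of the documents in block b; blockMax[b] is an
    // upper bound on those scores that is known without visiting the documents.
    static int countScoredDocs(double[][] blocks, double[] blockMax, int k, boolean prune) {
        PriorityQueue<Double> topK = new PriorityQueue<>(); // min-heap: head is the k-th best score
        int scored = 0;
        for (int b = 0; b < blocks.length; b++) {
            // Once k hits are collected, skip any block whose upper bound
            // cannot beat the current k-th best score.
            if (prune && topK.size() == k && blockMax[b] <= topK.peek()) {
                continue;
            }
            for (double score : blocks[b]) {
                scored++;
                topK.offer(score);
                if (topK.size() > k) {
                    topK.poll(); // drop the lowest score to keep only k entries
                }
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        double[][] blocks = {{1.0, 5.0}, {0.5, 0.2}, {6.0, 0.1}}; // made-up scores
        double[] blockMax = {5.0, 0.5, 6.0};
        System.out.println(countScoredDocs(blocks, blockMax, 2, false)); // exhaustive
        System.out.println(countScoredDocs(blocks, blockMax, 2, true));  // with skipping
    }
}
```

Without usable upper bounds, every block has to be visited, which corresponds to the "collect all matches" behavior described in the issue.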



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-08 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440873#comment-17440873
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/9/21, 3:54 AM:
--

{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, 
WANDScorer is not actually used there ([it doesn't get rewritten to 
BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
In addition, the PR already does a Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!


was (Author: zacharymorn):
{quote}Thanks for exploring this area [~zacharymorn]!
{quote}
No problem, I'm always interested in exploring and learning about Lucene 
querying!
{quote}I wonder if LUCENE-9335 could be helpful to reduce the overhead of 
pruning, since Maxscore tends to have lower overhead than WAND.
{quote}
Based on my current understanding and testing of CombinedFieldQuery, 
WANDScorer is not actually used there ([most of the time it doesn't get 
rewritten to 
BooleanQuery|https://github.com/apache/lucene/blob/ded77d8bfdcdbf7cc2547e67833434a56f2edd16/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L256-L261]).
In addition, the PR already does a Maxscore-like calculation based on 
competitive impacts to skip docs. Am I missing anything here?
{quote}I see that you tested with 4 and 2 as boost values. I wonder if it makes 
a difference if you try out e.g. 20 and 1 instead. I just looked again at table 
3.1 on 
[https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf] and 
the optimal weights that they found for title/body were 38.4/1 on one dataset 
and 13.5/1 on another dataset.
{quote}
Sounds good, will give that a try!







[jira] [Comment Edited] (LUCENE-10061) CombinedFieldsQuery needs dynamic pruning support

2021-11-04 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439028#comment-17439028
 ] 

Zach Chen edited comment on LUCENE-10061 at 11/5/21, 4:50 AM:
--

Hi [~jpountz], I've implemented a quick optimization that replaces the 
combinatorial calculation with an upper-bound approximation 
([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]).

With this and other bug fixes and optimizations guided by the CPU profiler, I 
got the following performance test results (perf test index rebuilt to enable 
norms for the title field, task file attached, and luceneutil integration 
available at [https://github.com/mikemccand/luceneutil/pull/148]):
{code:java}
# Run 1
Task             QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
CFQHighHighHigh          4.64   (6.5%)                     3.30   (4.7%)  -29.0% ( -37% -  -19%)    0.000
CFQHighHigh             11.09   (6.0%)                     9.61   (6.0%)  -13.3% ( -23% -   -1%)    0.000
PKLookup               103.38   (4.4%)                   108.04   (4.3%)    4.5% (  -4% -   13%)    0.001
CFQHighMedLow           10.58   (6.1%)                    12.30   (8.7%)   16.2% (   1% -   33%)    0.000
CFQHighMed              10.70   (7.4%)                    15.51  (11.2%)   44.9% (  24% -   68%)    0.000
CFQHighLowLow            8.18   (8.2%)                    12.87  (11.6%)   57.3% (  34% -   84%)    0.000
CFQHighLow              14.57   (7.5%)                    30.81  (15.1%)  111.4% (  82% -  144%)    0.000

# Run 2
Task             QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
CFQHighHighHigh          5.33   (5.7%)                     4.02   (7.7%)  -24.4% ( -35% -  -11%)    0.000
CFQHighLowLow           17.14   (6.2%)                    13.06   (5.4%)  -23.8% ( -33% -  -13%)    0.000
CFQHighMed              17.37   (5.8%)                    14.38   (7.7%)  -17.2% ( -29% -   -3%)    0.000
PKLookup               103.57   (5.5%)                   108.84   (5.9%)    5.1% (  -6% -   17%)    0.005
CFQHighMedLow           11.25   (7.2%)                    12.70   (9.0%)   12.9% (  -3% -   31%)    0.000
CFQHighHigh              5.00   (6.2%)                     7.54  (12.1%)   51.0% (  30% -   73%)    0.000
CFQHighLow              21.60   (5.2%)                    34.57  (14.1%)   60.0% (  38% -   83%)    0.000

# Run 3
Task             QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
CFQHighHighHigh          5.40   (6.9%)                     4.06   (5.1%)  -24.8% ( -34% -  -13%)    0.000
CFQHighMedLow            7.64   (7.4%)                     5.79   (6.3%)  -24.2% ( -35% -  -11%)    0.000
CFQHighHigh             11.11   (7.0%)                     9.60   (5.9%)  -13.6% ( -24% -    0%)    0.000
CFQHighLowLow           21.21   (7.6%)                    21.22   (6.6%)    0.0% ( -13% -   15%)    0.993
PKLookup               103.15   (5.9%)                   107.60   (6.9%)    4.3% (  -8% -   18%)    0.034
CFQHighLow              21.85   (8.1%)                    34.18  (13.5%)   56.4% (  32% -   84%)    0.000
CFQHighMed              12.07   (8.4%)                    19.98  (16.7%)   65.5% (  37% -   98%)    0.000

# Run 4
Task             QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                p-value
CFQHighHigh              8.50   (5.8%)                     6.85   (5.2%)  -19.5% ( -28% -   -8%)    0.000
CFQHighMedLow           10.89   (5.7%)                     8.96   (5.4%)  -17.8% ( -27% -   -7%)    0.000
CFQHighMed               8.41   (5.8%)                     7.74   (5.6%)   -7.9% ( -18% -    3%)    0.000
CFQHighHighHigh          3.45   (6.7%)                     3.38   (5.3%)   -2.0% ( -13% -   10%)    0.287
CFQHighLowLow            7.82   (6.4%)                     8.20   (7.5%)    4.8% (  -8% -   20%)    0.030
PKLookup               103.50   (5.0%)                   110.69   (5.4%)    6.9% (  -3% -   18%)    0.000
CFQHighLow              11.46   (6.0%)                    13.16   (6.7%)   14.8% (   1% -   29%)    0.000
{code}
I think overall this shows that pruning is most effective when there's a 
significant difference between the terms' frequencies, but slows things down 
when they are close, as the cost of pruning then outweighs the benefit of 
skipping. I'm wondering if we should gate the pruning by checking the 
frequencies as well, but from some quick trials that seems to be an expensive 
operation. Do you have any recommendation for this scenario?
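
One possible shape for such a gate, purely as a sketch: compare the largest and smallest document frequencies among the query terms and enable pruning only when their ratio reaches a threshold. This is a hypothetical heuristic, not what the PR does; the class name, method name, and threshold value are assumptions, and the sketch takes the document frequencies as given, so it says nothing about the cost of obtaining them.

```java
final class PruningGateSketch {
    // Returns true when the most frequent term is at least minRatio times as
    // frequent as the rarest term (hypothetical gating rule).
    static boolean shouldPrune(long[] docFreqs, double minRatio) {
        if (docFreqs.length == 0) {
            throw new IllegalArgumentException("at least one term is required");
        }
        long min = Long.MAX_VALUE;
        long max = 0;
        for (long df : docFreqs) {
            min = Math.min(min, df);
            max = Math.max(max, df);
        }
        // Clamp the denominator to 1 so a term that matches no documents
        // does not cause a division by zero.
        return (double) max / Math.max(min, 1) >= minRatio;
    }

    public static void main(String[] args) {
        double minRatio = 10.0; // assumed threshold, would need tuning
        System.out.println(shouldPrune(new long[] {1_000_000, 1_000}, minRatio));   // high + low term
        System.out.println(shouldPrune(new long[] {1_000_000, 900_000}, minRatio)); // two high terms
    }
}
```

Whether a rule like this pays off would depend on how cheaply the frequencies can be read, which is the open question above.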


was (Author: zacharymorn):
Hi [~jpountz], I've implemented a quick optimization that replaces the 
combinatorial calculation with an upper-bound approximation 
([commit|https://github.com/apache/lucene/pull/418/commits/2ba435e5c83f870be95662c951c9818111843a59]).

With this and other bug fixes / optimizations based on CPU profiler, I was able 
to get the following