[jira] [Updated] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-23 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-9335:
--
Attachment: JFR result for BMM scorer with optimizations May 22.png

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: JFR result for BMM scorer with optimizations May 22.png, 
> MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more; I suspect 
> we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.
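
For context, a minimal sketch of what the BulkScorer direction looks like, using Lucene's public BulkScorer API. The class name and loop body are illustrative only, not the proposed disjunction implementation; the point is that scoring happens per window of docs, so pruning decisions can be amortized rather than made per document:
{code:java}
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.Bits;

// A BulkScorer scores a whole window [min, max) of docs per call, so
// per-window decisions (like re-partitioning scorers for dynamic pruning)
// can replace per-document ones. This trivial version wraps a single Scorer.
final class WindowBulkScorerSketch extends BulkScorer {
  private final Scorer scorer;

  WindowBulkScorerSketch(Scorer scorer) {
    this.scorer = scorer;
  }

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
      throws IOException {
    collector.setScorer(scorer);
    DocIdSetIterator it = scorer.iterator();
    int doc = it.docID();
    if (doc < min) {
      doc = it.advance(min);
    }
    for (; doc < max; doc = it.nextDoc()) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
    }
    return doc; // first doc beyond the window, or NO_MORE_DOCS
  }

  @Override
  public long cost() {
    return scorer.iterator().cost();
  }
}
{code}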



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-23 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349941#comment-17349941
 ] 

Zach Chen commented on LUCENE-9335:
---

Hi [~jpountz], I've tried out a few ideas in the last few days. They gave 
some improvements (though they also made OrMedMedMedMedMed worse), but the 
scorer still doesn't perform as well as BMW on the MSMARCO passages dataset. 
The ideas I tried include:
 # Move scorers from the essential to the non-essential list when minCompetitiveScore 
increases (mentioned in the paper; see the sketch after this list)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Use scorer.score() instead of maxScore when evaluating a candidate doc against 
minCompetitiveScore, to prune more docs (reverting your previous optimization)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/e5f10e31a84c0bab687fbac7d3f05274472a1288]
 # Reduce the maxScore contribution from the non-essential list during candidate doc 
evaluation, for scorers that cannot match the candidate
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/881dbf8fc1c04b8c5d2cb0f19e4e3e44ef595f3d]
 # Use the maximum of each scorer's upTo for the maxScore boundary instead of the 
minimum (the opposite of what the paper suggests)
 ## commit: 
[https://github.com/apache/lucene/pull/101/commits/466a2d9292e300cbf00312f3477d95a14c41c188]
 ## This causes OrMedMedMedMedMed to degrade by 40%
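
To make #1 concrete, here is a minimal sketch of the re-partitioning step. The ScorerWrapper class and all field names are hypothetical simplifications, not the actual PR code; it only illustrates the MAXSCORE-style invariant that a doc matching only non-essential scorers cannot be competitive, so the boundary moves as minCompetitiveScore grows:
{code:java}
import java.util.Comparator;
import java.util.List;

final class MaxScorePartition {
  // Hypothetical wrapper around a sub-scorer and its max score contribution.
  static final class ScorerWrapper {
    double maxScore;
    // ... iterator, per-scorer state, etc.
  }

  private final List<ScorerWrapper> scorers; // sorted by maxScore, ascending
  private int firstEssential;                // index of the first essential scorer

  MaxScorePartition(List<ScorerWrapper> scorers) {
    scorers.sort(Comparator.comparingDouble(s -> s.maxScore));
    this.scorers = scorers;
    this.firstEssential = 0;
  }

  /** Called whenever the collector raises minCompetitiveScore. */
  void onMinCompetitiveScoreIncrease(double minCompetitiveScore) {
    double nonEssentialSum = 0;
    for (int i = 0; i < firstEssential; i++) {
      nonEssentialSum += scorers.get(i).maxScore;
    }
    // Demote scorers whose maxScore, added to the current non-essential sum,
    // still cannot make a document competitive on its own. The boundary only
    // ever moves toward the essential side, so this stays cheap.
    while (firstEssential < scorers.size()
        && nonEssentialSum + scorers.get(firstEssential).maxScore < minCompetitiveScore) {
      nonEssentialSum += scorers.get(firstEssential).maxScore;
      firstEssential++;
    }
  }
}
{code}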

Collectively, these gave a 70~90% performance boost to OrHighHigh and 60~150% to 
OrHighMed, with smaller improvements to AndHighOrMedMed, but at the expense of 
OrMedMedMedMedMed performance (-40% with the changes in #4).

For the MSMARCO passages dataset, they now give the following results (modified 
slightly from your version to show more percentiles and to add commas as digit 
separators for readability):

*BMW Scorer*
{code:java}
AVG: 23,252,992.375
P25: 6,298,463
P50: 13,007,148
P75: 26,868,222
P90: 56,683,505
P95: 84,333,397
P99: 154,185,321
Collected AVG: 8,168.523
Collected P25: 1,548
Collected P50: 2,259
Collected P75: 3,735
Collected P90: 6,228
Collected P95: 13,063
Collected P99: 221,894{code}

*BMM Scorer*
{code:java}
AVG: 41,970,641.638
P25: 8,654,210
P50: 21,553,366
P75: 51,519,172
P90: 109,510,378
P95: 154,534,017
P99: 266,141,446
Collected AVG: 16,810.392
Collected P25: 2,769
Collected P50: 7,159
Collected P75: 20,077
Collected P90: 43,031
Collected P95: 69,984
Collected P99: 135,253
{code}
 

I've also attached "JFR result for BMM scorer with optimizations May 22.png", the 
profiling result for the BMM scorer with the latest changes. Overall, the larger 
number of docs collected by BMM seems to be becoming the performance bottleneck: 
around 50% of the computation was spent in SimpleTopScoreDocCollector#collect / 
BlockMaxMaxscoreScorer#score computing scores for candidate docs, and around 34% 
was spent finding the next doc in BlockMaxMaxscoreScorer#nextDoc. If there's a way 
to prune more docs faster, it should improve BMM further.
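
As a concrete reading of the candidate evaluation that ideas #2 and #3 modify, here is a hedged sketch (the class, its fields, and the partition bookkeeping are hypothetical simplifications, not the actual PR code): the doc is scored with the essential scorers first, then non-essential scorers are consulted from largest maxScore down, bailing out as soon as the remaining maxScore budget cannot reach minCompetitiveScore.
{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

final class CandidatePruningSketch {
  private final List<Scorer> essential;        // hypothetical partitions
  private final List<Scorer> nonEssentialDesc; // sorted by max score, descending
  private final double[] nonEssentialMaxScores;
  private final double nonEssentialMaxScoreSum;

  CandidatePruningSketch(List<Scorer> essential, List<Scorer> nonEssentialDesc,
                         double[] nonEssentialMaxScores) {
    this.essential = essential;
    this.nonEssentialDesc = nonEssentialDesc;
    this.nonEssentialMaxScores = nonEssentialMaxScores;
    double sum = 0;
    for (double m : nonEssentialMaxScores) sum += m;
    this.nonEssentialMaxScoreSum = sum;
  }

  /** Returns the doc's score, or -Inf once it provably cannot be competitive. */
  double scoreCandidate(int doc, double minCompetitiveScore) throws IOException {
    double score = 0;
    for (Scorer s : essential) {
      if (s.iterator().docID() == doc) {
        score += s.score(); // idea #2: use the real score, not maxScore
      }
    }
    double remaining = nonEssentialMaxScoreSum;
    for (int i = 0; i < nonEssentialDesc.size(); i++) {
      if (score + remaining <= minCompetitiveScore) {
        return Double.NEGATIVE_INFINITY; // prune: maxScore budget exhausted
      }
      Scorer s = nonEssentialDesc.get(i);
      DocIdSetIterator it = s.iterator();
      if (it.docID() < doc) {
        it.advance(doc);
      }
      if (it.docID() == doc) {
        score += s.score();
      }
      remaining -= nonEssentialMaxScores[i]; // idea #3: a non-matching scorer
    }                                        // stops contributing its maxScore
    return score;
  }
}
{code}
The earlier a high-maxScore scorer is ruled out for the candidate, the sooner the budget shrinks below minCompetitiveScore, which is what makes pruning more docs faster.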




[jira] [Created] (LUCENE-9969) DirectoryTaxonomyReader.taxoArray uses too much memory, causing system OOM crashes

2021-05-23 Thread FengFeng Cheng (Jira)
FengFeng Cheng created LUCENE-9969:
--

 Summary: DirectoryTaxonomyReader.taxoArray uses too much memory, causing system OOM crashes
 Key: LUCENE-9969
 URL: https://issues.apache.org/jira/browse/LUCENE-9969
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Affects Versions: 6.6.2
Reporter: FengFeng Cheng
 Attachments: image-2021-05-24-13-43-43-289.png

First of all, the data volume is very large: the JVM heap is 90 GB, but TaxonomyIndexArrays takes up almost half of it.

!image-2021-05-24-13-43-43-289.png!

Is there a better way to use TaxonomyReader, or some other optimization for this?
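
One common mitigation, sketched below under the assumption that new DirectoryTaxonomyReader instances are being opened repeatedly (the SharedTaxonomyReader class is hypothetical; whether it applies depends on how readers are opened here): keep a single shared reader and refresh it via TaxonomyReader.openIfChanged, so only one copy of the taxonomy arrays stays resident instead of one per reader.
{code:java}
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.store.Directory;

// Sketch: one shared taxonomy reader per JVM, refreshed in place.
// openIfChanged returns null when nothing changed, and a refreshed reader
// can reuse state from the old one rather than rebuilding it from scratch.
final class SharedTaxonomyReader {
  private volatile DirectoryTaxonomyReader reader;

  SharedTaxonomyReader(Directory taxoDir) throws IOException {
    this.reader = new DirectoryTaxonomyReader(taxoDir);
  }

  synchronized void maybeRefresh() throws IOException {
    DirectoryTaxonomyReader newReader = TaxonomyReader.openIfChanged(reader);
    if (newReader != null) {
      DirectoryTaxonomyReader old = reader;
      reader = newReader;
      old.close(); // real code would use incRef/decRef for in-flight searches
    }
  }

  DirectoryTaxonomyReader get() {
    return reader;
  }
}
{code}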


