[ 
https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-9107:
------------------------------------
    Description: 
In [1] a {{CommonTermsQuery}} is used to run a query with a large number of 
(duplicate) terms. With a max term frequency cutoff of 0.999 for low-frequency 
terms, the query, although big, finishes in around 200-300ms with Lucene 7.6.0. 
However, after upgrading the code to Lucene 8.x, the query takes 2-3s instead 
[2].
After digging into it a bit, the slowdown seems to come from the top-k scoring 
that version 8 introduced by default, though I am not sure where exactly in the 
code.
When switching back to complete hit scoring [3], the query time goes back to 
the initial 200-300ms, also in Lucene 8.3.x.
It'd be nice to understand why this is happening and whether it only concerns 
{{CommonTermsQuery}} or affects {{BooleanQuery}} as well.
If this depends on the data and application involved (Anserini in this case), 
the application should handle it; otherwise, if it is a regression/bug in 
Lucene, it'd be nice to fix it.

[1] : 
https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
[2] : 
https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
[3] : 
https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174
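
To check whether a slowdown of this kind reproduces, the same query can be timed under both scoring modes. The following is a minimal sketch of such a timing harness using only the JDK; the class and method names ({{LatencyCompare}}, {{medianNanos}}, {{slowdown}}) are hypothetical and not part of Lucene or Anserini. The Lucene wiring appears only in a comment and is an assumption: as far as I know, passing {{Integer.MAX_VALUE}} as the total hits threshold to {{TopScoreDocCollector.create(numHits, totalHitsThreshold)}} in Lucene 8.x requests complete scoring, but this has not been checked against what [3] actually does.

```java
import java.util.Arrays;

/** Times a search action repeatedly and reports the median latency. */
final class LatencyCompare {

    // Hypothetical wiring (needs Lucene 8.x on the classpath; not compiled here,
    // and searcher.search would need its IOException handled inside the lambda):
    //   topK     -> searcher.search(query, TopScoreDocCollector.create(k, 1000));
    //   complete -> searcher.search(query, TopScoreDocCollector.create(k, Integer.MAX_VALUE));

    /** Runs warm-up iterations, then returns the median wall-clock time in nanoseconds. */
    static long medianNanos(Runnable search, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            search.run(); // let the JIT and caches settle before measuring
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            search.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    /** Median of the samples (upper median for even counts); the input is not modified. */
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    /** Ratio of the two medians; values well above 1 mean top-k scoring is slower. */
    static double slowdown(long topKNanos, long completeNanos) {
        return (double) topKNanos / completeNanos;
    }
}
```

With figures in the range reported above (roughly 2.5s with top-k scoring against 250ms with complete scoring), {{slowdown}} would return about 10. The median is used rather than the mean so that a single GC pause or cold-cache run does not dominate the comparison.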

  was:
In [1] a {{CommonTermsQuery}} is used in order to perform a query with lots of 
(duplicate) terms. Using a max term frequency cutoff of 0.999 for low frequency 
terms, the query, although big, finishes in around 2-300ms with Lucene 7.6.0. 
However, when upgrading the code to Lucene 8.x, the query runs in 2-3s instead 
[2].
After digging a bit into it it seems that the regression in speed comes from 
the fact that top-k scoring introduced by default in version 8 is causing that, 
not sure "where" exactly in the code though.
When switching back to complete hit scoring [3], the speed goes back to the 
initial 2-300ms also in Lucene 8.3.x.
It'd be nice to understand the reason why this is happening and if it is only 
concerning {{CommonTermsQuery}} or affecting {{BooleanQuery}} as well.
If this is a case that depends on the data and application involved (Anserini 
in this case) otherwise if it is a regression/bug in Lucene it'd be nice to fix 
it.

[1] : 
https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
[2] : 
https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
[3] : 
https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174


> CommonTermsQuery with huge no. of terms slower with top-k scoring
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major


