[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores

2020-03-10 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927
 ] 

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:15 PM:
--

I was actually just looking at a [user 
report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, which in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedTermQuery/FuzzyQuery should be able to find some 
form of warning in the docs (as long as 
[LUCENE-8840|https://issues.apache.org/jira/browse/LUCENE-8840] is not fixed)?


was (Author: micpalmia):
I was actually just looking at a [user 
report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, that in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedQuery/FuzzyQuery should be able to find some 
form of warning in the docs?

> Blended queries with boolean rewrite can result in inconsistent scores
> --
>
> Key: LUCENE-9269
> URL: https://issues.apache.org/jira/browse/LUCENE-9269
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 8.4
> Reporter: Michele Palmia
> Priority: Minor
> Attachments: LUCENE-9269-test.patch
>
>
> If two blended queries are SHOULD clauses of a boolean query and are built so 
> that
>  * some of their terms are the same
>  * their rewrite method is BlendedTermQuery.BOOLEAN_REWRITE
> the docFreq for the overlapping terms used for scoring is picked as follows:
>  # if the overlapping terms are not boosted, the df of the term in the first 
> blended query is used
>  # if any of the overlapping terms is boosted, the df is picked at (what 
> looks like) random.
> A few examples using a field with 2 terms: f:a (df: 2), and f:b (df: 3).
> {code:java}
> a)
> Blended(f:a f:b) Blended(f:a)
>      df: 3         df: 2
> gets rewritten to:
> (f:a)^2.0 (f:b)
>   df: 3   df: 2
> b)
> Blended(f:a) Blended(f:a f:b)
>    df: 2         df: 3
> gets rewritten to:
> (f:a)^2.0 (f:b)
>   df: 2   df: 2
> c)
> Blended(f:a f:b^0.66) Blended(f:a^0.75)
>        df: 3              df: 2
> gets rewritten to:
> (f:a)^1.75 (f:b)^0.66
>   df: ?      df: 2
> {code}
> with ? either 2 or 3, depending on the run.
>  
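
To illustrate why the choice of df changes the score, here is a minimal sketch, assuming the default BM25 similarity, whose idf has the form ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The index size of 10 documents and the names IdfSketch and idf are hypothetical, chosen only for illustration; the issue does not state the index size.

```java
public class IdfSketch {

    /**
     * BM25-style inverse document frequency.
     * docCount is an assumed index size; it is not given in the issue.
     */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 10; // hypothetical number of documents in the index

        // The same term f:a, scored with either of the two candidate df values.
        System.out.println("idf with df=2: " + idf(2, docCount));
        System.out.println("idf with df=3: " + idf(3, docCount));
    }
}
```

The smaller df yields the larger idf, so the same document receives a different score for f:a depending on which of the two df values ends up being used.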



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores

2020-03-10 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927
 ] 

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:07 PM:
--

I was actually just looking at a [user 
report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, which in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedTermQuery/FuzzyQuery should be able to find some 
form of warning in the docs?


was (Author: micpalmia):
I was actually just looking at a [user 
report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, that in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedQuery/FuzzyQuery should be able to find some 
form of warning in the docs?







[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores

2020-03-10 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927
 ] 

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:05 PM:
--

I was actually just looking at a [user 
report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, which in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedTermQuery/FuzzyQuery should be able to find some 
form of warning in the docs?


was (Author: micpalmia):
I was actually just looking at a [user 
report|[https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser]]
 that came to lucene-dev and looked interesting. In their use case, they were 
using fuzzy queries, that in turn generate blended queries that are affected by 
this issue. Maybe users of BlendedQuery/FuzzyQuery should be able to find some 
form of warning in the docs?







[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores

2020-03-10 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055891#comment-17055891
 ] 

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 12:57 PM:
---

I added a very simple test (with my very limited Lucene testing skills) that 
emulates example c) above and checks for the score of the top document. As 
there is no "right" score, I just check for one of the two possible scores and 
have the test fail on the other.

I'm having a hard time wrapping my head around what the right behavior should 
be in this case (and thus coming up with a more sensible test and fix).

In case that's useful, I should probably add that the randomness in the scoring 
behavior is due to the HashMap underlying MultiSet: when SHOULD clauses are 
processed for deduplication, they're served in an arbitrary order (see 
[BooleanQuery.java:370|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L370]).
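
The order dependence described above can be sketched with plain Java collections. This is a simplified model, not Lucene's actual code: the Clause class and the deduplicate and survivingDocFreq methods are hypothetical stand-ins, and the only assumption carried over is that two clauses for the same term compare equal even when they carry different df values. The two iteration orders are passed in explicitly so the sketch is reproducible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupOrderSketch {

    /** Hypothetical stand-in for a term clause; equality deliberately ignores docFreq and boost. */
    static final class Clause {
        final String term;
        final int docFreq;
        final double boost;

        Clause(String term, int docFreq, double boost) {
            this.term = term;
            this.docFreq = docFreq;
            this.boost = boost;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Clause && ((Clause) other).term.equals(term);
        }

        @Override
        public int hashCode() {
            return term.hashCode();
        }
    }

    /** Merges equal clauses by summing their boosts; the first instance seen stays as the map key. */
    static Map<Clause, Double> deduplicate(List<Clause> clauses) {
        Map<Clause, Double> merged = new HashMap<>();
        for (Clause clause : clauses) {
            merged.merge(clause, clause.boost, Double::sum);
        }
        return merged;
    }

    /** Returns the docFreq carried by the clause that survived deduplication for the given term. */
    static int survivingDocFreq(List<Clause> clauses, String term) {
        for (Clause key : deduplicate(clauses).keySet()) {
            if (key.term.equals(term)) {
                return key.docFreq;
            }
        }
        throw new IllegalArgumentException("no clause for term " + term);
    }

    public static void main(String[] args) {
        // The overlapping term f:a from example c), once per blended query.
        Clause fromFirst = new Clause("f:a", 3, 1.0);
        Clause fromSecond = new Clause("f:a", 2, 0.75);

        System.out.println(survivingDocFreq(Arrays.asList(fromFirst, fromSecond), "f:a"));
        System.out.println(survivingDocFreq(Arrays.asList(fromSecond, fromFirst), "f:a"));
    }
}
```

In this model the summed boost is 1.75 in both cases, but the surviving df is 3 when the clause from the first blended query is seen first and 2 when the order is reversed. Whichever clause the iteration happens to serve first decides the df.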


was (Author: micpalmia):
I added a very simple test (with my very limited Lucene testing skills) that 
simply emulates example c) above and checks for the score of the top document. 
As there is no "right" score, I just check for one of the two possible scores 
and have the test fail on the other.

I'm having a hard time wrapping my head around what the right behavior should 
be in this case (and thus coming up with a more sensible test and fix).



