[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores
[ https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927 ]

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:15 PM:
------------------------------------------------------------------

I was actually just looking at a [user report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, which in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs (while [LUCENE-8840|https://issues.apache.org/jira/browse/LUCENE-8840] is not fixed)?

was (Author: micpalmia):

I was actually just looking at a [user report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, which in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs?

> Blended queries with boolean rewrite can result in inconsistent scores
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-9269
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9269
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.4
>            Reporter: Michele Palmia
>            Priority: Minor
>         Attachments: LUCENE-9269-test.patch
>
>
> If two blended queries are SHOULD clauses of a boolean query and are built so that
> * some of their terms are the same
> * their rewrite method is BlendedTermQuery.BOOLEAN_REWRITE
> the docFreq for the overlapping terms used for scoring is picked as follows:
> # if the overlapping terms are not boosted, the df of the term in the first blended query is used
> # if any of the overlapping terms is boosted, the df is picked at (what looks like) random.
> A few examples using a field with 2 terms: f:a (df: 2), and f:b (df: 3).
> {code:java}
> a)
> Blended(f:a f:b)   Blended(f:a)
>      df: 3            df: 2
> gets rewritten to:
> (f:a)^2.0   (f:b)
>   df: 3     df: 2
> b)
> Blended(f:a)   Blended(f:a f:b)
>    df: 2            df: 3
> gets rewritten to:
> (f:a)^2.0   (f:b)
>   df: 2     df: 2
> c)
> Blended(f:a f:b^0.66)   Blended(f:a^0.75)
>        df: 3               df: 2
> gets rewritten to:
> (f:a)^1.75   (f:b)^0.66
>   df: ?        df: 2
> {code}
> with ? either 2 or 3, depending on the run.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
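
Rule 1 and examples a) and b) from the issue description can be mimicked with a small toy model. This is only a sketch under stated assumptions, not Lucene code: `Clause` and `merge` are hypothetical names, and the model assumes that duplicate clauses on the same term have their boosts summed while the df of the first clause encountered is kept. Only the overlapping term f:a is tracked, since it is the only term that gets merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Lucene code. Clause and merge are hypothetical names.
final class BlendedDfSketch {

    static final class Clause {
        final String term;
        final float boost;
        final int df; // docFreq the enclosing blended query assigned to this term

        Clause(String term, float boost, int df) {
            this.term = term;
            this.boost = boost;
            this.df = df;
        }

        @Override
        public String toString() {
            return "(" + term + ")^" + boost + " df:" + df;
        }
    }

    // Merge clauses on the same term in the order given:
    // boosts add up, and the df of the first occurrence wins (assumption of this model).
    static List<Clause> merge(List<Clause> clauses) {
        Map<String, Clause> merged = new LinkedHashMap<String, Clause>();
        for (Clause c : clauses) {
            Clause seen = merged.get(c.term);
            merged.put(c.term,
                seen == null ? c : new Clause(c.term, seen.boost + c.boost, seen.df));
        }
        return new ArrayList<Clause>(merged.values());
    }

    public static void main(String[] args) {
        // a) f:a from Blended(f:a f:b) [df 3] comes before f:a from Blended(f:a) [df 2]
        System.out.println(merge(Arrays.asList(
            new Clause("f:a", 1f, 3), new Clause("f:a", 1f, 2))));
        // b) same clauses, opposite order
        System.out.println(merge(Arrays.asList(
            new Clause("f:a", 1f, 2), new Clause("f:a", 1f, 3))));
    }
}
```

In this model the merged boost is 2.0 in both cases, but the surviving df is 3 for order a) and 2 for order b), which is the order dependence the description reports.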
[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores
[ https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927 ]

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:07 PM:
------------------------------------------------------------------

I was actually just looking at a [user report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/%3CCALyzSEn%2BQFoT3MpNYkxw-dEK9jc59mSTvXqccuUVMMDAgOMMmA%40mail.gmail.com%3E] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, that in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs?

was (Author: micpalmia):

I was actually just looking at a [user report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, that in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs?
[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores
[ https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055927#comment-17055927 ]

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 1:05 PM:
------------------------------------------------------------------

I was actually just looking at a [user report|https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, that in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs?

was (Author: micpalmia):

I was actually just looking at a [user report|[https://mail-archives.apache.org/mod_mbox/lucene-dev/202003.mbox/browser]] that came to lucene-dev and looked interesting. In their use case, they were using fuzzy queries, that in turn generate blended queries that are affected by this issue.

Maybe users of BlendedQuery/FuzzyQuery should be able to find some form of warning in the docs?
[jira] [Comment Edited] (LUCENE-9269) Blended queries with boolean rewrite can result in inconsistent scores
[ https://issues.apache.org/jira/browse/LUCENE-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055891#comment-17055891 ]

Michele Palmia edited comment on LUCENE-9269 at 3/10/20, 12:57 PM:
-------------------------------------------------------------------

I added a very simple test (with my very limited Lucene testing skills) that emulates example c) above and checks the score of the top document. As there is no "right" score, I just check for one of the two possible scores and have the test fail on the other.

I'm having a hard time wrapping my head around what the right behavior should be in this case (and thus coming up with a more sensible test and fix).

In case that's useful, I should probably add that the randomness in the scoring behavior is due to the HashMap underlying MultiSet: when SHOULD clauses are processed for deduplication, they're served in an arbitrary order (see [BooleanQuery.java:370|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java#L370]).

was (Author: micpalmia):

I added a very simple test (with my very limited Lucene testing skills) that simply emulates example c) above and checks for the score of the top document. As there is no "right" score, I just check for one of the two possible scores and have the test fail on the other.

I'm having a hard time wrapping my head around what the right behavior should be in this case (and thus coming up with a more sensible test and fix).
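
The point about iteration order can be illustrated with plain JDK maps. What follows is a toy sketch, not Lucene's MultiSet or BooleanQuery code: `ClauseOrderSketch`, `fill` and `dfOfFirstServed` are hypothetical names, and the model assumes that the df which survives deduplication is the df of whichever clause the map serves first. `java.util.HashMap` makes no guarantee about iteration order, while `java.util.LinkedHashMap` iterates in insertion order.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only: NOT Lucene's MultiSet or BooleanQuery code.
final class ClauseOrderSketch {

    // Adds the two overlapping clauses of example c) in query order.
    // Key: clause text, value: df assigned by the blended query it came from.
    static Map<String, Integer> fill(Map<String, Integer> clauses) {
        clauses.put("(f:a)^1.0", 3);  // from Blended(f:a f:b^0.66)
        clauses.put("(f:a)^0.75", 2); // from Blended(f:a^0.75)
        return clauses;
    }

    // Assumption of this model: the first clause the map serves decides the df.
    static int dfOfFirstServed(Map<String, Integer> clauses) {
        return clauses.values().iterator().next();
    }

    public static void main(String[] args) {
        // Insertion order is preserved, so the first clause written is served first.
        System.out.println("LinkedHashMap: df "
            + dfOfFirstServed(fill(new LinkedHashMap<String, Integer>())));
        // Order depends on hashing and table layout, not on the order of the puts.
        System.out.println("HashMap:       df "
            + dfOfFirstServed(fill(new HashMap<String, Integer>())));
    }
}
```

With the insertion-ordered map the result in this model is always the df of the first clause written (3); with the hash-based map it is whichever clause happens to come first in the table, so the caller cannot rely on either value.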