[ 
https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-10236:
-------------------------------
    Description: 
This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery uses a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs because it is [built by looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 which results in duplicated norms being added during the score calculation in 
MultiNormsLeafSimScorer. 

For example, take a CombinedFieldsQuery with two fields and two terms that 
all match a particular doc:
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", 1.0f)
        .addField("field2", 1.0f)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}
I would expect the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic uses the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)
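
The double counting above can be reproduced without Lucene. The following is only a minimal sketch under stated assumptions: the class NormSumSketch, its method names, and the norm values 10 and 20 are hypothetical and made up for illustration, and the real MultiNormsLeafSimScorer also applies field weights and decodes norms, which is omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
public class NormSumSketch {

  // Mimics a list built while looping over (term, field) pairs:
  // every field is added once per term, so fields repeat.
  public static List<String> fieldsPerTerm(List<String> fields, List<String> terms) {
    List<String> result = new ArrayList<>();
    for (String term : terms) {
      for (String field : fields) {
        result.add(field);
      }
    }
    return result;
  }

  // Mimics taking the values of a field-to-weight map:
  // every field appears exactly once.
  public static List<String> distinctFields(List<String> fields) {
    return new ArrayList<>(new LinkedHashSet<>(fields));
  }

  // Sums the (made-up) norm of each listed field, duplicates included.
  public static long sumNorms(List<String> fields, Map<String, Long> normByField) {
    long sum = 0;
    for (String field : fields) {
      sum += normByField.get(field);
    }
    return sum;
  }

  public static void main(String[] args) {
    List<String> fields = List.of("field1", "field2");
    List<String> terms = List.of("foo", "zoo");
    Map<String, Long> norms = Map.of("field1", 10L, "field2", 20L); // assumed values

    // Each field is listed once per term, so each norm is counted twice.
    System.out.println(sumNorms(fieldsPerTerm(fields, terms), norms)); // prints 60
    // Each field is listed once, so each norm is counted once.
    System.out.println(sumNorms(distinctFields(fields), norms)); // prints 30
  }
}
```

With two terms, the per-term list yields twice the expected norm sum, and the gap grows with the number of terms in the query.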

 

In addition, this differs from how MultiNormsLeafSimScorer is constructed in 
the CombinedFieldsQuery explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and therefore does not contain duplicated field-weight pairs. 


> CombinedFieldsQuery to use fieldAndWeights.values() when constructing 
> MultiNormsLeafSimScorer for scoring
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10236
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10236
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/sandbox
>            Reporter: Zach Chen
>            Assignee: Zach Chen
>            Priority: Minor


