[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:34 PM:
-------------------------------------------------------------

Steps to reproduce inconsistent {{mm}} behavior caused by the shift from a 
term-centric to a field-centric query. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that the default schema contains a synonym rule for GB that will be applied 
in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws]}}
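
For anyone scripting the comparison, the same request can be issued with SolrJ 
(again just a sketch, assuming the collection is named "test" as in the URL above):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MmRepro {
  public static void main(String[] args) throws Exception {
    try (Http2SolrClient client =
             new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery query = new SolrQuery("XY GB");
      query.set("defType", "edismax");
      query.set("mm", "100%");
      // switch to "field1_ws field2_ws field2_txt" for the second experiment below
      query.set("qf", "field1_ws field2_ws");
      QueryResponse rsp = client.query("test", query);
      rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    }
  }
}
{code}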

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax generates a term-centric query, which allows each of the terms "XY" and 
"GB" to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, so its analysis chain emits a different number of 
tokens than the ws fields do. That mismatch causes edismax to generate a 
field-centric query, which requires that the terms "XY" and "GB" both match in 
_one_ of the provided qf fields. It is counterintuitive that widening the search 
to include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH document 3 and document 4 are returned for both 
versions of qf – there is no change when we add field2_txt to qf. That is 
because there is no synonym rule for GC, so even though the ws and txt fields 
have "incompatible" analysis chains, they happen to generate the same number of 
tokens for this particular query, and edismax is able to stay with the 
term-centric approach.
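
A quick way to see the shift directly is to add {{debugQuery=true}} to both 
requests and compare the {{parsedquery}} entries in the debug output: with only 
the ws fields in qf, the top-level clauses group the fields per term, whereas 
with field2_txt added they group the terms per field, matching the 
field-centric shape described in the issue below.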

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


> eDismax should use startOffset when converting per-field to per-term queries
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-16594
>                 URL: https://issues.apache.org/jira/browse/SOLR-16594
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>            Reporter: Rudi Seitz
>            Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> the Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way for edismax to see this connection. When 
> {{sow=true}}, edismax passes each whitespace-separated term through each 
> analysis chain separately, and therefore edismax "knows" that the output 
> tokens from any given analysis chain are all derived from the single input 
> term that was passed into that chain. However, when {{sow=false}}, 
> edismax passes the entire multi-term query through each analysis chain as a 
> whole, resulting in multiple output tokens that are not "connected" to their 
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by 
> first generating a boolean query for each field, and then checking whether 
> all of these per-field queries have the same structure. The structure will 
> generally be uniform if each analysis chain emits the same number of tokens 
> for the given input. If one chain has a synonym filter and another doesn’t, 
> this uniformity may depend on whether a synonym rule happened to match a term 
> in the user's input. 
> Assuming the per-field boolean queries _do_ have the same structure, edismax 
> reorganizes them into a new boolean query. The new query contains a dismax 
> for each clause position in the original queries. If the original queries are 
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}}, we can see they have two clauses 
> each, so we would get a dismax containing all the first-position clauses, 
> {{(f1:foo f2:foo)}}, and another dismax containing all the second-position 
> clauses, {{(f1:bar f2:bar)}}.
> We can see that edismax is using clause position as a heuristic to reorganize 
> the per-field boolean queries into per-term ones, even though it doesn't know 
> for sure which clauses inside those per-field boolean queries are related to 
> which input terms. We propose that a better way of reorganizing the per-field 
> boolean queries is to create a dismax for each distinct startOffset seen 
> among the tokens in the token streams emitted by each field analyzer. The 
> startOffset of a token (rather, a PackedTokenAttributeImpl) is "the position 
> of the first character corresponding to this token in the source text".
> We propose that startOffset is a reasonable way of matching output tokens up 
> with the input terms that gave rise to them. For example, if we pass "foo 
> bar" through an ngram analysis chain we see that the foo-related tokens all 
> have startOffset=0 while the bar-related tokens all have startOffset=4. 
> Likewise, tokens that are generated via synonym expansion have a startOffset 
> that points to the beginning of the matching input term. For example, if the 
> query "GB" generates "GB gib gigabyte gigabytes" via synonym expansion, all 
> of those four tokens would have startOffset=0.
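> As a standalone illustration of this (not part of edismax itself), the sketch 
> below uses lucene-analysis-common to run "foo bar" through a 
> whitespace-plus-edge-ngram chain and print each token's startOffset; the exact 
> tokenizer/filter choice is only an assumption for the demo:
> {code:java}
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>
> public class StartOffsetDemo {
>   public static void main(String[] args) throws Exception {
>     Analyzer ngramAnalyzer = new Analyzer() {
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer source = new WhitespaceTokenizer();
>         TokenStream ngrams = new EdgeNGramTokenFilter(source, 1, 3, false);
>         return new TokenStreamComponents(source, ngrams);
>       }
>     };
>     try (TokenStream ts = ngramAnalyzer.tokenStream("f2", "foo bar")) {
>       CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>       OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
>       ts.reset();
>       while (ts.incrementToken()) {
>         // "foo" and its ngrams report startOffset=0; "bar" and its ngrams report 4
>         System.out.println(term.toString() + " startOffset=" + offsets.startOffset());
>       }
>       ts.end();
>     }
>   }
> }
> {code}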
> Here's an example of how the proposed edismax logic would work. Let's say a 
> user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
> standard text analysis chain while f2 generates ngrams. We would get 
> field-centric queries {{(f1:foo f1:bar)}} and {{(f2:f f2:fo f2:foo f2:b 
> f2:ba f2:bar)}}. Edismax's "all same query structure" check would fail 
> here, but if we look for the unique startOffsets seen among all the tokens we 
> would find offsets 0 and 4. We could then generate one clause for all the 
> startOffset=0 tokens, {{(f1:foo f2:f f2:fo f2:foo)}}, and another for all the 
> startOffset=4 tokens, {{(f1:bar f2:b f2:ba f2:bar)}}. This would 
> effectively give us a "term-centric" query with consistent mm and scoring 
> semantics, even though the analysis chains are not "compatible."
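> For concreteness, here is a rough sketch of the proposed regrouping. This is 
> not the actual edismax code; FieldToken is a hypothetical holder for the field 
> name, term text, and startOffset that would have to be captured during analysis:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> import java.util.TreeMap;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanClause;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.DisjunctionMaxQuery;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
>
> public class StartOffsetGrouping {
>
>   /** Hypothetical holder for one token emitted by one field's analysis chain. */
>   public static class FieldToken {
>     final String field;
>     final String text;
>     final int startOffset;
>     public FieldToken(String field, String text, int startOffset) {
>       this.field = field;
>       this.text = text;
>       this.startOffset = startOffset;
>     }
>   }
>
>   // Build one dismax per distinct startOffset; mm would then apply to the SHOULD
>   // clauses of the top-level boolean query, i.e. one clause per original query term.
>   public static Query termCentric(List<FieldToken> tokens, float tiebreaker) {
>     Map<Integer, List<Query>> byOffset = new TreeMap<>();
>     for (FieldToken t : tokens) {
>       byOffset.computeIfAbsent(t.startOffset, k -> new ArrayList<>())
>               .add(new TermQuery(new Term(t.field, t.text)));
>     }
>     BooleanQuery.Builder top = new BooleanQuery.Builder();
>     for (List<Query> group : byOffset.values()) {
>       top.add(new DisjunctionMaxQuery(group, tiebreaker), BooleanClause.Occur.SHOULD);
>     }
>     return top.build();
>   }
> }
> {code}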
> As mentioned, there would be significant plumbing needed to make startOffsets 
> available to edismax in the code where the per-field queries are converted 
> into per-term queries. Modifications would possibly be needed in both the 
> Solr and Lucene repos. This ticket is logged in hopes of gathering feedback 
> about whether this is a worthwhile/viable approach to pursue further.
>  
> Related tickets:
> https://issues.apache.org/jira/browse/SOLR-12779
> https://issues.apache.org/jira/browse/SOLR-15407
>  
> Related blog entries:
> [https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities]
> [https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
