[jira] [Updated] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since a Term does not hold the startOffset of the Token from which it was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}
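To see why the two shapes give different {{mm}} semantics, here is a small self-contained sketch (plain Java, not Solr code; the document/field representation is invented for illustration). Roughly, the field-centric shape applies {{mm}} within each field's clause list, while the term-centric shape applies it across terms. With {{mm=2}}, a document containing "foo" only in f1 and "bar" only in f2 matches the term-centric query but not the field-centric one:

```java
import java.util.*;

public class MmSemanticsDemo {
    // A document is modeled as a map from field name to the set of terms it contains.
    static boolean termMatches(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term);
    }

    // Field-centric: one clause list per field; mm applies within each field,
    // so some single field must match at least mm of the query terms on its own.
    static boolean fieldCentric(Map<String, Set<String>> doc,
                                List<String> fields, List<String> terms, int mm) {
        for (String f : fields) {
            long hits = terms.stream().filter(t -> termMatches(doc, f, t)).count();
            if (hits >= mm) return true;
        }
        return false;
    }

    // Term-centric: one dismax clause per term; mm counts how many distinct
    // query terms match in ANY of the fields.
    static boolean termCentric(Map<String, Set<String>> doc,
                               List<String> fields, List<String> terms, int mm) {
        long hits = terms.stream()
                .filter(t -> fields.stream().anyMatch(f -> termMatches(doc, f, t)))
                .count();
        return hits >= mm;
    }

    public static void main(String[] args) {
        // "foo" appears only in f1, "bar" only in f2.
        Map<String, Set<String>> doc = Map.of("f1", Set.of("foo"), "f2", Set.of("bar"));
        List<String> fields = List.of("f1", "f2");
        List<String> terms = List.of("foo", "bar");

        System.out.println(fieldCentric(doc, fields, terms, 2)); // false
        System.out.println(termCentric(doc, fields, terms, 2));  // true
    }
}
```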

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{{}sow=true{}}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{{}sow=false{}}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.
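To make the offsets concrete, here is a minimal self-contained sketch (plain Java, not Lucene's actual Analyzer/TokenStream API) of a whitespace-plus-edge-ngram chain in which each output ngram is tagged with the startOffset of the word it was derived from:

```java
import java.util.*;

public class NGramOffsets {
    // A minimal stand-in for a Lucene token: the term text plus the startOffset
    // (position of the source word's first character in the original query string).
    record Token(String text, int startOffset) {}

    // Simulates a whitespace tokenizer followed by an edge-ngram filter:
    // every ngram inherits the startOffset of the word it came from.
    static List<Token> edgeNGrams(String query, int maxGram) {
        List<Token> out = new ArrayList<>();
        int offset = 0;
        for (String word : query.split("\\s+")) {
            int start = query.indexOf(word, offset);
            for (int len = 1; len <= Math.min(maxGram, word.length()); len++) {
                out.add(new Token(word.substring(0, len), start));
            }
            offset = start + word.length();
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : edgeNGrams("foo bar", 3)) {
            System.out.println(t.text() + " @ " + t.startOffset());
        }
        // f @ 0, fo @ 0, foo @ 0, b @ 4, ba @ 4, bar @ 4
    }
}
```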

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}}, we can see they have two clauses 
each, so we would get a dismax containing all the first-position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second-position 
clauses {{{}(f1:bar f2:bar){}}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (or rather, of its PackedTokenAttributeImpl) is "the position of the 
first character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.
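The proposed reorganization can be sketched in plain Java (the {{FieldTerm}} record and the string rendering below are illustrative stand-ins, not Solr code; a real implementation would read each token's OffsetAttribute from the analysis chain):

```java
import java.util.*;

public class GroupByStartOffset {
    // Hypothetical per-field token: field name, term text, and the startOffset
    // carried over from the token stream that produced it.
    record FieldTerm(String field, String term, int startOffset) {}

    // Group all per-field terms by startOffset; each group becomes one dismax.
    // A TreeMap keeps the groups in query order (startOffset 0 first).
    static List<List<FieldTerm>> groupByOffset(List<FieldTerm> terms) {
        TreeMap<Integer, List<FieldTerm>> groups = new TreeMap<>();
        for (FieldTerm t : terms) {
            groups.computeIfAbsent(t.startOffset(), k -> new ArrayList<>()).add(t);
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        // Tokens for "foo bar" across f1 (standard text) and f2 (ngrams).
        List<FieldTerm> terms = List.of(
            new FieldTerm("f1", "foo", 0), new FieldTerm("f1", "bar", 4),
            new FieldTerm("f2", "f", 0), new FieldTerm("f2", "fo", 0),
            new FieldTerm("f2", "foo", 0),
            new FieldTerm("f2", "b", 4), new FieldTerm("f2", "ba", 4),
            new FieldTerm("f2", "bar", 4));

        for (List<FieldTerm> group : groupByOffset(terms)) {
            StringJoiner dismax = new StringJoiner(" | ", "(", ")");
            for (FieldTerm t : group) dismax.add(t.field() + ":" + t.term());
            System.out.println(dismax);
        }
        // (f1:foo | f2:f | f2:fo | f2:foo)
        // (f1:bar | f2:b | f2:ba | f2:bar)
    }
}
```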

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and 
{{{}(f2:f f2:fo f2:foo f2:b f2:ba f2:bar){}}}. Grouping the clauses by the 
startOffsets of their source tokens, we would get one dismax for startOffset 0, 
{{(f1:foo f2:f f2:fo f2:foo)}}, and another for startOffset 4, 
{{{}(f1:bar f2:b f2:ba f2:bar){}}}.
