[jira] [Commented] (SOLR-17279) consolidate security.json constants in test code

2024-05-07 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844410#comment-17844410
 ] 

Rudi Seitz commented on SOLR-17279:
---

PR: https://github.com/apache/solr/pull/2445

> consolidate security.json constants in test code
> 
>
> Key: SOLR-17279
> URL: https://issues.apache.org/jira/browse/SOLR-17279
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Rudi Seitz
>Priority: Trivial
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Various unit tests declare a SECURITY_JSON constant. Many of these constants 
> are identical (likely introduced via copy/paste from other tests, resulting 
> in code duplication). This ticket is to consolidate the various SECURITY_JSON 
> constants across the tests in a central place.
> This point was raised during discussion of the fix for SOLR-12813. See 
> https://github.com/apache/solr/pull/2404#discussion_r1568012056
> In that discussion, it was agreed to address SECURITY_JSON consolidation in a 
> separate ticket -- this is that ticket. 
> Tagging [~epugh]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-17279) consolidate security.json constants in test code

2024-05-07 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844295#comment-17844295
 ] 

Rudi Seitz commented on SOLR-17279:
---

Will open a PR for this shortly.

> consolidate security.json constants in test code
> 
>
> Key: SOLR-17279
> URL: https://issues.apache.org/jira/browse/SOLR-17279
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Rudi Seitz
>Priority: Trivial
>
> Various unit tests declare a SECURITY_JSON constant. Many of these constants 
> are identical (likely introduced via copy/paste from other tests, resulting 
> in code duplication). This ticket is to consolidate the various SECURITY_JSON 
> constants across the tests in a central place.
> This point was raised during discussion of the fix for SOLR-12813. See 
> https://github.com/apache/solr/pull/2404#discussion_r1568012056
> In that discussion, it was agreed to address SECURITY_JSON consolidation in a 
> separate ticket -- this is that ticket. 
> Tagging [~epugh]






[jira] [Created] (SOLR-17279) consolidate security.json constants in test code

2024-05-07 Thread Rudi Seitz (Jira)
Rudi Seitz created SOLR-17279:
-

 Summary: consolidate security.json constants in test code
 Key: SOLR-17279
 URL: https://issues.apache.org/jira/browse/SOLR-17279
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Tests
Reporter: Rudi Seitz


Various unit tests declare a SECURITY_JSON constant. Many of these constants 
are identical (likely introduced via copy/paste from other tests, resulting in 
code duplication). This ticket is to consolidate the various SECURITY_JSON 
constants across the tests in a central place.

This point was raised during discussion of the fix for SOLR-12813. See 
https://github.com/apache/solr/pull/2404#discussion_r1568012056
In that discussion, it was agreed to address SECURITY_JSON consolidation in a 
separate ticket -- this is that ticket. 

Tagging [~epugh]
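
The consolidation described above might look roughly like the following. This is a minimal sketch under stated assumptions: the class names `SharedSecurityJson`, `FirstAuthTest`, and `SecondAuthTest`, the constant name `SIMPLE`, and the JSON body are placeholders chosen for illustration, and the PR may organize things differently.

```java
// Hypothetical holder for security.json strings that tests would otherwise
// copy and paste. The class name, constant name, and JSON body are
// illustrative placeholders, not the contents of the actual PR.
final class SharedSecurityJson {
  private SharedSecurityJson() {}

  /** Placeholder configuration: basic auth enabled, unknown users blocked. */
  static final String SIMPLE =
      "{\n"
          + "  \"authentication\": {\n"
          + "    \"class\": \"solr.BasicAuthPlugin\",\n"
          + "    \"blockUnknown\": true\n"
          + "  }\n"
          + "}";
}

// Two hypothetical tests that reference the shared constant instead of each
// declaring its own private SECURITY_JSON copy.
final class FirstAuthTest {
  static String securityJson() {
    return SharedSecurityJson.SIMPLE;
  }
}

final class SecondAuthTest {
  static String securityJson() {
    return SharedSecurityJson.SIMPLE;
  }
}

final class SharedSecurityJsonDemo {
  public static void main(String[] args) {
    // Both tests now see the same string, so a change is made in one place.
    System.out.println(FirstAuthTest.securityJson().equals(SecondAuthTest.securityJson()));
  }
}
```

With a single definition, a change to the test security configuration is made once rather than in every test that previously carried its own copy.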







[jira] [Commented] (SOLR-12813) SolrCloud + 2 shards + subquery + auth = 401 Exception

2024-04-24 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840607#comment-17840607
 ] 

Rudi Seitz commented on SOLR-12813:
---

Yes, this issue is about BasicAuthPlugin, configured similarly to what is 
described in the reference guide 
[here|https://solr.apache.org/guide/solr/latest/deployment-guide/basic-authentication-plugin.html#enable-basic-authentication]

This ticket is basically saying that the transparent instrumentation of 
AuthenticationPlugin can break in some cases – specifically in the scenario of 
a subquery executed in a multi-shard environment.

So why does it break in this particular scenario and not elsewhere? I'll try to 
provide more detail later, but the basic idea is that the 
SubQueryAugmenterFactory generates _new_ queries that do not share all the 
state of the incoming request. And these queries are processed using an 
EmbeddedSolrServer that doesn't respect the way BasicAuthPlugin is trying to be 
transparently instrumented. My [PR|https://github.com/apache/solr/pull/2404] 
shows the specific places where these problems arise and how they can be fixed.

To quickly reproduce the problem described in this ticket, one can apply the 
changes I made to TestSubQueryTransformerDistrib so that basic auth is enabled. 
The modified test should fail against main unless the other changes in the PR, 
which fix the underlying issue, are also applied. 
https://github.com/apache/solr/commit/d2503ffd9a7cd58c4449c83ff940b63541fce251


 

> SolrCloud + 2 shards + subquery + auth = 401 Exception
> --
>
> Key: SOLR-12813
> URL: https://issues.apache.org/jira/browse/SOLR-12813
> Project: Solr
>  Issue Type: Bug
>  Components: security, SolrCloud
>Affects Versions: 6.4.1, 7.5, 8.11
>Reporter: Igor Fedoryn
>Priority: Major
> Attachments: screen1.png, screen2.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Environment: * Solr 6.4.1
>  * Zookeeper 3.4.6
>  * Java 1.8
> Run Zookeeper.
> Upload a simple configuration in which the Solr schema has fields for a 
> parent/child relationship.
> Run two Solr instances (2 nodes).
> Create the collection with 1 shard on each Solr node.
>  
> Add a parent document to one shard and a child document to the other shard.
> The response for:
>  * /select?q=ChildIdField:VALUE&fl=*,parents:[subquery]&parents.q={!term f=id v=$row.ParentIdsField}
> is correct.
>  
> After that, add Basic Authentication with some user for the collection.
> Restart Solr or reload the Solr collection.
> If the simple request /select?q=*:* succeeds with authorization on the Solr 
> server, then run the previous request with authorization and you get the 
> exception: "Solr HTTP error: Unauthorized (401)"
>  
> Screens in the attachment.






[jira] [Commented] (SOLR-12813) SolrCloud + 2 shards + subquery + auth = 401 Exception

2024-04-16 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837906#comment-17837906
 ] 

Rudi Seitz commented on SOLR-12813:
---

Here's a PR: https://github.com/apache/solr/pull/2404

> SolrCloud + 2 shards + subquery + auth = 401 Exception
> --
>
> Key: SOLR-12813
> URL: https://issues.apache.org/jira/browse/SOLR-12813
> Project: Solr
>  Issue Type: Bug
>  Components: security, SolrCloud
>Affects Versions: 6.4.1, 7.5, 8.11
>Reporter: Igor Fedoryn
>Priority: Major
> Attachments: screen1.png, screen2.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Environment: * Solr 6.4.1
>  * Zookeeper 3.4.6
>  * Java 1.8
> Run Zookeeper.
> Upload a simple configuration in which the Solr schema has fields for a 
> parent/child relationship.
> Run two Solr instances (2 nodes).
> Create the collection with 1 shard on each Solr node.
>  
> Add a parent document to one shard and a child document to the other shard.
> The response for:
>  * /select?q=ChildIdField:VALUE&fl=*,parents:[subquery]&parents.q={!term f=id v=$row.ParentIdsField}
> is correct.
>  
> After that, add Basic Authentication with some user for the collection.
> Restart Solr or reload the Solr collection.
> If the simple request /select?q=*:* succeeds with authorization on the Solr 
> server, then run the previous request with authorization and you get the 
> exception: "Solr HTTP error: Unauthorized (401)"
>  
> Screens in the attachment.






[jira] [Updated] (SOLR-16594) improve eDismax strategy for generating a term-centric query

2024-04-16 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}
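
To make the two shapes concrete, here is a small sketch that builds them as plain strings. It is illustrative only: `QueryShapeSketch` and its methods are hypothetical names and are not part of edismax, which works with Query objects rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: builds the two query shapes as plain strings.
// QueryShapeSketch and its methods are hypothetical names, not edismax code.
final class QueryShapeSketch {

  // Field-centric: one clause per field, each clause listing every term.
  static String fieldCentric(List<String> fields, List<String> terms) {
    List<String> clauses = new ArrayList<>();
    for (String field : fields) {
      List<String> parts = new ArrayList<>();
      for (String term : terms) {
        parts.add(field + ":" + term);
      }
      clauses.add("(" + String.join(" ", parts) + ")");
    }
    return String.join(" ", clauses);
  }

  // Term-centric: one clause per term, each clause testing every field.
  static String termCentric(List<String> fields, List<String> terms) {
    List<String> clauses = new ArrayList<>();
    for (String term : terms) {
      List<String> parts = new ArrayList<>();
      for (String field : fields) {
        parts.add(field + ":" + term);
      }
      clauses.add("(" + String.join(" ", parts) + ")");
    }
    return String.join(" ", clauses);
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("f1", "f2");
    List<String> terms = Arrays.asList("foo", "bar");
    System.out.println(fieldCentric(fields, terms)); // (f1:foo f1:bar) (f2:foo f2:bar)
    System.out.println(termCentric(fields, terms)); // (f1:foo f2:foo) (f1:bar f2:bar)
  }
}
```

The only difference between the two methods is which loop is on the outside, and that ordering determines what a single "clause" means to {{mm}}.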

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.
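
The regrouping by clause position can be sketched as follows. This is a simplified model, not the edismax implementation: `ClausePositionSketch` is a hypothetical name, clauses are plain strings, and "same structure" is reduced to "same number of clauses", which is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified model of the clause-position heuristic. Hypothetical names;
// "same structure" is approximated here by "same clause count".
final class ClausePositionSketch {

  // Input: field name -> ordered clause terms for that field.
  // Output: one group per clause position, or empty when the per-field
  // queries differ in clause count and cannot be aligned by position.
  static Optional<List<List<String>>> regroupByPosition(Map<String, List<String>> perField) {
    int size = -1;
    for (List<String> clauses : perField.values()) {
      if (size == -1) {
        size = clauses.size();
      } else if (clauses.size() != size) {
        return Optional.empty();
      }
    }
    List<List<String>> groups = new ArrayList<>();
    for (int pos = 0; pos < size; pos++) {
      List<String> group = new ArrayList<>();
      for (Map.Entry<String, List<String>> e : perField.entrySet()) {
        group.add(e.getKey() + ":" + e.getValue().get(pos));
      }
      groups.add(group);
    }
    return Optional.of(groups);
  }

  public static void main(String[] args) {
    Map<String, List<String>> uniform = new LinkedHashMap<>();
    uniform.put("f1", Arrays.asList("foo", "bar"));
    uniform.put("f2", Arrays.asList("foo", "bar"));
    System.out.println(regroupByPosition(uniform).isPresent()); // true

    // An ngram field emits six tokens, so positions no longer line up.
    Map<String, List<String>> mixed = new LinkedHashMap<>();
    mixed.put("f1", Arrays.asList("foo", "bar"));
    mixed.put("f2", Arrays.asList("f", "fo", "foo", "b", "ba", "bar"));
    System.out.println(regroupByPosition(mixed).isPresent()); // false
  }
}
```

The second case in `main` is the situation in which a position-based approach has nothing to align on.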

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.
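
As a rough sketch of the grouping idea, the following models tokens as simple objects carrying a field name, text, and startOffset, and buckets them by offset. `StartOffsetGroupingSketch` and `ToyToken` are hypothetical names; real code would read offsets from the analyzers' token streams, and the offsets below are the ones stated above for "foo bar".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of grouping analyzer output by startOffset. Hypothetical names;
// real code would obtain offsets from each field's token stream.
final class StartOffsetGroupingSketch {

  static final class ToyToken {
    final String field;
    final String text;
    final int startOffset;

    ToyToken(String field, String text, int startOffset) {
      this.field = field;
      this.text = text;
      this.startOffset = startOffset;
    }
  }

  // One bucket per distinct startOffset, ordered by offset.
  static Map<Integer, List<String>> groupByStartOffset(List<ToyToken> tokens) {
    Map<Integer, List<String>> groups = new TreeMap<>();
    for (ToyToken t : tokens) {
      groups.computeIfAbsent(t.startOffset, k -> new ArrayList<>()).add(t.field + ":" + t.text);
    }
    return groups;
  }

  // Tokens for "foo bar": f1 uses a standard chain, f2 emits ngrams.
  static List<ToyToken> sampleTokens() {
    return Arrays.asList(
        new ToyToken("f1", "foo", 0),
        new ToyToken("f1", "bar", 4),
        new ToyToken("f2", "f", 0),
        new ToyToken("f2", "fo", 0),
        new ToyToken("f2", "foo", 0),
        new ToyToken("f2", "b", 4),
        new ToyToken("f2", "ba", 4),
        new ToyToken("f2", "bar", 4));
  }

  public static void main(String[] args) {
    // Prints {0=[f1:foo, f2:f, f2:fo, f2:foo], 4=[f1:bar, f2:b, f2:ba, f2:bar]}
    System.out.println(groupByStartOffset(sampleTokens()));
  }
}
```

Each bucket corresponds to one input term even though the two fields emit different numbers of tokens, which is the property a position-based regrouping lacks.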

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and 

[jira] [Commented] (SOLR-12813) SolrCloud + 2 shards + subquery + auth = 401 Exception

2024-04-16 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837803#comment-17837803
 ] 

Rudi Seitz commented on SOLR-12813:
---

I have begun implementing a fix here: 
[https://github.com/rseitz/solr/commit/c51f038f33b21411ce5c01ccf6d9f4d17690d82b]

I found two separate places where credentials are lost. First, the 
SubQueryAugmenterFactory never sets credentials on the subqueries that it 
generates. Second, when a subquery is handled by EmbeddedSolrServer, the query 
goes through various transformations that would drop credentials if they had 
been present in the first place. The code I'm sharing here fixes both issues 
and I've tested it manually with a collection with 2 shards in a 2-node cluster. 
The fix only works with forwardCredentials=true.
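
The first of these two problems can be illustrated with a toy model. This is not Solr code: `ToyRequest`, `deriveSubquery`, and the `copyCredentials` flag are hypothetical names used only to show why a freshly built request carries no credentials unless they are copied explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ToyRequest and deriveSubquery are hypothetical names and
// do not correspond to Solr's request classes or to the code in the fix.
final class ToyRequest {
  final String query;
  final Map<String, String> headers = new HashMap<>();

  ToyRequest(String query) {
    this.query = query;
  }
}

final class SubqueryCredentialsSketch {

  // Builds a new request for a subquery. A freshly built request shares no
  // state with its parent, so credentials survive only if copied explicitly.
  static ToyRequest deriveSubquery(ToyRequest parent, String subquery, boolean copyCredentials) {
    ToyRequest child = new ToyRequest(subquery);
    String auth = parent.headers.get("Authorization");
    if (copyCredentials && auth != null) {
      child.headers.put("Authorization", auth);
    }
    return child;
  }

  public static void main(String[] args) {
    ToyRequest parent = new ToyRequest("ChildIdField:VALUE");
    parent.headers.put("Authorization", "Basic <placeholder>");

    String sub = "{!term f=id v=$row.ParentIdsField}";
    ToyRequest dropped = deriveSubquery(parent, sub, false);
    ToyRequest kept = deriveSubquery(parent, sub, true);

    System.out.println("without copy: " + dropped.headers.containsKey("Authorization"));
    System.out.println("with copy: " + kept.headers.containsKey("Authorization"));
  }
}
```

In this model, a node receiving the request built without the copy would see no credentials at all, which is consistent with the 401 reported in this ticket.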

I am working on writing a unit test and creating a PR. In the meantime, I'm 
eager for any feedback on the proposed changes.

> SolrCloud + 2 shards + subquery + auth = 401 Exception
> --
>
> Key: SOLR-12813
> URL: https://issues.apache.org/jira/browse/SOLR-12813
> Project: Solr
>  Issue Type: Bug
>  Components: security, SolrCloud
>Affects Versions: 6.4.1, 7.5, 8.11
>Reporter: Igor Fedoryn
>Priority: Major
> Attachments: screen1.png, screen2.png
>
>
> Environment: * Solr 6.4.1
>  * Zookeeper 3.4.6
>  * Java 1.8
> Run Zookeeper.
> Upload a simple configuration in which the Solr schema has fields for a 
> parent/child relationship.
> Run two Solr instances (2 nodes).
> Create the collection with 1 shard on each Solr node.
>  
> Add a parent document to one shard and a child document to the other shard.
> The response for:
>  * /select?q=ChildIdField:VALUE&fl=*,parents:[subquery]&parents.q={!term f=id v=$row.ParentIdsField}
> is correct.
>  
> After that, add Basic Authentication with some user for the collection.
> Restart Solr or reload the Solr collection.
> If the simple request /select?q=*:* succeeds with authorization on the Solr 
> server, then run the previous request with authorization and you get the 
> exception: "Solr HTTP error: Unauthorized (401)"
>  
> Screens in the attachment.






[jira] [Comment Edited] (SOLR-16594) improve eDismax strategy for generating a term-centric query

2023-03-15 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650967#comment-17650967
 ] 

Rudi Seitz edited comment on SOLR-16594 at 3/15/23 4:23 PM:


This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Add an int startOffset to org.apache.lucene.index.Term. Alternatively, 
create a TermWithOffset subclass of Term
 # Update org.apache.lucene.util.QueryBuilder so that 
createFieldQuery() returns a Query that contains Terms with the startOffset 
properly set. This is the place where we iterate through the token stream and 
have access to the offsets so we can store them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"{color:#808080}Make a dismax query for each clause position in the boolean 
per-field queries"{color}

 

{color:#808080}UPDATE: 3/15/2023{color}

{color:#808080}Instead of updating the Lucene Term class I found it was 
possible to store the startOffset on the Query objects generated during parsing. 
This eliminates storage overhead and allows the changes to be made entirely 
inside the Solr codebase.{color}


was (Author: JIRAUSER297477):
This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Add an int startOffset to org.apache.lucene.index.Term. Alternatively, 
create a TermWithOffset subclass of Term
 # Update org.apache.lucene.util.QueryBuilder so that 
createFieldQuery() returns a Query that contains Terms with the startOffset 
properly set. This is the place where we iterate through the token stream and 
have access to the offsets so we can store them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"{color:#808080}Make a dismax query for each clause position in the boolean 
per-field queries"{color}

> improve eDismax strategy for generating a term-centric query
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When parsing a multi-term query that spans multiple fields, edismax attempts 
> to generate a term-centric query structure, but it sometimes switches from a 
> "term-centric" to a "field-centric" approach. This 
> creates inconsistent semantics for the {{mm}} or "min should match" parameter 
> and may have an impact on scoring. The goal of this ticket is to improve the 
> approach that edismax uses for generating term-centric queries so that 
> edismax would less frequently "give up" and resort to the field-centric 
> approach. Specifically, we propose that edismax should create a dismax query 
> for each distinct startOffset found among the tokens emitted by the field 
> analyzers. Since the relevant code in edismax works with Query objects that 
> contain Terms, and since Terms do not hold the startOffset of the Token from 
> which Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way for edismax to see this connection. When 
> {{sow=true}}, edismax passes each whitespace-separated term through each 
> analysis chain separately, and therefore edismax "knows" that the output 
> tokens from any given analysis chain are all derived from the single input 
> term that was passed into that chain. However, 

[jira] [Updated] (SOLR-16594) improve eDismax strategy for generating a term-centric query

2023-03-15 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax attempts to 
generate a term-centric query structure, but it sometimes switches from a 
"term-centric" to a "field-centric" approach. This 
creates inconsistent semantics for the {{mm}} or "min should match" parameter 
and may have an impact on scoring. The goal of this ticket is to improve the 
approach that edismax uses for generating term-centric queries so that edismax 
would less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would 

[jira] [Updated] (SOLR-16594) improve eDismax strategy for generating a term-centric query

2023-03-15 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query, edismax attempts to use a term-centric 
approach. The desired query structure should have one clause for each term. 
Each term-centric clause should test the given term against each qf field. 
However, when sow=false and the field analyzers generate differing numbers of 
tokens, edismax gives up on the term-centric approach and reverts to a 
field-centric approach. This "flip" in parsing strategies is difficult to 
predict as it can depend on query-time considerations like whether or not a 
synonym rule was invoked. The "flip" creates inconsistent semantics for the 
{{mm}} or "min should match" parameter and may have an impact on scoring.

The goal of this ticket is to improve the approach that edismax uses for 
generating term-centric queries so that edismax would less frequently "give up" 
and resort to the field-centric approach. Specifically, we propose that edismax 
should generate each term-centric clause by considering the startOffsets of the 
tokens emitted by the field analyzers. This new strategy for creating a 
term-centric query could be applied in a broader range of cases than edismax's 
current strategy which requires that all field analyzers generate the same 
number of tokens.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates 

[jira] [Updated] (SOLR-16594) improve eDismax strategy for generating a term-centric query

2023-03-15 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Summary: improve eDismax strategy for generating a term-centric query  
(was: eDismax should use startOffset when converting per-field to per-term 
queries)

> improve eDismax strategy for generating a term-centric query
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way for edismax to see this connection. When 
> {{sow=true}}, edismax passes each whitespace-separated term through each 
> analysis chain separately, and therefore edismax "knows" that the output 
> tokens from any given analysis chain are all derived from the single input 
> term that was passed into that chain. However, when {{sow=false}}, 
> edismax passes the entire multi-term query through each analysis chain as a 
> whole, resulting in multiple output tokens that are not "connected" to their 
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by 
> first generating a boolean query for each field, and then checking whether 
> all of these per-field queries have the same structure. The structure will 
> generally be uniform if each analysis chain emits the same number of tokens 
> for the given input. If one chain has a synonym filter and another doesn’t, 
> this uniformity may depend on whether a synonym rule happened to match a term 
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax 
> reorganizes them into a new boolean query. The new query contains a dismax 
> for each clause position in the original queries. If the original queries are 
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
> each, so we would get a dismax containing all the first position clauses 
> {{(f1:foo f2:foo)}} and another dismax containing all the second position 
> clauses {{(f1:bar f2:bar)}}.
> We can see that edismax is using clause position as a heuristic to reorganize 
> the per-field boolean queries into per-term ones, even though it doesn't know 
> for sure which clauses inside those per-field boolean queries are related to 
> which input terms. We propose that a better way of reorganizing the per-field 
> boolean queries is to create a dismax for each distinct startOffset seen 
> among the tokens in the token streams emitted by each field analyzer. The 
> startOffset of a token (rather, a PackedTokenAttributeImpl) is "the position 
> of the first character corresponding to this token in the source text".
> We propose that startOffset is a reasonable way of matching output tokens up 
> with the input terms that gave rise to them. For example, if we pass "foo 
> bar" through an ngram analysis chain we see that 

[jira] [Commented] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2023-02-28 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694726#comment-17694726
 ] 

Rudi Seitz commented on SOLR-16594:
---

Progress on prototyping the changes discussed above:

BEFORE CHANGE:

Working with Solr 9.1.1, I create a collection named "test" from the _default schema.

Upload this CSV with the four documents mentioned in the test case above:

{{id,field1_ws,field2_ws,field2_txt}}
{{1,XY GB,,}}
{{2,XY,GB,GB}}
{{3,XY GC,,}}
{{4,XY,GC,GC}}

Run the first query: 

[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws=true=true]

We see it is parsed like this:

+(DisjunctionMaxQuery((field2_ws:XY | field1_ws:XY)) 
DisjunctionMaxQuery((field1_ws:GB | field2_ws:GB)))~2

There are two dismax clauses here, and each is term-centric (one clause looks 
for term XY in both fields, the other looks for GB in both fields).

Now run the second query:

[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt=true=true]

We see it is parsed like this:

+DisjunctionMaxQuery((((field2_txt:xy Synonym(field2_txt:gb field2_txt:gib 
field2_txt:gigabyte field2_txt:gigabytes))~2) | ((field2_ws:XY field2_ws:GB)~2) 
| ((field1_ws:XY field1_ws:GB)~2)))

If we unpack this, we find three field-centric clauses in there (one clause 
looks for each term in field1_ws, another looks for each term in field2_ws, and 
so on).

AFTER CHANGE:

After my code changes (not yet in a state to share), the first query is parsed 
the same way as above, using a term-centric approach.

The second query is now ALSO parsed in a term-centric way as opposed to being 
field-centric:

+(DisjunctionMaxQuery((field2_ws:XY | field2_txt:xy | field1_ws:XY)) 
DisjunctionMaxQuery((Synonym(field2_txt:gb field2_txt:gib field2_txt:gigabyte 
field2_txt:gigabytes) | field2_ws:GB | field1_ws:GB)))~2

If we look in there, we see two top-level clauses, one for XY and one for GB. 
Docs 1 and 2 are both returned.

What is this solving? Well, it gives us a way to stay with a term-centric 
approach even when the analyzers in a multi-field query are "incompatible," 
meaning they generate differing numbers of tokens.
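
To see why the two parse shapes give different results under mm=100%, here is a small Python model of the four test documents. It is a deliberate simplification and not Solr's matching logic: each field is a set of indexed tokens, the _ws fields are case-sensitive, field2_txt is lowercased, and the query-side synonym expansion for GB is hard-coded. The helper names are made up for illustration.

```python
from typing import Dict, List, Set

# Indexed tokens per document and field (empty fields omitted).
DOCS: Dict[str, Dict[str, Set[str]]] = {
    "1": {"field1_ws": {"XY", "GB"}},
    "2": {"field1_ws": {"XY"}, "field2_ws": {"GB"}, "field2_txt": {"gb"}},
    "3": {"field1_ws": {"XY", "GC"}},
    "4": {"field1_ws": {"XY"}, "field2_ws": {"GC"}, "field2_txt": {"gc"}},
}

# Query-time analysis: for each field, the indexed tokens each input term accepts.
QUERY: Dict[str, Dict[str, Set[str]]] = {
    "field1_ws": {"XY": {"XY"}, "GB": {"GB"}},
    "field2_ws": {"XY": {"XY"}, "GB": {"GB"}},
    "field2_txt": {"XY": {"xy"}, "GB": {"gb", "gib", "gigabyte", "gigabytes"}},
}
TERMS: List[str] = ["XY", "GB"]


def clause_matches(doc: Dict[str, Set[str]], field: str, term: str) -> bool:
    return bool(doc.get(field, set()) & QUERY[field][term])


def field_centric(doc: Dict[str, Set[str]]) -> bool:
    """A single field must contain every term (mm=100% applied per field)."""
    return any(all(clause_matches(doc, f, t) for t in TERMS) for f in QUERY)


def term_centric(doc: Dict[str, Set[str]]) -> bool:
    """Every term must match in at least one field (mm=100% across terms)."""
    return all(any(clause_matches(doc, f, t) for f in QUERY) for t in TERMS)


field_hits = [doc_id for doc_id, doc in DOCS.items() if field_centric(doc)]
term_hits = [doc_id for doc_id, doc in DOCS.items() if term_centric(doc)]
print(field_hits)  # ['1']
print(term_hits)   # ['1', '2']
```

In this model the field-centric shape drops doc 2, because no single field holds both XY and GB, while the term-centric shape returns docs 1 and 2.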

The code follows the outline from an earlier comment. It requires some changes 
in the lucene codebase and some in solr. In the lucene codebase, Term is 
updated to store a startOffset and QueryBuilder is updated to properly set the 
startOffset when creating new Terms.

In the solr codebase, the changes are mostly limited to 
getAliasedMultiTermQuery in ExtendedDismaxQParser. We look through the list of 
multi-term field-centric Queries. We create a SortedMap that lets us look up an 
Integer startOffset and get all of the BooleanClauses from inside the original 
queries that "begin" with that startOffset. We make a simplifying assumption 
that any BooleanClause we're considering is either a TermQuery or a 
SynonymQuery so we can look at its Term and get a startOffset from there. 
Finally we make a new Query by iterating through all the startOffsets in our 
SortedMap. All the Queries for a given startOffset are added to a 
DisjunctionMaxQuery that then becomes a new BooleanClause of the top-level 
Query we're building.
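
The regrouping step described above can be sketched in Python as follows. This is only an approximation of the idea, not the prototype code. Clauses are modeled as (rendered clause, startOffset) pairs, a plain dict iterated in sorted key order stands in for the SortedMap, and the trailing ~N is rendered as the number of dismax clauses on the assumption that mm=100%. The function name regroup_by_start_offset is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Clause = Tuple[str, int]  # (rendered clause, startOffset of its term)


def regroup_by_start_offset(per_field_queries: List[List[Clause]]) -> str:
    """Collect clauses from all per-field queries by startOffset and
    render one DisjunctionMaxQuery per distinct offset."""
    by_offset: Dict[int, List[str]] = defaultdict(list)
    for clauses in per_field_queries:
        for rendered, start_offset in clauses:
            by_offset[start_offset].append(rendered)
    dismaxes = [
        "DisjunctionMaxQuery((" + " | ".join(by_offset[offset]) + "))"
        for offset in sorted(by_offset)  # iterate offsets in ascending order
    ]
    # Assumption: mm=100%, so every dismax clause is required.
    return "+(" + " ".join(dismaxes) + ")~" + str(len(dismaxes))


# Query "XY GB": XY starts at offset 0, GB at offset 3.
synonym = ("Synonym(field2_txt:gb field2_txt:gib "
           "field2_txt:gigabyte field2_txt:gigabytes)")
per_field = [
    [("field2_ws:XY", 0), ("field2_ws:GB", 3)],
    [("field2_txt:xy", 0), (synonym, 3)],
    [("field1_ws:XY", 0), ("field1_ws:GB", 3)],
]
parsed = regroup_by_start_offset(per_field)
print(parsed)
```

The order of alternatives inside each dismax follows the input order here and is not significant. The point is that the grouping key is the offset, so it does not matter whether every field contributes the same number of clauses.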

Phew! It would be nice to hear from folks about this with any feedback :) I'll 
work on getting the prototype code ready to look at.


[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2023-02-28 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650967#comment-17650967
 ] 

Rudi Seitz edited comment on SOLR-16594 at 2/28/23 7:38 PM:


This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Add an int startOffset to org.apache.lucene.index.Term. Alternatively, 
create a TermWithOffset subclass of Term
 # Update org.apache.lucene.util.QueryBuilder so that 
createFieldQuery() returns a Query that contains Terms with the startOffset 
properly set. This is the place where we iterate through the token stream and 
have access to the offsets so we can store them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"Make a dismax query for each clause position in the boolean 
per-field queries"


was (Author: JIRAUSER297477):
This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Create a subclass of org.apache.lucene.index.Term that is capable of holding 
a startOffset. Possibly name it TermWithOffset
 # Update or subclass org.apache.lucene.util.QueryBuilder so that 
createFieldQuery() returns a Query that contains one or more TermWithOffset 
instead of simple Terms, where appropriate. This is the place where we iterate 
through the token stream and have access to the offsets to potentially store 
them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"Make a dismax query for each clause position in the boolean 
per-field queries"


[jira] [Comment Edited] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-13 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687959#comment-17687959
 ] 

Rudi Seitz edited comment on SOLR-16652 at 2/13/23 2:32 PM:


If the original rule is "foo bar,baz" I believe Mikhail's suggestion to convert 
it to a directional rule would work, but we'd need two of them:

foo bar=>baz,foo,bar

baz=>baz,foo bar
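
To illustrate what the two directional rules map to, here is a toy Python reader for the two rule shapes used in this thread: a comma-separated equivalence list and an explicit "=>" mapping. It only models which inputs expand to which outputs. It is not Solr's synonym parser, and it says nothing about how the expansions are combined into MUST or SHOULD clauses. The function name parse_rules is made up.

```python
from typing import Dict, List


def parse_rules(lines: List[str]) -> Dict[str, List[str]]:
    """Map each input phrase to the phrases it expands to."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand phrases are replaced by the right side.
            left, right = line.split("=>", 1)
            inputs = [p.strip() for p in left.split(",")]
            outputs = [p.strip() for p in right.split(",")]
        else:
            # Equivalence list: every phrase expands to all phrases.
            inputs = outputs = [p.strip() for p in line.split(",")]
        for phrase in inputs:
            targets = mapping.setdefault(phrase, [])
            for out in outputs:
                if out not in targets:
                    targets.append(out)
    return mapping


equivalence = parse_rules(["foo bar,baz"])
directional = parse_rules(["foo bar=>baz,foo,bar", "baz=>baz,foo bar"])
print(equivalence)  # {'foo bar': ['foo bar', 'baz'], 'baz': ['foo bar', 'baz']}
print(directional)  # {'foo bar': ['baz', 'foo', 'bar'], 'baz': ['baz', 'foo bar']}
```

With the directional rules, "foo" on its own is not a key, so a query for "foo" would not be expanded to "bar".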


was (Author: JIRAUSER297477):
If the original rule is "foo bar,baz" I believe Mikhail's suggestion to convert 
it to a directional rule would work, but we'd need two of them:

foo bar=>baz,foo,bar

baz=>foo bar

> multi-term synonym rule applied at query time prevents single-term matching
> ---
>
> Key: SOLR-16652
> URL: https://issues.apache.org/jira/browse/SOLR-16652
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 9.1
>Reporter: Rudi Seitz
>Priority: Major
>
> The presence of a multi-term synonym equivalence rule applied at query time 
> prevents matching on individual terms in the synonym.
> If we issue an edismax query against a text_general field in Solr 9.1, and 
> the query string is "foo bar," we can match documents that have "foo" without 
> "bar" and vice versa. However, if there is a synonym rule like "foo bar,baz" 
> applied at query time, we no longer get single-term matches against "foo" or 
> "bar." Both terms are now required, but can occur in any position: a document 
> can match the query if it contains "foo bar" or "bar foo" or "bar qux foo", 
> for example, but not if it only contains "foo".
> However, if we change the text_general analysis chain to apply synonyms at 
> index time, the observed behavior also changes and single-term matches for 
> "foo" or "bar" are again possible.
> Why is this an issue? 1) it is counterintuitive that a synonym equivalence 
> (as opposed to a unidirectional mapping) would give narrower recall than 
> without the rule, 2) this behavior represents a discrepancy in semantics 
> between index-time and query-time synonym expansion.
>  
> *STEPS TO REPRODUCE*
> Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
> these four docs:
>  
> {"id":"1", "title_txt":"foo"}
>  
> {"id":"2", "title_txt":"bar"}
>  
> {"id":"3", "title_txt":"foo bar"}
>  
> {"id":"4", "title_txt":"bar foo"}
>  
>  
> Issue a query for "foo bar" (i.e. defType=edismax=OR=title_txt=foo 
> bar)
> Result: Only docs 3 and 4 come back
>  
> Issue a query for "bar foo"
> Result: All four docs come back; the synonym rule is not invoked
>  
> *OBSERVATIONS*
> Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
> would mean that a query for "foo" could now match a document containing only 
> "bar", which is not the intent of the original rule.
> Note that we could set sow=true but this would prevent the multi-term synonym 
> from taking effect: the "foo bar" query could now get single-term matches on 
> "foo" or "bar" but couldn't get a match on the synonym "baz"
>  
> Returning to the original "foo bar,baz" synonym rule with sow=false, if we 
> look at the explain output for the "foo bar" query we see:
> {{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
>  
> Looking at the explain output for "bar foo" we see:
> {{+((title_txt:bar) (title_txt:foo))}}
>  
> So, the observed behavior makes sense according to the low-level query 
> structure, but is still counterintuitive for the reasons described above.
>  
> Why not expand the "foo bar" query like this instead?
>  
> {{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-13 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687959#comment-17687959
 ] 

Rudi Seitz commented on SOLR-16652:
---

If the original rule is "foo bar,baz" I believe Mikhail's suggestion to convert 
it to a directional rule would work, but we'd need two of them:

foo bar=>baz,foo,bar

baz=>foo bar




[jira] [Commented] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-13 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687954#comment-17687954
 ] 

Rudi Seitz commented on SOLR-16652:
---

From [~mkhl] via us...@solr.apache.org:
{quote}Thanks for raising a ticket. Here are just two considerations:
> we could change the synonym rule to "foo bar,baz,foo,bar" but this would
mean that a query for "foo" could now match a document containing only
"bar", which is not the intent of the original rule.
Ok. The later issue can be probably fixed by directing synonyms
foo bar=>baz,foo,bar
Right, It seems like a weird band aid.

I stepped through lucene code, MUST occur for synonyms is defined
[https://github.com/apache/lucene/blob/7baa01b3c2f93e6b172e986aac8ef577a87ebceb/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L534]
Presumably, original terms could go with defaultOperator, and synonym
replacement keep MUST.
{quote}




[jira] [Updated] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16652:
--
Description: 
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

 

{"id":"1", "title_txt":"foo"}

 

{"id":"2", "title_txt":"bar"}

 

{"id":"3", "title_txt":"foo bar"}

 

{"id":"4", "title_txt":"bar foo"}

 

 
Issue a query for "foo bar" (i.e. defType=edismax=OR=title_txt=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
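
The difference between the two expansions can be checked against the four test documents with a small Python model. This is a sketch of the boolean logic only: documents are whitespace-split token sets, and scoring, analysis and mm are ignored. The function names are made up for illustration.

```python
from typing import Callable, Dict, List, Set

DOCS: Dict[str, str] = {"1": "foo", "2": "bar", "3": "foo bar", "4": "bar foo"}


def current_expansion(tokens: Set[str]) -> bool:
    """+((title_txt:baz (+title_txt:foo +title_txt:bar)))"""
    return "baz" in tokens or ("foo" in tokens and "bar" in tokens)


def proposed_expansion(tokens: Set[str]) -> bool:
    """+((title_txt:baz (title_txt:foo title_txt:bar)))"""
    return "baz" in tokens or "foo" in tokens or "bar" in tokens


def hits(matches: Callable[[Set[str]], bool]) -> List[str]:
    return [doc_id for doc_id, text in DOCS.items() if matches(set(text.split()))]


print(hits(current_expansion))   # ['3', '4']
print(hits(proposed_expansion))  # ['1', '2', '3', '4']
```

The current structure reproduces the observed result (only docs 3 and 4), while the structure with optional clauses would bring back all four docs, the same as the "bar foo" query.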
 

 

 

  was:
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foor" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

 

{"id":"1", "title_txt":"foo"}

 

{"id":"2", "title_txt":"bar"}

 

{"id":"3", "title_txt":"foo bar"}

 

{"id":"4", "title_txt":"bar foo"}

 

 
Issue a query for "foo bar" (i.e. defType=edismax=OR=title_txt=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz 

[jira] [Updated] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16652:
--
Description: 
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

 

{"id":"1", "title_txt":"foo"}

 

{"id":"2", "title_txt":"bar"}

 

{"id":"3", "title_txt":"foo bar"}

 

{"id":"4", "title_txt":"bar foo"}

 

 
Issue a query for "foo bar" (i.e. defType=edismax&q.op=OR&qf=title_txt&q=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
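
To make the difference between these query shapes concrete, here is a minimal sketch in Python that evaluates the three structures above against the four test documents. It only models required (+) and optional clauses over whitespace-split tokens; it is not how Lucene executes queries, and all helper names are hypothetical.

```python
# Toy model of the boolean structures shown in the explain output.
# Assumption: a document is just the set of its whitespace-separated tokens.
DOCS = {
    "1": "foo",
    "2": "bar",
    "3": "foo bar",
    "4": "bar foo",
}


def tokens(text):
    return set(text.split())


def term(t):
    """A single term clause such as title_txt:foo."""
    return lambda toks: t in toks


def must_all(*terms):
    """(+a +b): every term is required."""
    return lambda toks: all(t in toks for t in terms)


def should_any(*clauses):
    """(a b): at least one optional clause has to match."""
    return lambda toks: any(c(toks) for c in clauses)


def matching_ids(query):
    return sorted(doc_id for doc_id, text in DOCS.items() if query(tokens(text)))


# +((title_txt:baz (+title_txt:foo +title_txt:bar)))  current "foo bar" expansion
current = should_any(term("baz"), must_all("foo", "bar"))

# +((title_txt:bar) (title_txt:foo))  "bar foo", synonym rule not invoked
no_synonym = should_any(term("bar"), term("foo"))

# +((title_txt:baz (title_txt:foo title_txt:bar)))  expansion suggested here
proposed = should_any(term("baz"), should_any(term("foo"), term("bar")))

print(matching_ids(current))     # ['3', '4']
print(matching_ids(no_synonym))  # ['1', '2', '3', '4']
print(matching_ids(proposed))    # ['1', '2', '3', '4']
```

Under this simplified model, the suggested expansion restores the single-term matches while still allowing a match on "baz".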
 

 

 


[jira] [Updated] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16652:
--
Description: 
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

{"id":"1", "title_txt":"foo"}

{"id":"2", "title_txt":"bar"}

{"id":"3", "title_txt":"foo bar"}

{"id":"4", "title_txt":"bar foo"}

 
Issue a query for "foo bar" (i.e. defType=edismax&q.op=OR&qf=title_txt&q=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
 

 

 


[jira] [Updated] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16652:
--
Description: 
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

{{{"id":"1", "title_txt":"foo"} }}

{{{"id":"2", "title_txt":"bar"} }}

{{{"id":"3", "title_txt":"foo bar"} }}

{{{"id":"4", "title_txt":"bar foo"}}}

 
Issue a query for "foo bar" (i.e. defType=edismax&q.op=OR&qf=title_txt&q=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
 

 

 


[jira] [Updated] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16652:
--
Description: 
The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

 

*STEPS TO REPRODUCE*

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

{"id":"1", "title_txt":"foo"}
{"id":"2", "title_txt":"bar"}
{"id":"3", "title_txt":"foo bar"}
{"id":"4", "title_txt":"bar foo"}

 
Issue a query for "foo bar" (i.e. defType=edismax&q.op=OR&qf=title_txt&q=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
 

*OBSERVATIONS*

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
 

 

 


[jira] [Created] (SOLR-16652) multi-term synonym rule applied at query time prevents single-term matching

2023-02-10 Thread Rudi Seitz (Jira)
Rudi Seitz created SOLR-16652:
-

 Summary: multi-term synonym rule applied at query time prevents 
single-term matching
 Key: SOLR-16652
 URL: https://issues.apache.org/jira/browse/SOLR-16652
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: query parsers
Affects Versions: 9.1
Reporter: Rudi Seitz


The presence of a multi-term synonym equivalence rule applied at query time 
prevents matching on individual terms in the synonym.

If we issue an edismax query against a text_general field in Solr 9.1, and the 
query string is "foo bar," we can match documents that have "foo" without "bar" 
and vice versa. However, if there is a synonym rule like "foo bar,baz" applied 
at query time, we no longer get single-term matches against "foo" or "bar." 
Both terms are now required, but can occur in any position: a document can 
match the query if it contains "foo bar" or "bar foo" or "bar qux foo", for 
example, but not if it only contains "foo".

However, if we change the text_general analysis chain to apply synonyms at 
index time, the observed behavior also changes and single-term matches for 
"foo" or "bar" are again possible.

Why is this an issue? 1) it is counterintuitive that a synonym equivalence (as 
opposed to a unidirectional mapping) would give narrower recall than without 
the rule, 2) this behavior represents a discrepancy in semantics between 
index-time and query-time synonym expansion.

STEPS TO REPRODUCE

Use the _default configset with "foo bar,baz" added to synonyms.txt. Index 
these four docs:

{"id":"1", "title_txt":"foo"}
{"id":"2", "title_txt":"bar"}
{"id":"3", "title_txt":"foo bar"}
{"id":"4", "title_txt":"bar foo"}
 
Issue a query for "foo bar" (i.e. defType=edismax&q.op=OR&qf=title_txt&q=foo 
bar)
Result: Only docs 3 and 4 come back
 
Issue a query for "bar foo"
Result: All four docs come back; the synonym rule is not invoked
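
For anyone who wants to script the reproduction, the following Python sketch builds the update payload and the two select URLs. It only constructs the requests and does not send them. The host, port and collection name (synonym_test) are hypothetical, and the request parameters (defType, q.op, qf, q, debugQuery) are my reading of the query described above, so adjust them to your setup.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL and collection name; adjust for your installation.
SOLR_BASE = "http://localhost:8983/solr/synonym_test"

# The four test documents from the steps above.
DOCS = [
    {"id": "1", "title_txt": "foo"},
    {"id": "2", "title_txt": "bar"},
    {"id": "3", "title_txt": "foo bar"},
    {"id": "4", "title_txt": "bar foo"},
]


def update_request():
    """Return (url, body) for posting the documents as a JSON array."""
    return f"{SOLR_BASE}/update?commit=true", json.dumps(DOCS)


def select_url(user_query):
    """Return the select URL for an edismax query against title_txt."""
    params = {
        "defType": "edismax",
        "q.op": "OR",
        "qf": "title_txt",
        "q": user_query,
        "debugQuery": "true",  # asks Solr to include the parsed query
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


print(select_url("foo bar"))
print(select_url("bar foo"))
```

Comparing the parsed query in the debug section of the two responses should show the difference discussed in the observations.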
 

OBSERVATIONS:

Note that we could change the synonym rule to "foo bar,baz,foo,bar" but this 
would mean that a query for "foo" could now match a document containing only 
"bar", which is not the intent of the original rule.

Note that we could set sow=true but this would prevent the multi-term synonym 
from taking effect: the "foo bar" query could now get single-term matches on 
"foo" or "bar" but couldn't get a match on the synonym "baz"
 
Returning to the original "foo bar,baz" synonym rule with sow=false, if we look 
at the explain output for the "foo bar" query we see:

{{+((title_txt:baz (+title_txt:foo +title_txt:bar)))}}
 
Looking at the explain output for "bar foo" we see:

{{+((title_txt:bar) (title_txt:foo))}}
 
So, the observed behavior makes sense according to the low-level query 
structure, but is still counterintuitive for the reasons described above.
 
Why not expand the "foo bar" query like this instead?
 
{{+((title_txt:baz (title_txt:foo title_txt:bar)))}}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-12779) Force field/term centric matching mode for multi-term synonyms with sow=false

2023-01-17 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677902#comment-17677902
 ] 

Rudi Seitz commented on SOLR-12779:
---

Linking to SOLR-16594 which contains a proposal for addressing this issue.

> Force field/term centric matching mode for multi-term synonyms with sow=false
> -
>
> Key: SOLR-12779
> URL: https://issues.apache.org/jira/browse/SOLR-12779
> Project: Solr
>  Issue Type: Improvement
>Affects Versions: 8.0
>Reporter: Amrit Sarkar
>Priority: Major
>
> As Doug Turnbull pointed out on the solr-user mailing list: 
> [https://lists.apache.org/thread.html/27590a2d8598be515b24f47f7912e074d2010910242cfdeb1fcd655d%40%3Csolr-user.lucene.apache.org%3E]
>  (recommended reading, especially for his discussion of the limitations of 
> the new sow=false request parameter), sow=false changes the queries edismax 
> produces over multiple fields when any of the fields’ query-time analysis 
> differs from the other fields’, e.g. if one field’s analyzer removes 
> stopwords when another field’s doesn’t. In this case, rather than a 
> dismax-query-per-whitespace-separated-term (edismax’s behavior when 
> sow=true), a dismax query per field is produced. This can change results in 
> general, but quite significantly when combined with the mm (min-should-match) 
> request parameter: since min-should-match applies per field instead of per 
> term, missing terms in one field’s analysis won’t disqualify docs from 
> matching. E.g. query “Terminator 100” with request param “mm=100%” against 
> both a title (text) field and a run_length (integer) field will result in the 
> following queries:
>  When sow=true:
> {code:java}
> +(DisjunctionMaxQuery((title:terminator)) 
> DisjunctionMaxQuery((run_length:[100 TO 100] | title:100)))~2{code}
> When sow=false:
> {code:java}
> +DisjunctionMaxQuery((run_length:[100 TO 100] | ((title:terminator 
> title:100)~2))){code}
> In the above scenario, when sow=true (and in versions of Solr before 6.5), 
> “terminator” must appear in documents in order to produce a match. But when 
> sow=false, a document can match if its run_length field is 100, even when the 
> title does not contain “terminator”.
> It would be good to have an option to force term-centric or field-centric 
> matching at query parsing, so that the expected behavior can be achieved. This 
> was discussed under 
> [http://lucene.472066.n3.nabble.com/Split-on-whitespace-parameter-doubt-td4404185.html].
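
As an illustration of the mm behavior described in the quoted ticket, here is a minimal Python sketch of the two query shapes for "Terminator 100" with mm=100%. It is a toy model of the boolean logic only, not Solr code; the Movie class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Movie:
    title: str
    run_length: int


def title_has(doc, word):
    return word in doc.title.lower().split()


def matches_sow_true(doc):
    """One dismax per query term; ~2 means both dismax clauses are required."""
    clause_terminator = title_has(doc, "terminator")
    clause_100 = doc.run_length == 100 or title_has(doc, "100")
    return sum([clause_terminator, clause_100]) >= 2


def matches_sow_false(doc):
    """One dismax over per-field queries; ~2 applies inside the title query only."""
    title_query = sum([title_has(doc, "terminator"), title_has(doc, "100")]) >= 2
    run_length_query = doc.run_length == 100
    return run_length_query or title_query


rambo = Movie(title="Rambo", run_length=100)
print(matches_sow_true(rambo))   # False: "terminator" is missing from the title
print(matches_sow_false(rambo))  # True: run_length alone satisfies the dismax
```

The second result is the surprising one: with the field-centric shape, a document can match on run_length even though its title never mentions "terminator".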






[jira] [Commented] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650967#comment-17650967
 ] 

Rudi Seitz commented on SOLR-16594:
---

This is a rough outline of the code changes that might be needed to implement 
the proposal in this ticket:
 # Create a subclass of org.apache.lucene.index.Term that is capable of holding 
a startOffset. Possibly name it TermWithOffset.
 # Update or subclass org.apache.lucene.util.QueryBuilder so that 
createFieldQuery() returns a Query that contains one or more TermWithOffset 
instead of simple Terms, where appropriate. This is the place where we iterate 
through the token stream and have access to the offsets to potentially store 
them on the generated Terms.
 # Update org.apache.solr.search.ExtendedDismaxQParser so that 
getAliasedMultiTermQuery() builds clauses based on startOffset instead of the 
current approach of calling allSameQueryStructure() and then doing 
"Make a dismax query for each clause position in the boolean per-field 
queries".

> eDismax should use startOffset when converting per-field to per-term queries
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> the Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{{}qf=f1 f2{}}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way for edismax to see this connection. When 
> {{{}sow=true{}}}, edismax passes each whitespace-separated term through each 
> analysis chain separately, and therefore edismax "knows" that the output 
> tokens from any given analysis chain are all derived from the single input 
> term that was passed into that chain. However, when {{{}sow=false{}}}, 
> edismax passes the entire multi-term query through each analysis chain as a 
> whole, resulting in multiple output tokens that are not "connected" to their 
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by 
> first generating a boolean query for each field, and then checking whether 
> all of these per-field queries have the same structure. The structure will 
> generally be uniform if each analysis chain emits the same number of tokens 
> for the given input. If one chain has a synonym filter and another doesn’t, 
> this uniformity may depend on whether a synonym rule happened to match a term 
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax 
> reorganizes them into a new boolean query. The new query contains a dismax 
> for each clause position in the original queries. If the original queries are 
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
> each, so we would get a dismax containing all the first position clauses 
> {{(f1:foo f2:foo)}} and another dismax containing all the second position 

[jira] [Updated] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which the Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}
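
To make the two interpretations concrete, here is a minimal sketch that builds both query strings from a list of terms and a list of fields. It is only a string-level illustration; the function names are hypothetical and it does not use any Solr or Lucene API.

```python
def field_centric(terms, fields):
    # One group per field; each group holds every term for that field.
    return " ".join(
        "(" + " ".join(f"{field}:{term}" for term in terms) + ")"
        for field in fields
    )


def term_centric(terms, fields):
    # One group per term; each group holds that term in every field.
    return " ".join(
        "(" + " ".join(f"{field}:{term}" for field in fields) + ")"
        for term in terms
    )


terms = "foo bar".split()
fields = "f1 f2".split()
print(field_centric(terms, fields))  # (f1:foo f1:bar) (f2:foo f2:bar)
print(term_centric(terms, fields))   # (f1:foo f2:foo) (f1:bar f2:bar)
```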

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.
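
The position-based regrouping can be sketched as a transpose over the per-field clause lists. This is a simplified model, not the actual edismax code: "same structure" is approximated here as "same number of clauses", and the function name and the None return value for the fallback case are assumptions made for illustration.

```python
def regroup_by_position(per_field):
    """per_field maps a field name to its ordered list of clauses.

    Returns one group per clause position, or None when the fields
    produced different numbers of clauses (the case where a parser
    would have to fall back to the field-centric form).
    """
    lengths = {len(clauses) for clauses in per_field.values()}
    if len(lengths) != 1:
        return None
    # zip(*...) pairs up the first clause of every field, then the second, etc.
    return [list(group) for group in zip(*per_field.values())]


uniform = {"f1": ["f1:foo", "f1:bar"], "f2": ["f2:foo", "f2:bar"]}
print(regroup_by_position(uniform))
# [['f1:foo', 'f2:foo'], ['f1:bar', 'f2:bar']]

# A synonym expansion in f2 changes the clause count, so regrouping fails.
uneven = {"f1": ["f1:xy", "f1:gb"], "f2": ["f2:xy", "f2:gb", "f2:gigabyte"]}
print(regroup_by_position(uneven))  # None
```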

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.
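
A rough sketch of grouping by startOffset follows. The two "analyzers" are toy stand-ins written for this example (a whitespace splitter and a prefix n-gram generator that mimics the "f", "fo", "foo" output described above); they are not Lucene classes, and real analysis chains may assign offsets differently.

```python
import re
from collections import defaultdict


def whitespace_tokens(text):
    # Each token is (term, startOffset), where startOffset is the index
    # of the token's first character in the source text.
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def prefix_ngram_tokens(text):
    # Toy n-gram analyzer: every prefix of a term keeps the term's startOffset.
    tokens = []
    for term, start in whitespace_tokens(text):
        for size in range(1, len(term) + 1):
            tokens.append((term[:size], start))
    return tokens


def group_by_start_offset(per_field_tokens):
    # One group (a would-be dismax) per distinct startOffset, across all fields.
    groups = defaultdict(list)
    for field, tokens in per_field_tokens.items():
        for term, start in tokens:
            groups[start].append(f"{field}:{term}")
    return [groups[offset] for offset in sorted(groups)]


query = "foo bar"
groups = group_by_start_offset(
    {"f1": whitespace_tokens(query), "f2": prefix_ngram_tokens(query)}
)
for group in groups:
    print("(" + " | ".join(group) + ")")
# (f1:foo | f2:f | f2:fo | f2:foo)
# (f1:bar | f2:b | f2:ba | f2:bar)
```

Note that the two fields emit different numbers of tokens here, yet the grouping still recovers one group per input term.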

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and 

[jira] [Updated] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16594:
--
Description: 
When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input.

Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.

Here's an example of how the proposed edismax logic would work. Let's say a 
user searches for "foo bar" across two fields, f1 and f2, where f1 uses a 
standard text analysis chain while f2 generates ngrams. We would get 
field-centric queries {{(f1:foo f1:bar)}} and 

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:37 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws]}}
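
For convenience, the request can also be assembled programmatically. The sketch below only builds the URL from the four parameters listed above; the host, port, and the collection name "test" are taken from the URL in this comment, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_select_url(base_url, collection, params):
    # quote_via=quote encodes spaces as %20 (instead of "+") and "%" as %25.
    query = urlencode(params, quote_via=quote)
    return f"{base_url}/solr/{collection}/select?{query}"


params = {
    "defType": "edismax",
    "mm": "100%",
    "q": "XY GB",
    "qf": "field1_ws field2_ws",
}
print(build_select_url("http://localhost:8983", "test", params))
# http://localhost:8983/solr/test/select?defType=edismax&mm=100%25&q=XY%20GB&qf=field1_ws%20field2_ws
```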

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, so it emits a different number of tokens than the 
ws fields. That mismatch causes edismax to generate a field-centric query, 
which requires both "XY" and "GB" to match in _one_ of the provided qf fields. 
It is counterintuitive that expanding the search to include more fields 
actually _reduces_ recall here, but not in the following case:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents 3 and 4 are returned for both versions 
of qf – there is no change in recall when we add field2_txt to qf. That is 
because there is no synonym rule for GC, so even though ws and txt fields have 
"incompatible" analysis chains they happen to generate the same number of 
tokens for this particular query and edismax is able to stay with the 
term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".
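
The recall difference described above can be reproduced with a toy in-memory model. This is not Solr: matching is plain whitespace-token membership, synonyms and scoring are ignored, and which interpretation edismax picks is not derived but simply chosen by calling one function or the other. All names are hypothetical.

```python
DOCS = {
    "1": {"field1_ws": "XY GB"},
    "2": {"field1_ws": "XY", "field2_ws": "GB", "field2_txt": "GB"},
    "3": {"field1_ws": "XY GC"},
    "4": {"field1_ws": "XY", "field2_ws": "GC", "field2_txt": "GC"},
}


def tokens(doc, field):
    # Whitespace tokens of a field; an absent field has no tokens.
    return set(doc.get(field, "").split())


def term_centric_ids(terms, qf):
    # mm=100%: every term must match, but each term may match in any field.
    return [
        doc_id
        for doc_id, doc in DOCS.items()
        if all(any(t in tokens(doc, f) for f in qf) for t in terms)
    ]


def field_centric_ids(terms, qf):
    # mm=100%: all terms must match within one and the same field.
    return [
        doc_id
        for doc_id, doc in DOCS.items()
        if any(all(t in tokens(doc, f) for t in terms) for f in qf)
    ]


print(term_centric_ids(["XY", "GB"], ["field1_ws", "field2_ws"]))
# ['1', '2']
print(field_centric_ids(["XY", "GB"], ["field1_ws", "field2_ws", "field2_txt"]))
# ['1']
print(term_centric_ids(["XY", "GC"], ["field1_ws", "field2_ws", "field2_txt"]))
# ['3', '4']
```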

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when 

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:34 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{"id":"1", "field1_ws":"XY GB"}}
{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}
{{"id":"3", "field1_ws":"XY GC"}}
{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, so it emits a different number of tokens than the 
ws fields. That mismatch causes edismax to generate a field-centric query, 
which requires both "XY" and "GB" to match in _one_ of the provided qf fields. 
It is counterintuitive that expanding the search to include more fields 
actually _reduces_ recall here, but not in the following case:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

 

{{{"id":"1", "field1_ws":"XY GB"}}}

{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}

{{{"id":"3", "field1_ws":"XY GC"}}}

{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

 

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in 

[jira] [Comment Edited] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650895#comment-17650895
 ] 

Rudi Seitz edited comment on SOLR-16594 at 12/21/22 2:32 PM:
-

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

 

{{{"id":"1", "field1_ws":"XY GB"}}}

{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}

{{{"id":"3", "field1_ws":"XY GC"}}}

{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

 

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws]}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{[http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt]}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, so it emits a different number of tokens than the 
ws fields. That mismatch causes edismax to generate a field-centric query, 
which requires both "XY" and "GB" to match in _one_ of the provided qf fields. 
It is counterintuitive that expanding the search to include more fields 
actually _reduces_ recall here, but not in the following case:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though ws and txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 


was (Author: JIRAUSER297477):
Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, which leads to a different number of tokens from 
the _ws fields, which causes edismax to generate a field-centric query, which 
requires that the terms "XY" and "GB" must both match in _one_ of the provided 
qf fields. It is counterintuitive that expanding the range of the search to 
include more fields actually _reduces_ recall here, but not elsewhere:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though _ws and _txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change 

[jira] [Commented] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-21 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650895#comment-17650895
 ] 

Rudi Seitz commented on SOLR-16594:
---

Steps to reproduce inconsistent {{mm}} behavior caused by term-centric to 
field-centric shift. Tested in Solr 9.1.

Create a collection using the default schema and index the following documents:

{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}

Note that the default schema contains a synonym rule for GB which will be 
applied in _txt fields:

{{GB,gib,gigabyte,gigabytes}}

Now try the following edismax query for "XY GB" with "minimum should match" set 
to 100%:

{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}

{{http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws}}

Notice that BOTH document 1 and document 2 are returned. This is because 
edismax is generating a term-centric query which allows the terms "XY" and "GB" 
to match in any of the qf fields.

Now add the txt version of field2 to the qf:

{{qf=field1_ws field2_ws field2_txt}}

{{http://localhost:8983/solr/test/select?defType=edismax=true=100%25=OR=XY%20GB=field1_ws%20field2_ws%20field2_txt}}

Rerun the query and notice that ONLY document 1 is returned. This is because 
field2_txt expands synonyms, so it emits a different number of tokens than the 
_ws fields. That mismatch causes edismax to generate a field-centric query, 
which requires both "XY" and "GB" to match in _one_ of the provided qf fields. 
It is counterintuitive that expanding the search to include more fields 
actually _reduces_ recall here, but not in the following case:

Repeat this experiment with {{q=XY GC}}

In this case, notice that BOTH documents are returned for both versions of qf – 
there is no change when we add field2_txt to qf. That is because there is no 
synonym rule for GC, so even though _ws and _txt fields have "incompatible" 
analysis chains they happen to generate the same number of tokens for this 
particular query and edismax is able to stay with the term-centric approach.

In these experiments we have been assuming the default {{sow=false}}. If we 
set {{sow=true}} we would see that the term-centric approach is used throughout 
and there is no change in behavior when we add field2_txt to qf, whether we are 
searching for "XY GB" or "XY GC".

 

 

 

> eDismax should use startOffset when converting per-field to per-term queries
> 
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Rudi Seitz
>Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes 
> switches from a "term-centric" to a "field-centric" approach. This creates 
> inconsistent semantics for the {{mm}} or "min should match" parameter and may 
> have an impact on scoring. The goal of this ticket is to improve the approach 
> that edismax uses for generating term-centric queries so that edismax would 
> less frequently "give up" and resort to the field-centric approach. 
> Specifically, we propose that edismax should create a dismax query for each 
> distinct startOffset found among the tokens emitted by the field analyzers. 
> Since the relevant code in edismax works with Query objects that contain 
> Terms, and since Terms do not hold the startOffset of the Token from which 
> Term was derived, some plumbing work would need to be done to make the 
> startOffsets available to edismax.
>  
> BACKGROUND:
>  
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
> interpretation of the query would contain a clause for each field:
> {{  (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{  (f1:foo f2:foo) (f1:bar f2:bar)}}
> The challenge in generating a term-centric query is that we need to take the 
> tokens that emerge from each field's analysis chain and group them according 
> to the terms in the user's original query. However, the tokens that emerge 
> from an analysis chain do not store a reference to their corresponding input 
> terms. For example, if we pass "foo bar" through an ngram analyzer we would 
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it 
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo" 
> input term, and that "b", "ba", and "bar" come from the "bar" input term, 
> there is not always an easy way 

[jira] [Created] (SOLR-16594) eDismax should use startOffset when converting per-field to per-term queries

2022-12-20 Thread Rudi Seitz (Jira)
Rudi Seitz created SOLR-16594:
-

 Summary: eDismax should use startOffset when converting per-field 
to per-term queries
 Key: SOLR-16594
 URL: https://issues.apache.org/jira/browse/SOLR-16594
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: query parsers
Reporter: Rudi Seitz


When parsing a multi-term query that spans multiple fields, edismax sometimes 
switches from a "term-centric" to a "field-centric" approach. This creates 
inconsistent semantics for the {{mm}} or "min should match" parameter and may 
have an impact on scoring. The goal of this ticket is to improve the approach 
that edismax uses for generating term-centric queries so that edismax would 
less frequently "give up" and resort to the field-centric approach. 
Specifically, we propose that edismax should create a dismax query for each 
distinct startOffset found among the tokens emitted by the field analyzers. 
Since the relevant code in edismax works with Query objects that contain Terms, 
and since Terms do not hold the startOffset of the Token from which Term was 
derived, some plumbing work would need to be done to make the startOffsets 
available to edismax.

 

BACKGROUND:

 

If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric 
interpretation of the query would contain a clause for each field:

{{  (f1:foo f1:bar) (f2:foo f2:bar)}}

while a term-centric interpretation would contain a clause for each term:

{{  (f1:foo f2:foo) (f1:bar f2:bar)}}

The challenge in generating a term-centric query is that we need to take the 
tokens that emerge from each field's analysis chain and group them according to 
the terms in the user's original query. However, the tokens that emerge from an 
analysis chain do not store a reference to their corresponding input terms. For 
example, if we pass "foo bar" through an ngram analyzer we would get a token 
stream containing "f", "fo", "foo", "b", "ba", "bar". While it may be obvious 
to a human that "f", "fo", and "foo" all come from the "foo" input term, and 
that "b", "ba", and "bar" come from the "bar" input term, there is not always 
an easy way for edismax to see this connection. When {{sow=true}}, edismax 
passes each whitespace-separated term through each analysis chain separately, 
and therefore edismax "knows" that the output tokens from any given analysis 
chain are all derived from the single input term that was passed into that 
chain. However, when {{sow=false}}, edismax passes the entire multi-term 
query through each analysis chain as a whole, resulting in multiple output 
tokens that are not "connected" to their source term.
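
The difference between the two modes can be sketched with a toy analyzer. This is an illustration only: {{edge_ngrams}} is a hypothetical stand-in for a whitespace tokenizer followed by an edge n-gram filter, and the two wrapper functions merely imitate how the input is fed to the chain.

```python
def edge_ngrams(text):
    """Toy analysis chain: whitespace tokenizer plus edge n-gram filter.

    Returns plain strings, mirroring the fact that the emitted tokens
    carry no reference to the input term they came from.
    """
    return [word[:n] for word in text.split() for n in range(1, len(word) + 1)]


def analyze_sow_true(query):
    # Each whitespace-separated term is analyzed separately, so the caller
    # knows which input term produced each group of output tokens.
    return [(term, edge_ngrams(term)) for term in query.split()]


def analyze_sow_false(query):
    # The whole query is analyzed at once: the result is a flat token list
    # with no link back to the source terms.
    return edge_ngrams(query)


print(analyze_sow_true("foo bar"))
print(analyze_sow_false("foo bar"))
```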

Edismax still tries to generate a term-centric query when {{sow=false}} by 
first generating a boolean query for each field, and then checking whether all 
of these per-field queries have the same structure. The structure will 
generally be uniform if each analysis chain emits the same number of tokens for 
the given input. If one chain has a synonym filter and another doesn’t, this 
uniformity may depend on whether a synonym rule happened to match a term in the 
user's input. 


Assuming the per-field boolean queries _do_ have the same structure, edismax 
reorganizes them into a new boolean query. The new query contains a dismax for 
each clause position in the original queries. If the original queries are 
{{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses 
each, so we would get a dismax containing all the first position clauses 
{{(f1:foo f2:foo)}} and another dismax containing all the second position 
clauses {{(f1:bar f2:bar)}}.
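
The positional regrouping can be sketched roughly as follows. This is a simplification under stated assumptions: "same structure" is reduced to "same number of clauses", clauses are plain strings, and {{regroup_by_position}} is a hypothetical name rather than an edismax method.

```python
def regroup_by_position(per_field_queries):
    """Return one group of clauses per clause position.

    per_field_queries maps a field name to its list of clauses.
    Returns None when the per-field queries differ in length, which
    stands in for the case where edismax falls back to field-centric.
    """
    lengths = {len(clauses) for clauses in per_field_queries.values()}
    if len(lengths) != 1:
        return None
    # zip(*...) transposes "per field" into "per position".
    return [list(group) for group in zip(*per_field_queries.values())]


uniform = {"f1": ["f1:foo", "f1:bar"], "f2": ["f2:foo", "f2:bar"]}
print(regroup_by_position(uniform))

# A synonym rule fired in f2 only, so the structures no longer line up.
skewed = {"f1": ["f1:gb"], "f2": ["f2:gb", "f2:gib", "f2:gigabyte"]}
print(regroup_by_position(skewed))
```

In the uniform case each resulting group would become one dismax; in the skewed case the heuristic has nothing to pair up by position.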

We can see that edismax is using clause position as a heuristic to reorganize 
the per-field boolean queries into per-term ones, even though it doesn't know 
for sure which clauses inside those per-field boolean queries are related to 
which input terms. We propose that a better way of reorganizing the per-field 
boolean queries is to create a dismax for each distinct startOffset seen among 
the tokens in the token streams emitted by each field analyzer. The startOffset 
of a token (rather, a PackedTokenAttributeImpl) is "the position of the first 
character corresponding to this token in the source text".

We propose that startOffset is a reasonable way of matching output tokens up 
with the input terms that gave rise to them. For example, if we pass "foo bar" 
through an ngram analysis chain we see that the foo-related tokens all have 
startOffset=0 while the bar-related tokens all have startOffset=4. Likewise, 
tokens that are generated via synonym expansion have a startOffset that points 
to the beginning of the matching input term. For example, if the query "GB" 
generates "GB gib gigabyte gigabytes" via synonym expansion, all of those four 
tokens would have startOffset=0.
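
A minimal sketch of the proposed grouping, assuming each analyzer can hand back (text, startOffset) pairs. The toy analyzers and {{group_by_start_offset}} are hypothetical and do not use Lucene's attribute classes; they only illustrate the idea of one group (one dismax) per distinct startOffset.

```python
import re
from collections import defaultdict


def whitespace_tokens(text):
    """Toy analyzer: whitespace tokens paired with their start offsets."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def edge_ngram_tokens(text):
    """Toy analyzer: edge n-grams that inherit the source word's offset."""
    return [(word[:n], start)
            for word, start in whitespace_tokens(text)
            for n in range(1, len(word) + 1)]


def group_by_start_offset(per_field_tokens):
    """Collect field:token clauses into one group per distinct startOffset."""
    groups = defaultdict(list)
    for field, tokens in per_field_tokens.items():
        for text, start in tokens:
            groups[start].append(f"{field}:{text}")
    return [groups[start] for start in sorted(groups)]


query = "foo bar"
groups = group_by_start_offset({
    "f1": whitespace_tokens(query),   # emits 2 tokens
    "f2": edge_ngram_tokens(query),   # emits 6 tokens
})
for group in groups:
    print(group)
```

Here the two fields emit different numbers of tokens, so a purely positional comparison would not line up, yet the offsets 0 and 4 still separate the foo-related clauses from the bar-related ones.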


[jira] [Commented] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-11-02 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627788#comment-17627788
 ] 

Rudi Seitz commented on SOLR-16496:
---

Thanks, [~dsmiley], I've updated the PR based on your feedback. I added support 
for tagging individual filters using LocalParams syntax and then specifying the 
tags to exclude via {{elevate.excludeTags}}.

Does the updated PR match what you had in mind?

A specific question came up: To create the set of filters to exclude based on 
their tags, I [adapted some 
code|https://github.com/apache/solr/pull/1154/commits/2a97643ce758029e5d68978eb0633fced927b515#diff-26b681890de6a262a6c94485956aa1ada9d2bff306dabe79962ec459843f76faR599]
 from 
[FacetProcessor#handleFilterExclusions()|https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/facet/FacetProcessor.java#L192].
I thought about abstracting it into a common utility method but wanted to 
avoid touching FacetProcessor for now. What would you recommend re: next steps?
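
For readers following along, the tag-based selection could be sketched like this. It is a toy illustration, not the code in the PR: the regex only understands a leading {{{!tag=...}}} and the helper names are hypothetical, whereas Solr's real LocalParams parsing handles far more.

```python
import re

# Matches only a leading {!tag=a,b} prefix; an assumption for this sketch.
_TAG_PREFIX = re.compile(r"^\{!tag=([^}\s]+)\}(.*)$")


def parse_fq(fq):
    """Split a filter string into (set of tags, filter body)."""
    match = _TAG_PREFIX.match(fq)
    if not match:
        return set(), fq
    return set(match.group(1).split(",")), match.group(2)


def partition_filters(fqs, exclude_tags):
    """Separate filters whose tags are listed in elevate.excludeTags.

    Returns (excluded, kept), each a list of filter bodies.
    """
    wanted = set(exclude_tags.split(","))
    excluded, kept = [], []
    for fq in fqs:
        tags, body = parse_fq(fq)
        (excluded if tags & wanted else kept).append(body)
    return excluded, kept


fqs = ["status:public", "{!tag=dt}doctype:pdf"]
print(partition_filters(fqs, "dt"))
```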

> provide option for Query Elevation Component to bypass filters
> --
>
> Key: SOLR-16496
> URL: https://issues.apache.org/jira/browse/SOLR-16496
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Reporter: Rudi Seitz
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Query Elevation Component respects the fq parameter.
> A document listed in elevate.xml or specified via the {{elevateIds}} 
> parameter must match the provided filter queries in order to be included in 
> the result set for a given query. Documents that don't match the filter 
> queries will be excluded regardless of whether they are supposed to be 
> "elevated."
> In some cases, this behavior is desirable; in other cases, it is not. For 
> example, an ecommerce landing page might filter products according to whether 
> they are in stock ({{fq=inStock:true}}) but might wish to show certain 
> promoted products regardless of inventory.
> This ticket asks for an {{elevateFilteredDocs}} parameter that could be set 
> to true to include elevated documents in the result set regardless of whether 
> they match the provided filter queries. The default would be false, in 
> accordance with the current behavior.
> This parameter would allow elevated documents to "bypass" the provided 
> filters, while keeping the filters in place for non-elevated documents.
> From an implementation standpoint, this parameter could be supported with 
> code in {{QueryElevationComponent#setQuery}} that updates the filter queries 
> in a similar way to how the main query is updated. When 
> {{elevateFilteredDocs=true}}, each filter query would become a boolean 
> "OR" of the original filter query with a second query matching the elevated 
> documents by id.






[jira] [Commented] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-11-01 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627263#comment-17627263
 ] 

Rudi Seitz commented on SOLR-16496:
---

Thanks for the feedback [~dsmiley] 

 

Here is the ref-guide example of how to exclude filters selectively while 
faceting:

{{q=mainquery&fq=status:public&fq=\{!tag=dt}doctype:pdf&facet=true&facet.field=\{!ex=dt}doctype}}

 

In the case of elevation, I agree, the same approach to tagging a filter would 
work nicely ({{fq=\{!tag=dt}doctype:pdf}}).

But do you have an opinion on how the QEC-specific parameter to exclude some 
tags should look?

 

In the case of faceting, the tags to exclude are specified via the local 
parameter ex, as in: {{facet.field=\{!ex=dt}doctype}}

 

In the case of elevation, I don't think there's an obvious query parameter to 
which a local parameter like ex could be attached. I also don't think "ex" is 
appropriate as a name for the parameter because, in the elevation scenario, the 
filters won't be "excluded" or "bypassed" entirely but rather 
updated/modified/broadened to include the elevated documents.

 

What do you think about providing the tags via a top-level query parameter like 
this?

{{q=mainquery=status:public=\{!tag=dt}doctype:pdf=dt}}

 







[jira] [Commented] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-10-31 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626855#comment-17626855
 ] 

Rudi Seitz commented on SOLR-16496:
---

PR here: https://github.com/apache/solr/pull/1154







[jira] [Updated] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-10-25 Thread Rudi Seitz (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rudi Seitz updated SOLR-16496:
--
Description: 
The Query Elevation Component respects the fq parameter.

A document listed in elevate.xml or specified via the {{elevateIds}} parameter 
must match the provided filter queries in order to be included in the result 
set for a given query. Documents that don't match the filter queries will be 
excluded regardless of whether they are supposed to be "elevated."

In some cases, this behavior is desirable; in other cases, it is not. For 
example, an ecommerce landing page might filter products according to whether 
they are in stock ({{fq=inStock:true}}) but might wish to show certain 
promoted products regardless of inventory.

This ticket asks for an {{elevateFilteredDocs}} parameter that could be set to 
true to include elevated documents in the result set regardless of whether they 
match the provided filter queries. The default would be false, in accordance 
with the current behavior.

This parameter would allow elevated documents to "bypass" the provided filters, 
while keeping the filters in place for non-elevated documents.

From an implementation standpoint, this parameter could be supported with code 
in {{QueryElevationComponent#setQuery}} that updates the filter queries in 
a similar way to how the main query is updated. When 
{{elevateFilteredDocs=true}}, each filter query would become a boolean 
"OR" of the original filter query with a second query matching the elevated 
documents by id.

  was:
The Query Elevation Component respects the fq parameter. 

A document listed in elevate.xml or specified via the {{elevateIds}} parameter 
must match the provided filter queries in order to be included in the result 
set for a given query. Documents that don't match the filter queries will be 
excluded regardless of whether they are supposed to be "elevated."

In some cases, this behavior is desirable; in other cases, it is not. For 
example, an ecommerce landing page might filter products according to whether 
they are in stock ({{fq=inStock:true}}) but might wish to show certain 
promoted products regardless of inventory.

This ticket asks for an {{elevateFilteredDocuments}} parameter that could be 
set to true to include elevated documents in the result set regardless of 
whether they match the provided filter queries. The default would be false, in 
accordance with the current behavior.

This parameter would allow elevated documents to "bypass" the provided filters, 
while keeping the filters in place for non-elevated documents.

From an implementation standpoint, this parameter could be supported with code 
in {{QueryElevationComponent#setQuery}} that updates the filter queries in 
a similar way to how the main query is updated. When 
{{elevateFilteredDocuments=true}}, each filter query would become a 
boolean "OR" of the original filter query with a second query matching the 
elevated documents by id.



[jira] [Commented] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-10-25 Thread Rudi Seitz (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624016#comment-17624016
 ] 

Rudi Seitz commented on SOLR-16496:
---

I've begun implementing this request here: 
[https://github.com/rseitz/solr/tree/SOLR-16496]

> provide option for Query Elevation Component to bypass filters
> --
>
> Key: SOLR-16496
> URL: https://issues.apache.org/jira/browse/SOLR-16496
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Reporter: Rudi Seitz
>Priority: Major
>
> The Query Elevation Component respects the fq parameter. 
> A document listed in elevate.xml or specified via the {{elevateIds}} 
> parameter must match the provided filter queries in order to be included in 
> the result set for a given query. Documents that don't match the filter 
> queries will be excluded regardless of whether they are supposed to be 
> "elevated."
> In some cases, this behavior is desirable; in other cases, it is not. For 
> example, an ecommerce landing page might filter products according to whether 
> they are in stock ({{fq=inStock:true}}) but might wish to show certain 
> promoted products regardless of inventory.
> This ticket asks for an {{elevateFilteredDocuments}} parameter that could be 
> set to true to include elevated documents in the result set regardless of 
> whether they match the provided filter queries. The default would be false, 
> in accordance with the current behavior.
> This parameter would allow elevated documents to "bypass" the provided 
> filters, while keeping the filters in place for non-elevated documents.
> From an implementation standpoint, this parameter could be supported with 
> code in {{QueryElevationComponent#setQuery}} that updates the filter queries 
> in a similar way to how the main query is updated. When 
> {{elevateFilteredDocuments=true}}, each filter query would become a 
> boolean "OR" of the original filter query with a second query matching the 
> elevated documents by id.






[jira] [Created] (SOLR-16496) provide option for Query Elevation Component to bypass filters

2022-10-25 Thread Rudi Seitz (Jira)
Rudi Seitz created SOLR-16496:
-

 Summary: provide option for Query Elevation Component to bypass 
filters
 Key: SOLR-16496
 URL: https://issues.apache.org/jira/browse/SOLR-16496
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SearchComponents - other
Reporter: Rudi Seitz


The Query Elevation Component respects the fq parameter. 

A document listed in elevate.xml or specified via the {{elevateIds}} parameter 
must match the provided filter queries in order to be included in the result 
set for a given query. Documents that don't match the filter queries will be 
excluded regardless of whether they are supposed to be "elevated."

In some cases, this behavior is desirable; in other cases, it is not. For 
example, an ecommerce landing page might filter products according to whether 
they are in stock ({{fq=inStock:true}}) but might wish to show certain 
promoted products regardless of inventory.

This ticket asks for an {{elevateFilteredDocuments}} parameter that could be 
set to true to include elevated documents in the result set regardless of 
whether they match the provided filter queries. The default would be false, in 
accordance with the current behavior.

This parameter would allow elevated documents to "bypass" the provided filters, 
while keeping the filters in place for non-elevated documents.

From an implementation standpoint, this parameter could be supported with code 
in {{QueryElevationComponent#setQuery}} that updates the filter queries in 
a similar way to how the main query is updated. When 
{{elevateFilteredDocuments=true}}, each filter query would become a 
boolean "OR" of the original filter query with a second query matching the 
elevated documents by id.
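
As a rough illustration of the broadening described above, the following Python sketch rewrites a filter as a query string. It is only a sketch under assumptions: the unique key field is assumed to be called {{id}}, the helper name is hypothetical, ids are not escaped, and an actual implementation would operate on Query objects inside the component rather than on strings.

```python
def broaden_filter(fq, elevated_ids, id_field="id"):
    """OR a filter with a clause matching the elevated documents by id.

    id_field is an assumption; a real schema may use a different
    unique key field.
    """
    if not elevated_ids:
        return fq  # nothing is elevated, so leave the filter untouched
    ids = " OR ".join(f'"{doc_id}"' for doc_id in elevated_ids)
    return f"({fq}) OR {id_field}:({ids})"


print(broaden_filter("inStock:true", ["doc1", "doc2"]))
# (inStock:true) OR id:("doc1" OR "doc2")
```

Non-elevated documents still have to satisfy the original clause, while elevated documents match through the id clause regardless of inventory.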


