[ 
https://issues.apache.org/jira/browse/SOLR-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456929#comment-16456929
 ] 

Elizabeth Haubert edited comment on SOLR-12243 at 4/27/18 7:26 PM:
-------------------------------------------------------------------

The fix I pushed up really only handles one direction well, and only for pf2: 
the case where the query starts from the single-word synonym.  That is, 
matching "foo bar" queries to "foo tropical cyclone" documents (with "bar" and 
"tropical cyclone" as synonyms).  This was a real problem for my use case, 
because the pf clauses weren't being generated at all.

The other direction, matching "foo tropical cyclone" queries to "foo bar" 
documents, is harder.  I've gone a little way into the pf2 "b tropical" 
problem, but it goes deeper than the spans getting thrown out because they 
were the wrong type of query.  Start small.

Here's what I've got for the other direction:

One of the first things edismax does is generate a list of different kinds of 
clauses from the user query, and that seems to be unaffected by the sow flag. 
So "foo tropical cyclone" has three bareword clauses: "foo", "tropical", and 
"cyclone".  But 'foo "tropical cyclone"' (with quotes) has two: a bareword 
"foo" and a phrase "tropical cyclone".  When it goes to construct pf2 and pf3, 
edismax re-assembles the bareword clauses, then makes the 2- and 3-word 
shingles.  So "foo tropical cyclone" would get pf2="foo tropical" and 
pf2="tropical cyclone".  pf2="foo tropical" can't get expanded, because it is 
missing "cyclone", and will go through as it is.  pf2="tropical cyclone" will 
get expanded, but then removed as not a phrase, not just because it is a Span. 
That seems consistent if we think of "tropical cyclone" as a single entity.  So 
to do anything, we need to address how the shingle queries are being 
constructed.
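
As a rough illustration of the clause handling described above, here is a toy 
model in Python.  It is not the actual edismax code: the function names 
split_clauses and bareword_shingles are made up for this sketch, and it ignores 
fields, operators, and synonym expansion entirely.

```python
import re

def split_clauses(query):
    """Split a user query into (kind, text) clauses.

    Quoted runs become "phrase" clauses; everything else becomes one
    "bareword" clause per whitespace-separated token.
    """
    clauses = []
    for match in re.finditer(r'"([^"]*)"|(\S+)', query):
        if match.group(1) is not None:
            clauses.append(("phrase", match.group(1)))
        else:
            clauses.append(("bareword", match.group(2)))
    return clauses

def bareword_shingles(clauses, size):
    """Build n-word shingles from the bareword clauses only (phrases skipped)."""
    words = [text for kind, text in clauses if kind == "bareword"]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

unquoted = split_clauses('foo tropical cyclone')
quoted = split_clauses('foo "tropical cyclone"')

print(unquoted)                        # three bareword clauses
print(quoted)                          # one bareword, one phrase
print(bareword_shingles(unquoted, 2))  # pf2-style shingles
print(bareword_shingles(unquoted, 3))  # pf3-style shingles
print(bareword_shingles(quoted, 2))    # nothing left to shingle
```

In this model the quoted query leaves only one bareword, so no 2-word shingle 
can be built from it at all.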

I opened Jira-12260 to start looping the phrases into the pf clauses, not just 
the barewords, because that has some other weird semantics.  So 'foo "tropical 
cyclone" baz' (with quotes) would generate pf="foo baz", which is unintuitive; 
it would make more sense if it became 'foo "tropical cyclone"' and '"tropical 
cyclone" baz'.  Beyond looking a little into whether the graph queries could 
handle the phrase, I haven't really dug into how to do that yet.
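
To make the difference concrete, here is a small Python sketch contrasting the 
two behaviours on already-split clauses.  Both functions (current_pf and 
proposed_shingles) are hypothetical names for illustration, not Solr APIs, and 
the "proposed" output is only one possible reading of the semantics suggested 
above.

```python
def current_pf(clauses):
    """Model of the behaviour described above: pf is built from barewords only."""
    words = [text for kind, text in clauses if kind == "bareword"]
    return " ".join(words)

def proposed_shingles(clauses, size=2):
    """Assumed alternative: treat each phrase clause as one unit when shingling."""
    units = ['"%s"' % text if kind == "phrase" else text
             for kind, text in clauses]
    return [" ".join(units[i:i + size]) for i in range(len(units) - size + 1)]

# Clauses for the query: foo "tropical cyclone" baz
clauses = [
    ("bareword", "foo"),
    ("phrase", "tropical cyclone"),
    ("bareword", "baz"),
]

print(current_pf(clauses))         # foo baz
print(proposed_shingles(clauses))  # ['foo "tropical cyclone"', '"tropical cyclone" baz']
```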

That matters here, because if that works and the semantics are acceptable, 
multi-word synonyms are already handled as quoted in the logic that creates the 
graph queries.  So it would (probably) be safe to take that another step and 
stuff the multi-word synonyms into a single phrase clause, rather than 
individual bareword clauses.  Maybe.
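
A minimal sketch of what that last step might look like, in Python and purely 
as an assumption: group_multiword_synonyms is a hypothetical helper, the 
synonym set is passed in by hand rather than read from the analyzer, and 
matching is a simple greedy longest-match over the barewords.

```python
def group_multiword_synonyms(words, multiword_synonyms):
    """Merge runs of barewords that spell a known multi-word synonym
    into a single phrase clause; leave everything else as barewords."""
    # Try longer synonyms first so the longest match wins.
    phrases = sorted(
        (tuple(s.split()) for s in multiword_synonyms if s.split()),
        key=len,
        reverse=True,
    )
    clauses = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            if tuple(words[i:i + len(phrase)]) == phrase:
                clauses.append(("phrase", " ".join(phrase)))
                i += len(phrase)
                break
        else:
            # No multi-word synonym starts here; keep the bareword.
            clauses.append(("bareword", words[i]))
            i += 1
    return clauses

synonyms = {"tropical cyclone", "acetylsalicylic acid"}
print(group_multiword_synonyms(["foo", "tropical", "cyclone"], synonyms))
print(group_multiword_synonyms(["foo", "bar"], synonyms))
```

With the clauses grouped this way, "tropical cyclone" would reach the shingle 
step as one unit instead of two separate barewords.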

> Edismax missing phrase queries when phrases contain multiterm synonyms
> ----------------------------------------------------------------------
>
>                 Key: SOLR-12243
>                 URL: https://issues.apache.org/jira/browse/SOLR-12243
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>    Affects Versions: 7.1
>         Environment: RHEL, MacOS X
> Do not believe this is environment-specific.
>            Reporter: Elizabeth Haubert
>            Priority: Major
>         Attachments: SOLR-12243.patch
>
>
> synonyms.txt:
> allergic, hypersensitive
> aspirin, acetylsalicylic acid
> dog, canine, canis familiris, k 9
> rat, rattus
> request handler:
> <requestHandler name="/test_qparse_error" class="solr.SearchHandler">
>  <lst name="defaults">
> <!-- Query settings -->
>  <str name="defType">edismax</str>
>  <str name="tie"> 0.4</str>
>  <str name="qf">title^100</str>
>  <str name="pf">title~20^5000</str>
>  <str name="pf2">title~11</str>
>  <str name="pf3">title~22^1000</str>
>  <str name="df">text</str>
>  <!-- mm If two or fewer clauses exist, they all must match. 
>  If three to five clauses exist, one can be missing. If six to eight clauses 
> exist, all but three must match. 
>  If more than nine clauses exist, only require 30% to match.-->
>  <str name="mm">3&lt;-1 6&lt;-3 9&lt;30%</str>
>  <str name="q.alt">*:*</str>
>  <str name="rows">25</str>
> </lst>
>  </requestHandler>
> Phrase queries (pf, pf2, pf3) containing "dog" or "aspirin"  against the 
> above list will not be generated.
> "allergic reaction dog" will generate pf2: "allergic reaction", but not 
> pf:"allergic reaction dog", pf2: "reaction dog", or pf3: "allergic reaction 
> dog"
> "aspirin dose in rats" will generate pf3: "dose ? rats" but not pf2: "aspirin 
> dose" or pf3:"aspirin dose ?"
>  


