Re: Query generation is different for search terms with and without "-"

Erick Erickson Wed, 25 Nov 2020 09:27:39 -0800

Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:

*FilterFactory classes are _not_ what you want in this case; they are applied
to individual tokens after parsing.

*CharFilterFactory classes are invoked on the entire input to the field,
although I can’t say for certain that even that’s early enough.
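
For reference, a char filter is declared inside the analyzer, ahead of the
tokenizer. The following is only a sketch, assuming the text_en field type
from the original post. The pattern and replacement values are assumptions:
a space splits the term, while an empty replacement would join "high-tech"
into a single token, as asked about earlier in the thread. Since this still
runs during field analysis, it may be too late to change how edismax builds
the pf2 clause.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: rewrite "-" before tokenizing. Char filters see the raw
         field input, so this runs ahead of the tokenizer and token filters. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="-" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>
```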

There are two other options to consider:
StatelessScriptUpdateProcessor
FieldMutatingUpdateProcessor

Stateless... is probably easiest…
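
As a rough illustration of the update-processor route, the sketch below uses
RegexReplaceProcessorFactory, one of the stock field-mutating processors,
instead of a script. The chain name is made up, and the field name and
pattern are assumptions taken from the original post. Update processors run
at index time, so this changes what gets indexed and does not rewrite the
incoming query string.

```xml
<!-- "strip-hyphens" is a hypothetical chain name -->
<updateRequestProcessorChain name="strip-hyphens">
  <processor class="solr.RegexReplaceProcessorFactory">
    <!-- Assumption: apply only to the field queried in the original post -->
    <str name="fieldName">Name_enUS</str>
    <str name="pattern">-</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```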

Best,
Erick

> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez 
> <samuel.gutier...@iherb.com.INVALID> wrote:
> 
> Are there any good workarounds/parameters we can use to fix this so it
> doesn't have to be solved client side?
> 
> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder <msporle...@gmail.com>
> wrote:
> 
>> Is the normal/standard solution here to regex remove the '-'s and
>> combine them into a single token?
>> 
>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>> 
>>> This is a common point of confusion. There are two phases for creating a
>>> query: query _parsing_ first, then the analysis chain for the parsed result.
>>> 
>>> So what e-dismax sees in the two cases is:
>>> 
>>> Name_enUS:“high tech” -> two tokens; since there are two of them, pf2
>>> comes into play.
>>> 
>>> Name_enUS:“high-tech” -> there’s only one token, so pf2 doesn’t apply;
>>> splitting it on the hyphen comes later.
>>> 
>>> It’s especially confusing since the field analysis then breaks up
>>> “high-tech” into two tokens that look the same as “high tech” in the
>>> debug response, just without the phrase query.
>>> 
>>> Name_enUS:high
>>> Name_enUS:tech
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez
>>>> <samuel.gutier...@iherb.com.INVALID> wrote:
>>>> 
>>>> I am troubleshooting an issue with ranking for search terms that contain a
>>>> "-" vs the same query that does not contain the dash e.g. "high-tech" vs
>>>> "high tech". The field that I am querying is using the standard tokenizer,
>>>> so I would expect that the underlying lucene query should be the same for
>>>> both versions of the query, however when printing the debug, it appears
>>>> they are generated differently. I know "-" must be escaped as it has
>>>> special meaning in lucene, however escaping does not fix the problem. It
>>>> appears that with the "-" present, the pf2 edismax parameter is not
>>>> respected and omitted from the final query. We use sow=false as we have
>>>> multiterm synonyms and need to ensure they are included in the final lucene
>>>> query. My expectation is that the final underlying lucene query should be
>>>> based on the output of the field analyzer, however after briefly looking
>>>> at the code for ExtendedDismaxQParser, it appears that there is some string
>>>> processing happening outside of the analysis step which causes the
>>>> unexpected lucene query.
>>>> 
>>>> 
>>>> Solr Debug for "high tech":
>>>> 
>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>> (Name_enUS:"high tech"~4)~0.4",
>>>> 
>>>> 
>>>> Solr Debug for "high-tech"
>>>> 
>>>> parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
>>>> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
>>>> tech"~5)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
>>>> (Name_enUS:"high tech"~5)~0.4"
>>>> 
>>>> SolrConfig:
>>>> 
>>>> <requestHandler name="/search" class="solr.SearchHandler">
>>>>   <lst name="defaults">
>>>>     <str name="omitHeader">true</str>
>>>>     <str name="indent">true</str>
>>>>     <str name="wt">json</str>
>>>>     <str name="mm">3&lt;75%</str>
>>>>     <str name="qf">Name_enUS</str>
>>>>     <str name="pf">Name_enUS</str>
>>>>     <str name="ps">5</str>
>>>>     <str name="pf2">Name_enUS</str>
>>>>     <str name="ps2">4</str>
>>>>     <str name="qs">3</str>
>>>>     <str name="tie">0.4</str>
>>>>     <str name="echoParams">explicit</str>
>>>>     <int name="rows">100</int>
>>>>     <str name="sow">false</str>
>>>>   </lst>
>>>>   <lst name="invariants">
>>>>     <str name="defType">edismax</str>
>>>>   </lst>
>>>> </requestHandler>
>>>> 
>>>> Schema:
>>>> 
>>>> <fieldType name="text_en" class="solr.TextField"
>>>>            positionIncrementGap="100">
>>>>     <analyzer>
>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>>       <filter class="solr.SnowballPorterFilterFactory"/>
>>>>     </analyzer>
>>>> </fieldType>
>>>> 
>>>> 
>>>> Using Solr 8.6.3
>>>> 
>> 
> 
