Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are
working with a fixed vocabulary (movie titles, product names), that can work
just fine.

babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man
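
For anyone wiring a list like this into Solr, here is a minimal sketch. It
assumes the lines above are saved as synonyms.txt in the collection's conf
directory and applied only at query time. The file name and the field type
name text_en_syn are hypothetical, not taken from the original setup.

```xml
<!-- Sketch only: "text_en_syn" and "synonyms.txt" are assumed names. -->
<fieldType name="text_en_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: plain tokenization and lowercasing, no synonyms. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: expand each comma-separated line of synonyms.txt
       into all of its equivalent forms. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Keeping the synonym filter on the query side avoids a reindex whenever the
list changes. Multi-word entries such as "spider man" should be checked in
the Analysis screen, since how they match depends on the tokenizer.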

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 25, 2020, at 9:26 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:
> 
> *FilterFactory classes are _not_ what you want in this case; they are applied to 
> individual tokens after parsing.
> 
> *CharFilterFactory classes are invoked on the entire input to the field, although I 
> can’t say for certain that even that’s early enough.
> 
> There are two other options to consider:
> StatelessScriptUpdateProcessor
> FieldMutatingUpdateProcessor
> 
> Stateless... is probably easiest…
> 
> Best,
> Erick
> 
>> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez 
>> <samuel.gutier...@iherb.com.INVALID> wrote:
>> 
>> Are there any good workarounds/parameters we can use to fix this so it
>> doesn't have to be solved client side?
>> 
>> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder <msporle...@gmail.com>
>> wrote:
>> 
>>> Is the normal/standard solution here to regex remove the '-'s and
>>> combine them into a single token?
>>> 
>>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>> 
>>>> This is a common point of confusion. There are two phases for creating a
>>>> query: query _parsing_ first, then the analysis chain for the parsed result.
>>>> 
>>>> So what e-dismax sees in the two cases is:
>>>> 
>>>> Name_enUS:“high tech” -> two tokens; since there are two of them, pf2 comes into play.
>>>> 
>>>> Name_enUS:“high-tech” -> there’s only one token, so pf2 doesn’t apply; splitting it on the hyphen comes later.
>>>> 
>>>> It’s especially confusing since the field analysis then breaks up “high-tech” into two tokens that
>>>> look the same as “high tech” in the debug response, just without the phrase query.
>>>> 
>>>> Name_enUS:high
>>>> Name_enUS:tech
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <samuel.gutier...@iherb.com.INVALID> wrote:
>>>>> 
>>>>> I am troubleshooting an issue with ranking for search terms that contain a
>>>>> "-" vs. the same query without the dash, e.g. "high-tech" vs. "high tech".
>>>>> The field that I am querying uses the standard tokenizer, so I would expect
>>>>> the underlying Lucene query to be the same for both versions of the query.
>>>>> However, when printing the debug output, it appears they are generated
>>>>> differently. I know "-" must be escaped as it has special meaning in
>>>>> Lucene, but escaping does not fix the problem. It appears that with the
>>>>> "-" present, the pf2 edismax parameter is not respected and is omitted from
>>>>> the final query. We use sow=false as we have multiterm synonyms and need to
>>>>> ensure they are included in the final Lucene query. My expectation is that
>>>>> the final underlying Lucene query should be based on the output of the
>>>>> field analyzer. However, after briefly looking at the code for
>>>>> ExtendedDismaxQParser, it appears that there is some string processing
>>>>> happening outside of the analysis step, which causes the unexpected Lucene
>>>>> query.
>>>>> 
>>>>> 
>>>>> Solr Debug for "high tech":
>>>>> 
>>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>>> (Name_enUS:"high tech"~4)~0.4",
>>>>> 
>>>>> 
>>>>> Solr Debug for "high-tech"
>>>>> 
>>>>> parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
>>>>> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
>>>>> tech"~5)~0.4)",
>>>>> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
>>>>> (Name_enUS:"high tech"~5)~0.4"
>>>>> 
>>>>> SolrConfig:
>>>>> 
>>>>> <requestHandler name="/search" class="solr.SearchHandler">
>>>>>  <lst name="defaults">
>>>>>    <str name="omitHeader">true</str>
>>>>>    <str name="indent">true</str>
>>>>>    <str name="wt">json</str>
>>>>>    <str name="mm">3&lt;75%</str>
>>>>>    <str name="qf">Name_enUS</str>
>>>>>    <str name="pf">Name_enUS</str>
>>>>>    <str name="ps">5</str>    <!---->
>>>>>    <str name="pf2">Name_enUS</str>
>>>>>    <str name="ps2">4</str>   <!---->
>>>>>    <str name="qs">3</str>    <!---->
>>>>>    <str name="tie">0.4</str>
>>>>>    <str name="echoParams">explicit</str>
>>>>>    <int name="rows">100</int>
>>>>>    <str name="sow">false</str>
>>>>>  </lst>
>>>>>  <lst name="invariants">
>>>>>    <str name="defType">edismax</str>
>>>>>  </lst>
>>>>> </requestHandler>
>>>>> 
>>>>> Schema:
>>>>> 
>>>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>>>    <analyzer>
>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>      <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>>>      <filter class="solr.SnowballPorterFilterFactory"/>
>>>>>    </analyzer>
>>>>> </fieldType>
>>>>> 
>>>>> 
>>>>> Using Solr 8.6.3
>>>>> 
>>> 
>> 
> 
