Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are working with a fixed vocabulary (movie titles, product names), that can work just fine.
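For a vocabulary of compound terms, the variants can be generated rather than typed by hand. A minimal sketch in Python, assuming each term is supplied as its component words and that the output is the comma-separated equivalent-synonym format read by Solr's synonym filters. The function names are hypothetical, and how the hyphenated forms get tokenized still depends on the analyzer configured for the synonym filter:

```python
def compound_variants(*parts: str) -> list[str]:
    """Return the joined, hyphenated, and spaced spellings of a compound term."""
    words = [p.strip().lower() for p in parts if p.strip()]
    if len(words) < 2:
        raise ValueError("a compound term needs at least two parts")
    return ["".join(words), "-".join(words), " ".join(words)]


def synonym_line(*parts: str) -> str:
    """Format the variants as one comma-separated line of equivalent synonyms."""
    return ", ".join(compound_variants(*parts))


if __name__ == "__main__":
    # Hypothetical input: component words for each compound in the vocabulary.
    vocabulary = [
        ("baby", "sitter"),
        ("full", "metal"),
        ("man", "hunter"),
        ("spider", "man"),
    ]
    for parts in vocabulary:
        print(synonym_line(*parts))
```

Run as a script, this prints the same four entries as the list that follows.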
babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 25, 2020, at 9:26 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:
>
> *FilterFactory are _not_ what you want in this case, they are applied to
> individual tokens after parsing
>
> *CharFilterFactory are invoked on the entire input to the field, although I
> can’t say for certain that even that’s early enough.
>
> There are two other options to consider:
> StatelessScriptUpdateProcessor
> FieldMutatingUpdateProcessor
>
> Stateless... is probably easiest…
>
> Best,
> Erick
>
>> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez <samuel.gutier...@iherb.com.INVALID> wrote:
>>
>> Are there any good workarounds/parameters we can use to fix this so it
>> doesn't have to be solved client side?
>>
>> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder <msporle...@gmail.com> wrote:
>>
>>> Is the normal/standard solution here to regex remove the '-'s and
>>> combine them into a single token?
>>>
>>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> This is a common point of confusion. There are two phases for creating a query,
>>>> query _parsing_ first, then the analysis chain for the parsed result.
>>>>
>>>> So what e-dismax sees in the two cases is:
>>>>
>>>> Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes into play.
>>>>
>>>> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply, splitting it on the hyphen comes later.
>>>>
>>>> It’s especially confusing since the field analysis then breaks up “high-tech” into two tokens that
>>>> look the same as “high tech” in the debug response, just without the phrase query.
>>>>
>>>> Name_enUS:high
>>>> Name_enUS:tech
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <samuel.gutier...@iherb.com.INVALID> wrote:
>>>>>
>>>>> I am troubleshooting an issue with ranking for search terms that contain a
>>>>> "-" vs the same query that does not contain the dash e.g. "high-tech" vs
>>>>> "high tech". The field that I am querying is using the standard tokenizer,
>>>>> so I would expect that the underlying lucene query should be the same for
>>>>> both versions of the query, however when printing the debug, it appears
>>>>> they are generated differently. I know "-" must be escaped as it has
>>>>> special meaning in lucene, however escaping does not fix the problem. It
>>>>> appears that with the "-" present, the pf2 edismax parameter is not
>>>>> respected and omitted from the final query. We use sow=false as we have
>>>>> multiterm synonyms and need to ensure they are included in the final lucene
>>>>> query. My expectation is that the final underlying lucene query should be
>>>>> based on the output of the field analyzer, however after briefly looking
>>>>> at the code for ExtendedDismaxQParser, it appears that there is some string
>>>>> processing happening outside of the analysis step which causes the
>>>>> unexpected lucene query.
>>>>>
>>>>>
>>>>> Solr Debug for "high tech":
>>>>>
>>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>>> (Name_enUS:"high tech"~4)~0.4",
>>>>>
>>>>>
>>>>> Solr Debug for "high-tech"
>>>>>
>>>>> parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
>>>>> Name_enUS:tech)~2))~0.4)
>>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)",
>>>>> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
>>>>> (Name_enUS:"high tech"~5)~0.4"
>>>>>
>>>>> SolrConfig:
>>>>>
>>>>> <requestHandler name="/search" class="solr.SearchHandler">
>>>>>   <lst name="defaults">
>>>>>     <str name="omitHeader">true</str>
>>>>>     <str name="indent">true</str>
>>>>>     <str name="wt">json</str>
>>>>>     <str name="mm">3&lt;75%</str>
>>>>>     <str name="qf">Name_enUS</str>
>>>>>     <str name="pf">Name_enUS</str>
>>>>>     <str name="ps">5</str> <!---->
>>>>>     <str name="pf2">Name_enUS</str>
>>>>>     <str name="ps2">4</str> <!---->
>>>>>     <str name="qs">3</str> <!---->
>>>>>     <str name="tie">0.4</str>
>>>>>     <str name="echoParams">explicit</str>
>>>>>     <int name="rows">100</int>
>>>>>     <str name="sow">false</str>
>>>>>   </lst>
>>>>>   <lst name="invariants">
>>>>>     <str name="defType">edismax</str>
>>>>>   </lst>
>>>>> </requestHandler>
>>>>>
>>>>> Schema:
>>>>>
>>>>> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer>
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>>
>>>>> Using Solr 8.6.3
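
On the regex question in the quoted thread: if a thin layer in front of Solr is acceptable (an assumption, since the original ask was to avoid client-side fixes), the hyphen can be turned into a space before the query parser ever sees it, so that "high-tech" arrives as two terms and pf2 has something to work with. A minimal sketch in Python; the function name is hypothetical and it is meant for plain user text, not for queries that already carry field names, ranges, or dates:

```python
import re

# A hyphen with a word character on both sides, e.g. "high-tech".
# A leading "-" (Lucene's prohibit operator, as in "-tech") is not matched.
_INTRAWORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(query: str) -> str:
    """Replace intra-word hyphens with spaces; leave other hyphens alone."""
    return _INTRAWORD_HYPHEN.sub(" ", query)


if __name__ == "__main__":
    print(split_hyphenated("high-tech"))        # two whitespace-separated terms
    print(split_hyphenated("-tech high-tech"))  # the leading "-" is preserved
```

This goes the opposite way from joining the halves into a single token: it keeps both words, which is what the phrase-boost parameters need.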