I’ve learned these things the hard way from weird behavior in production, mostly due to my own mistakes.
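
For anyone landing on this thread later, the simplified chain recommended in the quoted message below might look roughly like this in the schema. This is only a sketch: the field type name text_email is hypothetical, and using a single untyped analyzer element is an assumption about how "no separate index and query chains" would be written. One thing worth confirming in the analysis tool: the Solr Reference Guide notes that the Standard Tokenizer treats "@" as a token break, so an address comes out as more than one token. If a single token per address is wanted, solr.UAX29URLEmailTokenizerFactory is the tokenizer documented for that.

```xml
<!-- Sketch only. "text_email" is a made-up name, not part of the original schema. -->
<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both indexing and querying. -->
  <analyzer>
    <!-- Swap in solr.UAX29URLEmailTokenizerFactory to keep each address as one token. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase at index and query time so case never affects matching. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Deliberately absent: stopword, word delimiter, phonetic, and stemming filters. -->
  </analyzer>
</fieldType>
```

Changing the analysis of an existing field still means reindexing, so it is probably safer to try a definition like this on a new field in the analysis tool first.
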
I had to debug some really strange results from my configs at Netflix. It turns out that you don’t want the movie “Saw” to match “see”, for example. :-)

And there were several movie titles that completely disappeared after stopword removal. Oops. I wrote that up here:

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

My favorite was “To Be and To Have (Être et Avoir)”, which is all-stopwords in two languages. A great movie, too.

The biggest hassle was a movie titled “+/-”, but that is a different problem.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Sep 27, 2022, at 12:49 PM, Miguel Joy <[email protected]> wrote:
>
> Hi Walter,
>
> Thanks very much for your honest feedback. As I mentioned, I inherited this
> application so I've been trying to pick up the pieces as best I can. The
> solr analysis tool is great, so it's now clear to me how to make changes to
> the analysis chain and test them using the analysis tool. I suspect we'll
> end up having to clean this configuration up and re-index the documents.
> Again, thanks to you and Markus for the support. I have what I need now.
>
> -Miguel
>
> -----Original Message-----
> From: Walter Underwood <[email protected]>
> Sent: Tuesday, September 27, 2022 2:25 PM
> To: [email protected]
> Subject: Re: Solr Search - Mixed Case Issue
>
> CAUTION: This email originated from outside the organization. Do not click
> links or open attachments unless you recognize the sender and expect that the
> content is safe.
>
> Honestly, this analysis chain is a mess.
>
> * StandardTokenizer has parsing support for email addresses, so that is a
>   better choice.
> * Never mix phonetic transformation and stemming, use different chains.
>   Phonetic tokens aren’t stemmable.
> * Don’t stem email addresses.
> * Don’t do phonetic transforms on email addresses unless you really want that.
> * Don’t remove stopwords ever, but especially for email addresses.
> * Don’t do word delimiter splitting on email addresses unless you really want
>   that.
>
> For stopwords, let’s assume that “in” is in stopwords.txt. That means it
> corrupts every email address from India.
>
> Instead, use a chain that looks like this. You shouldn’t need separate index
> and query chains.
>
> * StandardTokenizerFactory
> * LowerCaseFilterFactory
>
> Using HTMLStripCharFilterFactory for preprocessing probably doesn’t hurt, but
> shouldn’t be necessary. If someone is using “>” in your content or
> queries, things are a little weird.
>
> I do like to use Unicode normalization to take care of stuff like curly
> quotes. That also has more complete lowercasing support. You’ll probably need
> to include the ICU libraries.
>
> <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
>
> To test, make an analysis chain like this, then use the analysis tool in the
> UI to see if it does what you want. If it does that, then you can reindex.
>
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
>
>> On Sep 27, 2022, at 8:06 AM, Markus Jelsma <[email protected]> wrote:
>>
>> Hello Miguel,
>>
>> That's likely due to catenateAll/catenateWords. McNeil is first split,
>> so you can find it using 'mc neil', but not 'mcneil'. With the
>> catenate* settings, the terms 'mc' and 'neil' split from 'McNeil' can
>> become 'mcneil' again.
>>
>> If you haven't already, use Solr's analysis GUI [1] for testing these
>> configurations. It shows step by step what becomes of the index- and
>> query-time analysis chains, and if they match up in the end.
>>
>> Regards,
>> Markus
>>
>> [1] http://localhost:8983/solr/#/COLLECTION/analysis
>>
>> On Tue, Sep 27, 2022 at 16:54, Miguel Joy <[email protected]> wrote:
>>
>>> Hi Markus,
>>>
>>> Thanks so much for your recommendations. Matching the index-time
>>> splitOnCaseChange attributes with the query-time ones partially
>>> fixed our issue.
>>> Now, if I search for [email protected] and provide the exact same
>>> case as the email is stored, I get a successful result! However, if
>>> I search using [email protected] (all lower-case), it doesn't match.
>>> Essentially, only if I search using the exact same case as the email
>>> is stored do I get results. Any additional ideas on how I can get
>>> the email search to fully work? Thanks again for your help.
>>>
>>> -Miguel
>>>
>>> -----Original Message-----
>>> From: Miguel Joy
>>> Sent: Tuesday, September 27, 2022 6:43 AM
>>> To: [email protected]
>>> Subject: RE: Solr Search - Mixed Case Issue
>>>
>>> Hi Markus,
>>>
>>> Thanks for your prompt reply to my issue. I will try your
>>> suggestions and report back.
>>>
>>> Thanks,
>>> -Miguel
>>>
>>> -----Original Message-----
>>> From: Markus Jelsma <[email protected]>
>>> Sent: Tuesday, September 27, 2022 6:36 AM
>>> To: [email protected]
>>> Subject: Re: Solr Search - Mixed Case Issue
>>>
>>> Hello Miguel,
>>>
>>> The problem lies with the different index-time and query-time
>>> WordDelimiterFilter configurations.
>>>
>>>> In addition, it's strange that we get search results on some mixed
>>>> case email addresses
>>>
>>> Yes, precisely!
>>>
>>> See the splitOnCaseChange attributes, that is where the problem is.
>>> In your case you should be able to copy the index-time configuration
>>> to the query-time and get rid of the problem without reindexing. It
>>> 'should' solve the problem. If not, try to enable catenateAll, on
>>> both sides, but that requires a reindex.
>>>
>>> Ideally you should probably also get rid of the StopFilterFactory.
>>> Unless it is very well configured (which I do not suspect), it will
>>> cause additional weird problems. This does require reindexing.
>>>
>>> Regards,
>>> Markus
>>>
>>> On Tue, Sep 27, 2022 at 11:55, Miguel Joy <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm new to Solr and recently inherited a Solr application (version
>>>> 5.4) from a previous developer with very little documentation. At
>>>> any rate, my problem is this:
>>>>
>>>> I have some email addresses that are stored as mixed case.
>>>>
>>>> [email protected] = Success [querying for this email
>>>> address and passing in the full email address in any case [upper or
>>>> lower] returns the correct result]
>>>>
>>>> [email protected] = Fail [querying for this email
>>>> address and passing in the full email address in any case [upper or
>>>> lower] returns zero results]
>>>>
>>>> And here's the fieldType definition that's used for email addresses:
>>>>
>>>> <fieldType name="text_phonetic" class="solr.TextField"
>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>     />
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
>>>>             splitOnNumerics="0"/>
>>>>     <filter class="solr.PhoneticFilterFactory"
>>>>             encoder="Caverphone" inject="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory"
>>>>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>     />
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
>>>>             splitOnNumerics="0"/>
>>>>     <filter class="solr.PhoneticFilterFactory"
>>>>             encoder="Caverphone" inject="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> I've spent a couple days researching this issue, and my best guess
>>>> at a fix would be to re-index this data using the
>>>> LowerCaseFilterFactory so that all email addresses are stored in
>>>> lower case, but that would be a significant change as I have over 10
>>>> million docs indexed. In addition, it's strange that we get search
>>>> results on some mixed case email addresses, but not all, so I'm
>>>> hoping that maybe all we need is to tweak the query analyzer?
>>>> Thanks in advance for your help with this question. Please let me know if
>>>> you need any additional details.
>>>>
>>>> -Miguel
>>>>
>>>> ________________________________
>>>>
>>>> Notice: GBT Travel Services UK Limited (GBT UK) and its authorised
>>>> sublicensees (including Ovation Travel Group and Egencia) use
>>>> certain trademarks and service marks of American Express Company or
>>>> its subsidiaries (American Express) in the 'American Express Global
>>>> Business Travel' and 'American Express Meetings & Events' brands and
>>>> in connection with its business for permitted uses only under a
>>>> limited licence from American Express (Licensed Marks). The Licensed
>>>> Marks are trademarks or service marks of, and the property of,
>>>> American Express. GBT UK is a subsidiary of Global Business Travel
>>>> Group, Inc. (NYSE: GBTG). American Express holds a minority interest
>>>> in GBTG, which operates as a separate company from American Express.
>>>>
>>>> ________________________________
>>>>
>>>> This email message and all attachments transmitted with it are
>>>> solely for the use of the intended recipient(s) and may contain
>>>> confidential and/or privileged information. If the reader of this
>>>> message is not the intended recipient, you are hereby notified that
>>>> any dissemination, distribution, copying and/or other use of this
>>>> message or its attachments is strictly prohibited. If you have
>>>> received this message in error, please notify the sender and delete it
>>>> immediately.
>>>> Unintended transmission shall not constitute a waiver of the
>>>> attorney-client or any other privilege.
