I’ve learned these things the hard way from weird behavior in production, 
mostly due to my own mistakes.

I had to debug some really strange results from my configs at Netflix. It turns 
out that you don’t want the movie “Saw” to match “see”, for example. :-) And 
there were several movie titles that completely disappeared after stopword 
removal. Oops. I wrote that up here:

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

My favorite was “To Be and To Have (Être et Avoir)”, which is all stopwords in 
two languages. A great movie, too.

The biggest hassle was a movie titled “+/-”, but that is a different problem.

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/  (my blog)

> On Sep 27, 2022, at 12:49 PM, Miguel Joy <[email protected]> 
> wrote:
> 
> Hi Walter,
> 
> Thanks very much for your honest feedback.  As I mentioned, I inherited this 
> application so I've been trying to pick up the pieces as best I can.  The 
> solr analysis tool is great, so it's now clear to me how to make changes to 
> the analysis chain and test them using the analysis tool.  I suspect we'll 
> end up having to clean this configuration up and re-index the documents.  
> Again, thanks to you and Markus for the support.  I have what I need now.
> 
> -Miguel
> 
> -----Original Message-----
> From: Walter Underwood <[email protected]>
> Sent: Tuesday, September 27, 2022 2:25 PM
> To: [email protected]
> Subject: Re: Solr Search - Mixed Case Issue
> 
> CAUTION: This email originated from outside the organization. Do not click 
> links or open attachments unless you recognize the sender and expect that the 
> content is safe.
> 
> Honestly, this analysis chain is a mess.
> 
> * StandardTokenizer has parsing support for email addresses, so that is a 
> better choice.
> * Never mix phonetic transformation and stemming; use separate chains. 
> Phonetic tokens aren’t stemmable.
> * Don’t stem email addresses.
> * Don’t do phonetic transforms on email addresses unless you really want that.
> * Don’t remove stopwords ever, but especially for email addresses.
> * Don’t do word delimiter splitting on email addresses unless you really want 
> that.
> 
> For stopwords, let’s assume that “in” is in stopwords.txt. That means it 
> corrupts every email address from India, since those commonly end in the 
> .in country domain.
> 
> Instead, use a chain that looks like this. You shouldn’t need separate index 
> and query chains.
> 
> * StandardTokenizerFactory
> * LowerCaseFilterFactory
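> 
> A minimal sketch of that two-step chain as a schema.xml field type. The 
> name "text_email" is a placeholder, not something from the original 
> schema; the factory class names are the stock Solr ones. A single 
> <analyzer> element with no type attribute applies to both indexing and 
> querying.
> 
> <fieldType name="text_email" class="solr.TextField"
>            positionIncrementGap="100">
>   <!-- no type="index"/"query": one chain serves both sides -->
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> If each address needs to stay a single token, it may be worth comparing 
> solr.UAX29URLEmailTokenizerFactory in the analysis tool, since the Solr 
> reference guide describes recent StandardTokenizer versions as breaking 
> at "@".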
> 
> Using HTMLStripCharFilterFactory for preprocessing probably doesn’t hurt, but 
> shouldn’t be necessary. If someone is using “&gt;” in your content or 
> queries, things are a little weird.
> 
> I do like to use Unicode normalization to take care of stuff like curly 
> quotes. That also has more complete lowercasing support. You’ll probably need 
> to include the ICU libraries.
> 
> <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" 
> mode="compose"/>
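> 
> A sketch of where that filter could sit, assuming the ICU jars ship under 
> contrib/analysis-extras as in a stock Solr 5.x install. The lib paths and 
> the field type name "text_email_icu" are assumptions; adjust them to your 
> layout. Because nfkc_cf also case-folds, this sketch drops the separate 
> lowercase filter. The name attribute follows the line above; check the 
> reference guide for your Solr version, since the attribute may be spelled 
> differently there.
> 
> <!-- solrconfig.xml: load the ICU analysis jars (assumed paths) -->
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib"
>      regex=".*\.jar"/>
> <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
>      regex=".*\.jar"/>
> 
> <!-- schema.xml: tokenize, then normalize and case-fold with ICU -->
> <fieldType name="text_email_icu" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf"
>             mode="compose"/>
>   </analyzer>
> </fieldType>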
> 
> To test, make an analysis chain like this, then use the analysis tool in the 
> UI to see if it does what you want. If it does that, then you can reindex.
> 
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
> 
>> On Sep 27, 2022, at 8:06 AM, Markus Jelsma <[email protected]> wrote:
>> 
>> Hello Miguel,
>> 
>> That's likely due to catenateAll/catenateWords. McNeil is first split
>> into 'mc' and 'neil', so you can find it using 'mc neil', but not
>> 'mcneil'. With the catenate* settings enabled, the split parts 'mc' and
>> 'neil' are also joined back into 'mcneil'.
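>> 
>> As an illustration only, a WordDelimiterFilterFactory line with
>> catenation switched on might look like the following. The attribute
>> values other than catenateAll are copied from the index-time analyzer
>> quoted further down; setting catenateAll="1" is the assumption here, and
>> the same line would need to appear in both analyzers.
>> 
>> <!-- splits McNeil into Mc + Neil and also emits the rejoined McNeil -->
>> <filter class="solr.WordDelimiterFilterFactory"
>>         generateWordParts="1" generateNumberParts="1"
>>         catenateWords="1" catenateNumbers="1" catenateAll="1"
>>         splitOnCaseChange="1" splitOnNumerics="0"/>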
>> 
>> If you haven't already, use Solr's analysis GUI [1] for testing these
>> configurations. It shows step by step what becomes of the index- and
>> query-time analysis chains, and if they match up in the end.
>> 
>> Regards,
>> Markus
>> 
>> [1] http://localhost:8983/solr/#/COLLECTION/analysis
>> 
>> On Tue, Sep 27, 2022 at 16:54, Miguel Joy <[email protected]> wrote:
>> 
>>> Hi Markus,
>>> 
>>> Thanks so much for your recommendations.  Matching the
>>> splitOnCaseChange attributes at index time and query time partially
>>> fixed our issue.
>>> Now, if I search for [email protected] and provide the exact same case
>>> as the email is stored, I get a successful result!  However, if I
>>> search using [email protected] (all lower-case), it doesn't match.
>>> Essentially, only if I search using the exact same case as the email
>>> is stored do I get results.  Any additional ideas on how I can get
>>> the email search to fully work?  Thanks again for your help.
>>> 
>>> -Miguel
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Miguel Joy
>>> Sent: Tuesday, September 27, 2022 6:43 AM
>>> To: [email protected]
>>> Subject: RE: Solr Search - Mixed Case Issue
>>> 
>>> Hi Markus,
>>> 
>>> Thanks for your prompt reply to my issue.  I will try your
>>> suggestions and report back.
>>> 
>>> Thanks,
>>> -Miguel
>>> 
>>> -----Original Message-----
>>> From: Markus Jelsma <[email protected]>
>>> Sent: Tuesday, September 27, 2022 6:36 AM
>>> To: [email protected]
>>> Subject: Re: Solr Search - Mixed Case Issue
>>> 
>>> Hello Miguel,
>>> 
>>> The problem lies with the different index-time and query-time
>>> WordDelimiterFilter configurations.
>>> 
>>>> In addition, it's strange that we get search results on some mixed
>>>> case email addresses
>>> 
>>> Yes, precisely!
>>> 
>>> See the splitOnCaseChange attributes, that is where the problem is.
>>> In your case you should be able to copy the index-time configuration
>>> to the query-time and get rid of the problem without reindex. It
>>> 'should' solve the problem. If not, try to enable catenateAll, on
>>> both sides, but that requires reindex.
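>>> 
>>> Concretely, that would mean the query analyzer carrying the same
>>> WordDelimiterFilterFactory line as the index analyzer. A sketch, with
>>> every attribute value taken from the index-time definition quoted below:
>>> 
>>> <!-- inside <analyzer type="query">, replacing the existing line -->
>>> <filter class="solr.WordDelimiterFilterFactory"
>>>         generateWordParts="1" generateNumberParts="1"
>>>         catenateWords="1" catenateNumbers="1" catenateAll="0"
>>>         splitOnCaseChange="1" splitOnNumerics="0"/>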
>>> 
>>> Ideally you should probably also get rid of the StopFilterFactory.
>>> Unless it is very well configured (which I do not suspect), it will
>>> cause additional weird problems. Removing it does require reindexing.
>>> 
>>> Regards,
>>> Markus
>>> 
>>> On Tue, Sep 27, 2022 at 11:55, Miguel Joy <[email protected]> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I'm new to Solr and recently inherited a Solr application (version
>>>> 5.4) from a previous developer with very little documentation.  At
>>>> any rate, my problem is this:
>>>> 
>>>> I have some email addresses that are stored as mixed case.
>>>> 
>>>> [email protected] = Success [querying for this email address and
>>>> passing in the full email address in any case [upper or lower]
>>>> returns the correct result]
>>>> 
>>>> [email protected] = Fail [querying for this email address and
>>>> passing in the full email address in any case [upper or lower]
>>>> returns zero results]
>>>> 
>>>> And here's the fieldType definition that's used for email addresses:
>>>> 
>>>> <fieldType name="text_phonetic" class="solr.TextField"
>>>>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true" words="stopwords.txt"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1" generateNumberParts="1"
>>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
>>>>             splitOnCaseChange="1" splitOnNumerics="0"/>
>>>>     <filter class="solr.PhoneticFilterFactory"
>>>>             encoder="Caverphone" inject="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory"
>>>>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true" words="stopwords.txt"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1" generateNumberParts="1"
>>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>>>>             splitOnCaseChange="0" splitOnNumerics="0"/>
>>>>     <filter class="solr.PhoneticFilterFactory"
>>>>             encoder="Caverphone" inject="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.KeywordMarkerFilterFactory"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.PorterStemFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>> 
>>>> I've spent a couple of days researching this issue, and my best guess
>>>> at a fix would be to re-index this data using the
>>>> LowerCaseFilterFactory so that all email addresses are stored in
>>>> lower case, but that would be a significant change as I have over 10
>>>> million docs indexed.  In addition, it's strange that we get search
>>>> results on some mixed case email addresses, but not all, so I'm
>>>> hoping that maybe all we need is to tweak the query analyzer?
>>>> Thanks in advance for your help with this question.  Please let me
>>>> know if you need any additional details.
>>>> 
>>>> -Miguel
>>>> 
>>>> 
>>>> 
>>>> ________________________________
>>>> 
>>>> Notice: GBT Travel Services UK Limited (GBT UK) and its authorised
>>>> sublicensees (including Ovation Travel Group and Egencia) use
>>>> certain trademarks and service marks of American Express Company or
>>>> its subsidiaries (American Express) in the 'American Express Global
>>>> Business Travel' and 'American Express Meetings & Events' brands and
>>>> in connection with its business for permitted uses only under a
>>>> limited licence from American Express (Licensed Marks). The Licensed
>>>> Marks are trademarks or service marks of, and the property of,
>>>> American Express. GBT UK is a subsidiary of Global Business Travel
>>>> Group, Inc. (NYSE: GBTG). American Express holds a minority interest
>>>> in GBTG, which operates as a separate company from American Express.
>>>> 
>>>> ________________________________
>>>> 
>>>> This email message and all attachments transmitted with it are
>>>> solely for the use of the intended recipient(s) and may contain
>>>> confidential and/or privileged information. If the reader of this
>>>> message is not the intended recipient, you are hereby notified that
>>>> any dissemination, distribution, copying and/or other use of this
>>>> message or its attachments is strictly prohibited. If you have
>>>> received this message in error, please notify the sender and delete it 
>>>> immediately.
>>>> Unintended transmission shall not constitute a waiver of the
>>>> attorney-client or any other privilege.
>>>> 
>>>> ________________________________
>>>> Avis : GBT Travel Services UK Limited (GBT UK) et ses détenteurs de
>>>> sous-licence autorisés (notamment Ovation Travel Group et Egencia)
>>>> utilise certaines marques commerciales et marques de services
>>>> d'American Express Company ou de ses filiales (American Express)
>>>> dans les marques « American Express Global Business Travel » et «
>>>> American Express Meetings & Events » ainsi qu'en lien avec son
>>>> activité, à des fins autorisées uniquement, sous une licence limitée
>>>> accordée par American Express (marques sous licence).
>>>> Les marques sous licence sont des marques commerciales ou des
>>>> marques de services d'American Express, dont elles sont la
>>>> propriété. GBT UK est une filiale de Global Business Travel Group, Inc.
>>>> (NYSE : GBTG).
>>>> American Express détient une participation minoritaire dans GBTG,
>>>> qui opère en tant que société distincte d'American Express.
>>>> 
>>>> ________________________________
>>>> 
>>>> Ce message électronique et toutes les pièces jointes transmises avec
>>>> celui-ci sont uniquement destinés à l'usage du ou des destinataires
>>>> visés et peuvent contenir des informations confidentielles et/ou
>>>> privilégiées. Si le lecteur de ce message n'est pas le destinataire
>>>> prévu, vous êtes informé
>>>> par la présente que toute diffusion, distribution, copie et/ou autre
>>>> utilisation de ce message ou de ses pièces jointes est strictement
>>>> interdite. Si vous avez reçu ce message par erreur, veuillez en
>>>> informer l'expéditeur et le supprimer immédiatement. Une
>>>> transmission involontaire ne constitue pas une renonciation au
>>>> secret professionnel ou à toute autre prérogative.
>>>> 
>>>> ________________________________
>>>> 
