Hello Miguel,

That's likely due to catenateAll/catenateWords. McNeil is split first, so you can find it using 'mc neil', but not 'mcneil'. With the catenate* settings, the parts 'mc' and 'neil' split from 'McNeil' can be joined into 'mcneil' again.
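As a rough illustration of the splitting and catenation described above, here is a toy Python sketch. It is not Lucene's WordDelimiterFilter: the function name `word_delimiter` and its parameters are hypothetical, it only handles purely alphabetic tokens, and the final lowercasing stands in for a later LowerCaseFilter step.

```python
import re


def word_delimiter(token, split_on_case_change=True, catenate_words=False):
    """Toy model of case-change splitting plus catenation.

    Assumption: `token` is purely alphabetic. This is an illustration,
    not the behaviour of Solr's WordDelimiterFilterFactory in full.
    """
    if split_on_case_change:
        # Break at each lower-to-upper transition: "McNeil" -> ["Mc", "Neil"].
        parts = re.findall(r"[A-Z]+[a-z]*|[a-z]+", token)
    else:
        parts = [token]

    tokens = list(parts)
    if catenate_words and len(parts) > 1:
        # Re-join the split parts into one extra term: "McNeil".
        tokens.append("".join(parts))

    # Stand-in for a LowerCaseFilter applied after the split.
    return [t.lower() for t in tokens]


if __name__ == "__main__":
    # Split only: the joined form "mcneil" is never produced.
    print(word_delimiter("McNeil"))                       # ['mc', 'neil']
    # Split plus catenation: "mcneil" is produced as well.
    print(word_delimiter("McNeil", catenate_words=True))  # ['mc', 'neil', 'mcneil']
    # An all-lower-case input has no case change to split on.
    print(word_delimiter("mcneil"))                       # ['mcneil']
```

In this model an all-lower-case 'mcneil' only lines up with 'McNeil' when catenation has produced the joined term. The analysis GUI remains the authoritative way to see what the real chains emit.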
If you haven't already, use Solr's analysis GUI [1] to test these configurations. It shows, step by step, what the index-time and query-time analysis chains produce, and whether they match up in the end.

Regards,
Markus

[1] http://localhost:8983/solr/#/COLLECTION/analysis

On Tue, 27 Sep 2022 at 16:54, Miguel Joy <[email protected]> wrote:

> Hi Markus,
>
> Thanks so much for your recommendations. Matching the splitOnCaseChange
> attributes index-time with the query-time partially fixed our issue.
> Now, if I search for [email protected] and provide the exact same
> case as the email is stored I get a successful result! However, if I
> search using [email protected] (all lower-case), it doesn't match.
> Essentially, only if I search using the exact same case as the email is
> stored do I get results. Any additional ideas on how I can get the email
> search to fully work? Thanks again for your help.
>
> -Miguel
>
> -----Original Message-----
> From: Miguel Joy
> Sent: Tuesday, September 27, 2022 6:43 AM
> To: [email protected]
> Subject: RE: Solr Search - Mixed Case Issue
>
> Hi Markus,
>
> Thanks for your prompt reply to my issue. I will try your suggestions and
> report back.
>
> Thanks,
> -Miguel
>
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: Tuesday, September 27, 2022 6:36 AM
> To: [email protected]
> Subject: Re: Solr Search - Mixed Case Issue
>
> Hello Miguel,
>
> The problem lies with the different index-time and query-time
> WordDelimiterFilter configurations.
>
> > In addition, it's strange that we get search results on some mixed case
> > email addresses
>
> Yes, precisely!
>
> See the splitOnCaseChange attributes, that is where the problem is.
> In your case you should be able to copy the index-time configuration to the
> query-time and get rid of the problem without reindexing. It 'should' solve
> the problem. If not, try to enable catenateAll, on both sides, but that
> requires a reindex.
>
> Ideally you should probably also get rid of the StopFilterFactory. Unless
> it is very well configured (which I do not suspect), it will cause
> additional weird problems. This does require reindexing.
>
> Regards,
> Markus
>
> On Tue, 27 Sep 2022 at 11:55, Miguel Joy <[email protected]> wrote:
>
> > Hi all,
> >
> > I'm new to Solr and recently inherited a Solr application (version
> > 5.4) from a previous developer with very little documentation. At any
> > rate, my problem is this:
> >
> > I have some email addresses that are stored as mixed case.
> >
> > [email protected] = Success [querying for this email address and
> > passing in the full email address in any case [upper or lower]
> > returns the correct result]
> >
> > [email protected] = Fail [querying for this email address and
> > passing in the full email address in any case [upper or lower]
> > returns zero results]
> >
> > And here's the fieldType definition that's used for email addresses:
> >
> > <fieldType name="text_phonetic" class="solr.TextField"
> >            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >   <analyzer type="index">
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"
> >     />
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> >             splitOnNumerics="0"/>
> >     <filter class="solr.PhoneticFilterFactory"
> >             encoder="Caverphone" inject="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory"
> >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory"
> >             ignoreCase="true"
> >             words="stopwords.txt"
> >     />
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> >             splitOnNumerics="0"/>
> >     <filter class="solr.PhoneticFilterFactory"
> >             encoder="Caverphone" inject="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory"
> >             protected="protwords.txt"/>
> >     <filter class="solr.PorterStemFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > I've spent a couple of days researching this issue, and my best guess at
> > a fix would be to re-index this data using the LowerCaseFilterFactory
> > so that all email addresses are stored in lower case, but that would
> > be a significant change as I have over 10 million docs indexed. In
> > addition, it's strange that we get search results on some mixed case
> > email addresses, but not all, so I'm hoping that maybe all we need is
> > to tweak the query analyzer? Thanks in advance for your help with
> > this question. Please let me know if you need any additional details.
> >
> > -Miguel
