So I managed to get the tokenizing to work with both PatternTokenizerFactory and WordDelimiterFilterFactory (used in combination with WhitespaceTokenizerFactory). For PatternTokenizerFactory I used a regex that matches the various permutations of the phrases, and for WordDelimiterFilterFactory with WhitespaceTokenizerFactory I used protected words listing every permutation (there are only 40 or 50).
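
To make the PatternTokenizerFactory setup concrete, here is a minimal sketch of such a field type. The field type name "text_phrase" and the regex are illustrative assumptions, not the actual config used here; the real regex covers many more permutations. Setting group="0" tells the tokenizer to emit each full regex match as a token instead of splitting on the matches.

```xml
<!-- Illustrative sketch only: the name "text_phrase" and the regex are assumptions -->
<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <!-- No "type" attribute, so the same analysis chain runs at index and query time -->
  <analyzer>
    <!-- group="0": each full regex match becomes one token.
         The first alternative keeps phrases such as "Vitamin C" or "Vitamin B-12" together;
         the second alternative falls back to plain whitespace-delimited words. -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="Vitamin[ -]?[A-Z](?:-?\d+)?|\S+"
               group="0"/>
  </analyzer>
</fieldType>
```

With a pattern along these lines, the admin/analysis screen should show "Super Vitamin C" as the two tokens "Super" and "Vitamin C".
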
In both cases, the admin/analysis screen shows the Index and Query values tokenized correctly (for example, "Super Vitamin C" is tokenized as "Super" and "Vitamin C").

However, when I run a query like "DisplayName:(Super Vitamin C)" with "debug=query", the parsed query is "DisplayName:Super DisplayName:Vitamin DisplayName:C" ("DisplayName" is the field I'm working on here). Shouldn't it instead be parsed as something like "DisplayName:Super DisplayName:"Vitamin C""? Or am I misunderstanding how query parsing works?

Either way, I'm seeing results where DisplayName contains things like "Vitamin B 90 Caps" or "Super Orange 30 pkts", neither of which contains the phrase "Vitamin C", so I suspect something is wrong.

On Thu, Mar 23, 2017 at 8:08 AM, Joel Bernstein <joels...@gmail.com> wrote:

> You can also check out
> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-RegularExpressionPatternTokenizer
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Mar 22, 2017 at 7:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Susheel:
> >
> > That'll work, but the options you've specified for
> > WordDelimiterFilterFactory pretty much make it so it's doing nothing.
> > I realize it's commented out...
> >
> > That said, it's true that if you have a very specific pattern you want
> > to recognize a Regex can do the trick. WDFF is a bit more generic
> > though when you have less specific requirements.
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 22, 2017 at 12:56 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> >
> > > I have used PatternReplaceFilterFactory in some of these situations, e.g.
> > > below
> > >
> > > <tokenizer class="solr.ClassicTokenizerFactory"/>
> > > <!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> > > generateNumberParts="0" catenateWords="0" catenateNumbers="1"
> > > catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" /> -->
> > > <filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)-(\d+)-?(\d+)$"
> > > replacement="$1$2$3"/>
> > >
> > > On Wed, Mar 22, 2017 at 2:54 PM, Mark Johnson <mjohn...@emersonecologics.com> wrote:
> > >
> > >> Awesome, thank you much!
> > >>
> > >> On Wed, Mar 22, 2017 at 2:38 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >>
> > >> > Take a close look at WordDelimiterFilterFactory, it's designed to deal
> > >> > with things like part numbers, phone numbers and the like, and the
> > >> > example you gave is in the same class of problem I think. It'll take
> > >> > a bit to get your head around what it does, but it'll perform better
> > >> > than regexes, assuming you can get what you need out of it.
> > >> >
> > >> > And the admin/analysis page will help you _greatly_ in understanding
> > >> > what the effects of the various parameters are.
> > >> >
> > >> > Best,
> > >> > Erick
> > >> >
> > >> > On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
> > >> > <mjohn...@emersonecologics.com> wrote:
> > >> > > Is it possible to configure Solr to treat text that matches a regex as a
> > >> > > phrase?
> > >> > >
> > >> > > I have a database full of products, and the Title and Description fields
> > >> > > are text_en, tokenized via the StandardTokenizerFactory.
> > >> > > This works in most cases, but a number of products have names like:
> > >> > >
> > >> > > - Vitamin A
> > >> > > - Vitamin-A
> > >> > > - Vitamin B12
> > >> > > - Vitamin B-12
> > >> > > ...and so on
> > >> > >
> > >> > > I have a regex that will match all of the permutations and would like to
> > >> > > configure the field type so that anything that matches the regex pattern is
> > >> > > treated as a single token, instead of being broken up by spaces, etc. Is
> > >> > > that possible?
> > >> > >
> > >> > > --
> > >> > > *This message is intended only for the use of the individual or entity to
> > >> > > which it is addressed and may contain information that is privileged,
> > >> > > confidential and exempt from disclosure under applicable law. If you have
> > >> > > received this message in error, you are hereby notified that any use,
> > >> > > dissemination, distribution or copying of this message is prohibited.
> > >> > > If you have received this communication in error, please notify the sender
> > >> > > immediately and destroy the transmitted information.*
> > >>
> > >> --
> > >>
> > >> Best Regards,
> > >>
> > >> *Mark Johnson* | .NET Software Engineer
> > >>
> > >> Office: 603-392-7017
> > >>
> > >> Emerson Ecologics, LLC | 1230 Elm Street | Suite 301 | Manchester NH | 03101
> > >>
> > >> <http://www.emersonecologics.com/> <https://wellevate.me/#/>
> > >>
> > >> *Supporting The Practice Of Healthy Living*
> > >>
> > >> <http://blog.emersonecologics.com/>
> > >> <https://www.linkedin.com/company/emerson-ecologics>
> > >> <https://www.facebook.com/emersonecologics/>
> > >> <https://twitter.com/EmersonEcologic>
> > >> <https://www.instagram.com/emerson_ecologics/>
> > >> <https://www.pinterest.com/emersonecologic/>
> > >> <https://www.glassdoor.com/Overview/Working-at-Emerson-Ecologics-EI_IE388367.11,28.htm>

--

Best Regards,

*Mark Johnson* | .NET Software Engineer