So I managed to get the tokenizing to work with
both PatternTokenizerFactory and WordDelimiterFilterFactory (used in
combination with WhitespaceTokenizerFactory). For PT I used a regex that
matches the various permutations of the phrases, and for WDF/WT I used
protected words with every permutation (there are only 40 or 50).
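
For illustration only, here is a minimal Python sketch of the "regex match becomes one token" idea behind the PatternTokenizerFactory approach. The pattern below is a hypothetical stand-in, not the regex actually used here, and Python's `re` module stands in for Java's regex engine; the function name `tokenize` is likewise made up for this example.

```python
import re

# Hypothetical pattern (an assumption, not the regex from this thread):
# first try to match a "Vitamin X" style phrase as a single token,
# otherwise fall back to any run of non-whitespace characters.
TOKEN_PATTERN = re.compile(r"[Vv]itamin[ -][A-Za-z](?:-?\d+)?\b|\S+")


def tokenize(text):
    """Return every match of TOKEN_PATTERN, in order, as one token each."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    for sample in ["Super Vitamin C", "Vitamin-A", "Vitamin B-12 90 Caps"]:
        print(sample, "->", tokenize(sample))
```

The ordering of the alternatives matters: the phrase alternative is tried first at each position, so the whitespace fallback only applies where no phrase starts.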

In both cases, via the admin/analysis screen, the Index and Query values
were tokenized correctly (for example, "Super Vitamin C" was tokenized as
"Super" and "Vitamin C").

However, when I do a query like "DisplayName:(Super Vitamin C)" with
"debug=query", I see that the parsed query is "DisplayName:Super
DisplayName:Vitamin DisplayName:C" ("DisplayName" is the field I'm working
on here).

Shouldn't that instead be parsed as something like "DisplayName:Super
DisplayName:"Vitamin C"" or something similar? Or am I misunderstanding
how query parsing works?

In either case, I'm seeing results where DisplayName contains things like
"Vitamin B 90 Caps" or "Super Orange 30 pkts", neither of which contains the
phrase "Vitamin C", so I suspect something is wrong.
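
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Solr's standard query parsers of this era split the query text on whitespace before handing each piece to the field's analyzer, so at query time the tokenizer never sees "Vitamin C" as one string, whereas the admin/analysis screen analyzes the whole input at once. The Python sketch below imitates both behaviours with a hypothetical token pattern. The function names are invented for the example, and the quoted rendering of multi-word tokens is only for readability, not Solr's actual debug output.

```python
import re

# Hypothetical analyzer pattern (an assumption, not the thread's regex):
# a "Vitamin X" phrase as one token, else any non-whitespace run.
TOKEN_PATTERN = re.compile(r"[Vv]itamin[ -][A-Za-z](?:-?\d+)?\b|\S+")


def analyze(text):
    """Stand-in for the field's analysis chain."""
    return TOKEN_PATTERN.findall(text)


def parse_split_on_whitespace(field, query):
    """Split the query on whitespace first, then analyze each chunk alone."""
    clauses = []
    for chunk in query.split():
        for token in analyze(chunk):
            clauses.append(f"{field}:{token}")
    return " ".join(clauses)


def parse_whole_string(field, query):
    """Analyze the whole query string at once, as the analysis screen does."""
    clauses = []
    for token in analyze(query):
        shown = f'"{token}"' if " " in token else token
        clauses.append(f"{field}:{shown}")
    return " ".join(clauses)


if __name__ == "__main__":
    print(parse_split_on_whitespace("DisplayName", "Super Vitamin C"))
    print(parse_whole_string("DisplayName", "Super Vitamin C"))
```

If this is indeed the cause, quoting the phrase in the query (so it reaches the analyzer as one string) is one thing to try; Solr 6.5 and later also document a `sow` (split on whitespace) request parameter that may be relevant.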

On Thu, Mar 23, 2017 at 8:08 AM, Joel Bernstein <joels...@gmail.com> wrote:

> You can also check out
> https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-RegularExpressionPatternTokenizer
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Mar 22, 2017 at 7:52 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Susheel:
> >
> > That'll work, but the options you've specified for
> > WordDelimiterFilterFactory pretty much make it so it's doing nothing.
> > I realize it's commented out...
> >
> > That said, it's true that if you have a very specific pattern you want
> > to recognize a Regex can do the trick. WDFF is a bit more generic
> > though when you have less specific requirements.
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 22, 2017 at 12:56 PM, Susheel Kumar <susheel2...@gmail.com>
> > wrote:
> > > I have used PatternReplaceFilterFactory in some of these situations,
> > > e.g. below:
> > >
> > > <tokenizer class="solr.ClassicTokenizerFactory"/>
> > > <!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> > >      generateNumberParts="0" catenateWords="0" catenateNumbers="1"
> > >      catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" /> -->
> > > <filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)-(\d+)-?(\d+)$"
> > >      replacement="$1$2$3"/>
> > >
> > > On Wed, Mar 22, 2017 at 2:54 PM, Mark Johnson <mjohn...@emersonecologics.com>
> > > wrote:
> > >
> > >> Awesome, thank you much!
> > >>
> > >> On Wed, Mar 22, 2017 at 2:38 PM, Erick Erickson <erickerick...@gmail.com>
> > >> wrote:
> > >>
> > >> > Take a close look at WordDelimiterFilterFactory, it's designed to deal
> > >> > with things like part numbers, phone numbers and the like, and the
> > >> > example you gave is in the same class of problem I think. It'll take
> > >> > a bit to get your head around what it does, but it'll perform better
> > >> > than regexes, assuming you can get what you need out of it.
> > >> >
> > >> > And the admin/analysis page will help you _greatly_ in understanding
> > >> > what the effects of the various parameters are.
> > >> >
> > >> > Best,
> > >> > Erick
> > >> >
> > >> > On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
> > >> > <mjohn...@emersonecologics.com> wrote:
> > >> > > Is it possible to configure Solr to treat text that matches a regex as a
> > >> > > phrase?
> > >> > >
> > >> > > I have a database full of products, and the Title and Description fields
> > >> > > are text_en, tokenized via the StandardTokenizerFactory. This works in most
> > >> > > cases, but a number of products have names like:
> > >> > >
> > >> > >  - Vitamin A
> > >> > >  - Vitamin-A
> > >> > >  - Vitamin B12
> > >> > >  - Vitamin B-12
> > >> > > ...and so on
> > >> > >
> > >> > > I have a regex that will match all of the permutations and would like to
> > >> > > configure the field type so that anything that matches the regex pattern is
> > >> > > treated as a single token, instead of being broken up by spaces, etc. Is
> > >> > > that possible?
> > >> > >
> > >> >
> > >>
> >
>



-- 

Best Regards,

*Mark Johnson* | .NET Software Engineer

Office: 603-392-7017

Emerson Ecologics, LLC | 1230 Elm Street | Suite 301 | Manchester NH | 03101

<http://www.emersonecologics.com/>  <https://wellevate.me/#/>

-- 
*This message is intended only for the use of the individual or entity to 
which it is addressed and may contain information that is privileged, 
confidential and exempt from disclosure under applicable law. If you have 
received this message in error, you are hereby notified that any use, 
dissemination, distribution or copying of this message is prohibited. If 
you have received this communication in error, please notify the sender 
immediately and destroy the transmitted information.*
