Re: edismax, phrase field gets ignored for keyword tokenizer

Stefan Matheis Mon, 07 Nov 2016 08:25:22 -0800

Which is everything fine by itself - but doesn’t shed more light on my
initial question Vincenzo, does it? probably i shoudn’t have mentioned
partial matches in the first place, that might have lead into the
wrong direction - they are not relevant for now / not for this
question.


I’d like to know why & where edismax drops out phrase fields which are
using a Keyword Tokenizer. Maybe there is a larger idea behind this
behavior, but i don’t see it (yet).

-Stefan


On November 7, 2016 at 5:09:04 PM, Vincenzo D'Amore (v.dam...@gmail.com) wrote:
> If you don't want partial matches with edismax you should always use
> StandardTokenizerFactory and play with mm parameter.
>
> On Mon, Nov 7, 2016 at 4:50 PM, Stefan Matheis
> wrote:
>
> > Vincenzo,
> >
> > thanks for the response - i know that only the Keyword Tokenizer by
> > itself does not do anything. as pointed at the end of the initial
> > mail, i’m applying a pattern replace for everything non-numeric to
> > make it actually useful.
> >
> > and especially because of the tokenization based on whitespaces i’d
> > like to use the very same field once again as phrase field to around
> > this issue. Shawn mentioned in #solr in the meantime that there is
> > SOLR-9185 which is similar and would be helpful, but currently very
> > very in-the-works.
> >
> > Standard Tokenizer you’ve mentioned does split on whitespace - as
> > edismax does by default in the first place. so i’m not sure how that
> > would help? For now, i don’t want to have partial matches on phone
> > numbers .. at least not yet.
> >
> > -Stefan
> >
> >
> > On November 7, 2016 at 4:41:50 PM, Vincenzo D'Amore (v.dam...@gmail.com)
> > wrote:
> > > Hi Stefan,
> > >
> > > I think the problem is solr.KeywordTokenizerFactory.
> > > This tokeniser does not make any tokenisation to the string, it returns
> > > exactly what you have.
> > >
> > > '+49 1234 12345678' -> '+49 1234 12345678'
> > >
> > > On the other hand, using edismax you are looking for '+49', '1234' and
> > > '12345678' and none of these keywords match your phone_number field.
> > >
> > > Try using a different tokenizer like solr.StandardTokenizerFactory, this
> > > should change your results.
> > >
> > > Bests,
> > > Vincenzo
> > >
> > > On Mon, Nov 7, 2016 at 4:05 PM, Stefan Matheis
> > > wrote:
> > >
> > > > I’m guessing that i’m missing something obvious here - so feel free to
> > > > ask for more details and as well point out other directions i should
> > > > following.
> > > >
> > > > the problem goes as follows: the input in one case might be a phone
> > > > number (like +49 1234 12345678), since we’re using edismax the parts
> > > > gets split on whitespaces - which is fine. bringing the same field
> > > > (based on TextField) to the party (using qf) doesn’t change a thing.
> > > >
> > > > > responseHeader:
> > > > > params:
> > > > > q: '+49 1234 12345678'
> > > > > defType: edismax
> > > > > qf: person_mobile
> > > > > pf: person_mobile^5
> > > > > debug:
> > > > > rawquerystring: '+49 1234 12345678'
> > > > > querystring: '+49 1234 12345678'
> > > > > parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> > > > DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_
> > mobile:12345678)))
> > > > ())/no_coord'
> > > > > parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> > > > (person_mobile:12345678)) ()’
> > > >
> > > > but .. as far as i was able to reduce the culprit, that only happens
> > > > when i’m using solr.KeywordTokenizerFactory . as soon as i’m changing
> > > > that to solr.StandardTokenizerFactory the phrase query appears as
> > > > expected:
> > > >
> > > > > responseHeader:
> > > > > params:
> > > > > q: '+49 1234 12345678'
> > > > > defType: edismax
> > > > > qf: person_mobile
> > > > > pf: person_mobile^5
> > > > > debug:
> > > > > rawquerystring: '+49 1234 12345678'
> > > > > querystring: '+49 1234 12345678'
> > > > > parsedquery: '(+(+DisjunctionMaxQuery((person_mobile:49))
> > > > DisjunctionMaxQuery((person_mobile:1234)) DisjunctionMaxQuery((person_
> > mobile:12345678)))
> > > > DisjunctionMaxQuery(((person_mobile:"49 1234
> > 12345678")^5.0)))/no_coord'
> > > > > parsedquery_toString: '+(+(person_mobile:49) (person_mobile:1234)
> > > > (person_mobile:12345678)) ((person_mobile:"49 1234 12345678")^5.0)’
> > > >
> > > > removing the + at the beginning, doesn’t make a difference either
> > > > (just mentioning since tokee already asked this on #solr, where i’ve
> > > > brought up the question earlier)
> > > >
> > > > it’s absolutely possible i’m focusing on a very wrong assumption - but
> > > > since switching the tokenizer does result in such a rather large
> > > > behaviour change, i think something is spooky here.
> > > >
> > > > i’ve read older issues and posts from the list, some of them pointed
> > > > out that it might be a optimization that edismax brings to the table -
> > > > i didn’t find anything specific about that.
> > > >
> > > > oh, and btw: if that would be working - my idea is to drop out
> > > > everything for a given phrase that is not a number, to match the phone
> > > > number, like this:
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > > replacement=""/>
> > > > >
> > > > >
> > > >
> > > > any thoughts? or wild guesses?
> > > >
> > > > Thanks Stefan
> > > >
> > >
> > >
> > >
> > > --
> > > Vincenzo D'Amore
> > > email: v.dam...@gmail.com
> > > skype: free.dev
> > > mobile: +39 349 8513251
> > >
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>

Re: edismax, phrase field gets ignored for keyword tokenizer

Reply via email to