Re: Can't get phrase field boosting to work using edismax

Jack Krupansky Wed, 06 Apr 2016 07:31:53 -0700

I haven't traced through all the code recently, so I can't dispute Jan if
he knows a place that checks the output of the pf phrase analysis to see if
it is a single term, but... the INPUT to pf is definitely multiple clauses.
Regardless of the use of the keyword tokenizer, the query parser sees two
tokens, "some" and "words", and passes them as separate clauses to the code
I referenced above, which constructs quoted phrases and passes them through
the query parser again for the pf fields. What happens after that I cannot
say for sure.
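
As a rough illustration of that flow, here is a simplified Python model. It is not the actual ExtendedDismaxQParser code: the function and analyzer names are hypothetical, and the "drop a single-term result when minClauseSize > 1" step is taken from the debugger observations quoted further down in this thread, not from my own reading of the source.

```python
from typing import Callable, List, Optional

Analyzer = Callable[[str], List[str]]


def keyword_analyzer(text: str) -> List[str]:
    """Stand-in for a keyword tokenizer with lowercasing: the input is one token."""
    return [text.lower()]


def whitespace_analyzer(text: str) -> List[str]:
    """Stand-in for a tokenizing analyzer: one token per word."""
    return text.lower().split()


def pf_boost_query(user_query: str, field: str, analyze: Analyzer,
                   min_clause_size: int = 2) -> Optional[str]:
    """Return a string form of the pf boost query, or None if it is dropped."""
    clauses = user_query.split()      # the parser sees "some" and "words"
    phrase = " ".join(clauses)        # re-assembled into one quoted phrase
    terms = analyze(phrase)           # field analysis of that phrase
    if not terms:
        return None                   # nothing left to boost on
    if len(terms) >= 2:
        return field + ':"' + " ".join(terms) + '"'   # phrase query
    if min_clause_size > 1:
        return None                   # single term: the boost is silently dropped
    return field + ":" + terms[0]     # single term kept as a term query


print(pf_boost_query("some words", "title", whitespace_analyzer))
print(pf_boost_query("some words", "exactTitle", keyword_analyzer))
print(pf_boost_query("some words", "exactTitle", keyword_analyzer, min_clause_size=0))
```

In this model the keyword-tokenized field only yields a boost clause when min_clause_size is lowered, which matches what was reported after changing minClauseSize in the debugger.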


But if the pf post-analysis processing does have the limitation that the
analysis of a multi-word phrase must yield at least two terms, it should be
clearly documented. That's essentially what is at stake in this particular
issue.

Granted, my first thought was also that the use of the keyword tokenizer
would be a no-no for a pf field, but this particular use case seems valid
to me, so we should consider whether the "multiple words analyze to one
term" case should be supported.

I can see wanting to combine a multi-term pf field with a single-term pf
field, with the latter having a higher boost. For example, a document whose
product name field exactly matches the input query would then rank above
one where the query simply matches a subset of a longer product name.
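
A minimal sketch of what such a request could look like, built with Python's standard library. The field names title and exactTitle come from the original question; the boost values 5 and 100 are only illustrative assumptions. Note that, per this thread, the exactTitle part would currently have no effect when that field uses the keyword tokenizer.

```python
from urllib.parse import parse_qs, urlencode

# Illustrative edismax parameters: a tokenized pf field with a modest boost
# plus a keyword-tokenized "exact" pf field with a much higher boost.
params = {
    "q": "some words",
    "defType": "edismax",
    "qf": "title^2",
    "pf": "title^5 exactTitle^100",   # boost values are assumptions
    "debugQuery": "true",
}

query_string = urlencode(params)      # percent-encodes "^" and turns spaces into "+"
decoded = parse_qs(query_string)      # round trip to confirm nothing was lost

print("?" + query_string)
print(decoded["pf"])
```

The debugQuery=true parameter is kept so that the explain output shows whether either pf clause actually contributed to the score.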


-- Jack Krupansky

On Wed, Apr 6, 2016 at 5:22 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:

> OK, well I'm not sure I agree with you. First of all, you ask me to point
> my "pf" towards a tokenized field, but I already do that (the fact that all
> text is tokenized into a single token doesn't change that fact). Also, I
> don't agree with the view that a single-term phrase is never
> valid/reasonable. In this specific case, with a KeywordTokenizer, I see it
> as very reasonable indeed. And I would consider a "single term keyword
> phrase" solution more logical than a workaround using special magical
> characters inserted in the text. Just my two cents... :)
>
> Oh, hang on... If a phrase is defined as multiple tokens, and pf is used
> for phrase boosting, does that mean that even with a regular tokenizer the
> pf won't work for fields that only contain one word? For example if the
> title of one document is "John", and the user searches for 'John' (without
> any surrounding phrase-characters), will edismax not boost this document?
>
> /Jimi
>
> -----Original Message-----
> From: Jan Høydahl [mailto:jan....@cominvent.com]
> Sent: Wednesday, April 6, 2016 10:43 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Can't get phrase field boosting to work using edismax
>
> Hi,
>
> Phrase match via “pf” requires the target field to contain a phrase. A
> phrase is defined as multiple tokens. Yours does not contain a phrase since
> you use the KeywordTokenizer, leaving only one token in the field. eDismax
> pf will thus never kick in. Please point your “pf” towards a tokenized
> field.
>
> If what you are trying to achieve is to boost only when the whole query
> exactly matches the full content of the field, then have a look at my
> solution here https://github.com/cominvent/exactmatch
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 5. apr. 2016 kl. 19.10 skrev jimi.hulleg...@svensktnaringsliv.se:
> >
> > Some more input, before I call it a day. Just for the heck of it, I
> > tried changing minClauseSize to 0 using the Eclipse debugger, so that it
> > didn't return null at line 1203, but instead returned the TermQuery on
> > line 1205. Then everything worked exactly as it should. The matching
> > document got boosted as expected. And in the explain output, this can be
> > seen:
> >
> > [...]
> > 11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172) [DefaultSimilarity], result of:
> > [...]
> >
> > So. In my case, having minClauseSize=2 on line 550 (line 565 for solr
> > 5.5.0) is the culprit. Is this a bug, or am I using the pf in the wrong
> > way? Can someone explain why minClauseSize can't be set to 0 here? The
> > comment simply states "we need at least two or there shouldn't be a
> > boost", but gives no explanation of *why* at least two is needed.
> >
> > Regards
> > /Jimi
> >
> > -----Original Message-----
> > From: jimi.hulleg...@svensktnaringsliv.se
> > [mailto:jimi.hulleg...@svensktnaringsliv.se]
> > Sent: Tuesday, April 5, 2016 6:51 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Can't get phrase field boosting to work using edismax
> >
> > I now used the Eclipse debugger to try to understand what is happening,
> > and it seems like the ExtendedDismaxQParser simply ignores my pf
> > parameter, since it doesn't interpret it as a phrase query.
> >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
> >
> > On line 1180 I get a query object of type TermQuery (with the term
> > "exactTitle:some words"). And in the if statements starting at line it
> > is quite clear that if it is not a PhraseQuery or a MultiPhraseQuery, or
> > if the minClauseSize > 1 (and it is set to 2 on line 550) the method
> > simply returns null (ie ignoring my pf parameter). Why is this happening?
> >
> > I use Solr 4.6 by the way... I forgot to mention that in my original
> > message.
> >
> >
> > -----Original Message-----
> > From: jimi.hulleg...@svensktnaringsliv.se
> > [mailto:jimi.hulleg...@svensktnaringsliv.se]
> > Sent: Tuesday, April 5, 2016 5:36 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Can't get phrase field boosting to work using edismax
> >
> > OK. Interesting. But... I added a solr.TrimFilterFactory at the end of
> > my analyzer definition. Shouldn't that take care of the added space at
> > the end? The admin analysis page indicates that it works as it should,
> > but I still can't get edismax to boost.
> >
> > -----Original Message-----
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Tuesday, April 5, 2016 4:42 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Can't get phrase field boosting to work using edismax
> >
> > It looks like the code constructing the boost phrase for pf will always
> > add a trailing blank, which is never a problem when a normal tokenizer is
> > used that removes white space, but the keyword tokenizer will preserve
> > that extra space, which prevents an exact match.
> >
> > See line 531:
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
> >
> > I'd say it's a bug, but more a narrow use case that wasn't considered
> > or tested.
> >
> > -- Jack Krupansky
> >
> > On Tue, Apr 5, 2016 at 7:50 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:
> >
> >> Hi,
> >>
> >> I'm trying to boost documents using phrase field boosting (i.e. the
> >> pf parameter for edismax), but I can't get it to work (i.e. boosting
> >> documents where the pf field matches the query as a phrase).
> >>
> >> As far as I can tell, solr, or more specifically the edismax handler,
> >> does *something* when I add this parameter. I know this because the
> >> QTime increases from around 5-10 ms to around 30-40 ms, and the score
> >> explain structure is *slightly* modified (though with the same final
> >> score for all documents). But nowhere in the explain structure can I
> >> see anything about the pf. And I can't understand that. Shouldn't it
> >> be included in the explain? If not, is there any way to force it to be
> >> included somehow?
> >>
> >> The query looks something like this:
> >>
> >>
> >> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exactTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1%5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
> >>
> >>
> >> I have one document that has the title "some words", and when I do a
> >> simple query filter with exactTitle:"some words" I get a match for
> >> that document. So then I would expect that the query above would
> >> boost this document, and include information about this in the
> >> explain. But nothing like this happens, and I can't understand why.
> >>
> >> The field looks like this:
> >>
> >> <field name="exactTitle" type="keywordText" indexed="true" stored="true"
> >> required="false" multiValued="false" />
> >>
> >> And the fieldType looks like this:
> >>
> >> <fieldType name="keywordText" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer>
> >>     <charFilter class="solr.HTMLStripCharFilterFactory" />
> >>     <tokenizer class="solr.KeywordTokenizerFactory" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>   </analyzer>
> >> </fieldType>
> >>
> >>
> >> I have also tried boosting this document using a boost query, ie
> >> bq=exactTitle:"some words", and this works as expected. The document
> >> score is boosted, and the explain states this very clearly, with this
> >> segment:
> >>
> >> [...]
> >> 9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12) [DefaultSimilarity], result of:
> >> [...]
> >>
> >> Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't
> >> edismax rewrite my "pf query" into something very similar to the "bq
> >> query"?
> >>
> >> Regards
> >> /Jimi
> >>
>
>
