Re: tf and very short text fields

Tom Burton-West Fri, 04 Apr 2014 08:07:29 -0700

Thanks Markus,

I was thinking about normalization and was absolutely wrong about setting
k1 to zero. I should have taken a look at the algorithm and walked
through setting k1 = 0. (This is easier to do by looking at the formula on
Wikipedia, http://en.wikipedia.org/wiki/Okapi_BM25, than by walking through
the code.)
When you set k1 to 0, it does just what you said, i.e. it provides binary tf:
that part of the formula returns 1 if the term is present and 0 if not.
That is, I think, what Wunder was trying to accomplish.
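
To make that concrete, here is a minimal sketch in Python of the tf part of
the formula as written on the Wikipedia page. It is not Lucene's actual
BM25Similarity code (which works from encoded norms), and the function name
bm25_tf_norm is made up for illustration. The input values are taken from
the explain output in Markus's message quoted below.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tf component of BM25 per the Wikipedia formula (illustrative only)."""
    # Length normalization factor: 1 when b = 0, grows with longer fields.
    length_norm = 1.0 - b + b * field_length / avg_field_length
    # With k1 = 0 this reduces to freq / freq, i.e. 1 for any freq > 0.
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Field statistics copied from the quoted explain output.
stats = dict(b=0.75, field_length=16.0, avg_field_length=8.721312)

flat = bm25_tf_norm(freq=1.5, k1=0.0, **stats)
saturating = bm25_tf_norm(freq=1.5, k1=0.2, **stats)

print(flat)        # 1.0
print(saturating)  # about 0.986, in line with the tfNorm in the explain output

# With k1 = 0, a term that occurs twice scores the same as one that occurs once.
once = bm25_tf_norm(freq=1.0, k1=0.0, **stats)
twice = bm25_tf_norm(freq=2.0, k1=0.0, **stats)
```

The function is only meaningful for freq > 0; a document that does not
contain the term is simply not scored for it, which is the "0 if not" case.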


Sorry about jumping in without double checking things first.

Tom


On Fri, Apr 4, 2014 at 7:38 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi - In this case Walter, iirc, was looking for two things: no
> normalization and flat TF (1f for tf(float freq) > 0). We know that k1
> controls TF saturation, but in BM25Similarity you can see that k1 is
> multiplied by the encoded norm value, taking b also into account. So
> setting k1 to zero effectively disables length normalization and results in
> flat or binary TF.
>
> Here's example output for k1 = 0 and k1 = 0.2. Norms are enabled on the
> field, and the term occurs three times in the field:
>
>         28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5
> ), product of:
>           6.4 = boost
>           4.406719 = idf(docFreq=1, docCount=122)
>           1.0 = tfNorm, computed from:
>             1.5 = phraseFreq=1.5
>             0.0 = parameter k1
>             0.75 = parameter b
>             8.721312 = avgFieldLength
>             16.0 = fieldLength
>
>
>
>
>         27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5
> ), product of:
>           6.4 = boost
>           4.406719 = idf(docFreq=1, docCount=122)
>           0.98619986 = tfNorm, computed from:
>             1.5 = phraseFreq=1.5
>             0.2 = parameter k1
>             0.75 = parameter b
>             8.721312 = avgFieldLength
>             16.0 = fieldLength
>
>
> You can clearly see the final TF norm being 1 when k1 = 0, regardless of
> the term frequency and field length. Please correct my wrongs :)
> Markus
>
>
>
> -----Original message-----
> > From:Tom Burton-West <tburt...@umich.edu>
> > Sent: Thursday 3rd April 2014 20:18
> > To: solr-user@lucene.apache.org
> > Subject: Re: tf and very short text fields
> >
> > Hi Markus and Wunder,
> >
> > I'm missing the original context, but I don't think BM25 will solve this
> > particular problem.
> >
> > The k1 parameter sets how quickly the contribution of tf to the score
> > falls off with increasing tf. It would be helpful for making sure really
> > long documents don't get too high a score, but I don't think it would
> > help for very short documents without messing up its original design
> > purpose.
> >
> > For BM25, if you want to turn off length normalization, you set "b" to 0.
> > However, I don't think that will do what you want, since turning off
> > normalization will mean that the score for "new york, new york" will be
> > twice that of the score for "new york", since without normalization the
> > tf in "new york, new york" is twice that of "new york".
> >
> > I think the earlier suggestion to "override tfidfsimilarity and emit 1f
> > in tf()" is probably the best way to eliminate the use of tf counts,
> > assuming that is really what you want.
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > > Thanks! We'll try that out and report back. I keep forgetting that I
> > > want to try BM25, so this is a good excuse.
> > >
> > > wunder
> > >
> > > On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > Also, if I remember correctly, k1 set to zero for bm25 automatically
> > > > omits norms in the calculation. So that's easy to play with without
> > > > reindexing.
> > > >
> > > >
> > > > Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > Yes, override tfidfsimilarity and emit 1f in tf(). You can also use
> > > > bm25 with k1 set to zero in your schema.
> > > >
> > > >
> > > > Walter Underwood <wun...@wunderwood.org> wrote:
> > > > And here is another peculiarity of short text fields.
> > > >
> > > > The movie "New York, New York" should not be twice as relevant for
> > > > the query "new york". Is there a way to use a binary term frequency
> > > > rather than a count?
> > > >
> > > > wunder
> > > > --
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > >
> > > >
> > > >
> > >
> > > --
> > > Walter Underwood
> > > wun...@wunderwood.org
> > >
> > >
> > >
> > >
> >
>
