Thanks Markus,

I was thinking about normalization and was absolutely wrong about setting k1 to zero. I should have taken a look at the algorithm and walked through setting k1=0. (This is easier to do looking at the formula in Wikipedia, http://en.wikipedia.org/wiki/Okapi_BM25, than walking through the code.) When you set k1 to 0 it does just what you said, i.e., it provides binary tf: that part of the formula returns 1 if the term is present and 0 if not. I think that is what Wunder was trying to accomplish.
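
To make the k1 = 0 case concrete, here is a minimal Python sketch of the tf part of the formula as it is written on the Wikipedia page. The function name bm25_tf_norm and its argument names are made up for illustration; this is not Lucene's BM25Similarity code, which works on encoded norms, so real scores may differ slightly. The inputs are the values from Markus's explain output quoted below.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """TF component of BM25, following the Wikipedia form of the formula.

    Illustrative only: the name and signature are assumptions, not Lucene API.
    """
    if freq == 0:
        # Term absent: contributes nothing (also avoids 0/0 when k1 == 0).
        return 0.0
    # Length normalization factor; b = 0 would switch it off.
    length_norm = 1.0 - b + b * field_length / avg_field_length
    # With k1 = 0 this collapses to freq / freq = 1 for any freq > 0.
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Values taken from the explain output: freq=1.5, b=0.75,
# fieldLength=16.0, avgFieldLength=8.721312.
for k1 in (0.0, 0.2):
    print(k1, bm25_tf_norm(1.5, k1, 0.75, 16.0, 8.721312))
```

With k1 = 0 both the term frequency and the field length drop out, so any document containing the term gets a tf component of 1. With k1 = 0.2 the result comes out close to the 0.986 tfNorm in the second explain.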

Sorry about jumping in without double-checking things first.

Tom


On Fri, Apr 4, 2014 at 7:38 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi - In this case Walter, iirc, was looking for two things: no
> normalization and a flat TF (1f for tf(float freq) > 0). We know that k1
> controls TF saturation but in BM25Similarity you can see that k1 is
> multiplied by the encoded norm value, taking b also into account. So
> setting k1 to zero effectively disables length normalization and results in
> flat or binary TF.
>
> Here's an example output of k1 = 0 and k1 = 0.2. Norms are enabled on the
> field, term occurs three times in the field:
>
> 28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5
> ), product of:
>   6.4 = boost
>   4.406719 = idf(docFreq=1, docCount=122)
>   1.0 = tfNorm, computed from:
>     1.5 = phraseFreq=1.5
>     0.0 = parameter k1
>     0.75 = parameter b
>     8.721312 = avgFieldLength
>     16.0 = fieldLength
>
> 27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5
> ), product of:
>   6.4 = boost
>   4.406719 = idf(docFreq=1, docCount=122)
>   0.98619986 = tfNorm, computed from:
>     1.5 = phraseFreq=1.5
>     0.2 = parameter k1
>     0.75 = parameter b
>     8.721312 = avgFieldLength
>     16.0 = fieldLength
>
> You can clearly see the final TF norm being 1, despite the term frequency
> and length. Please correct my wrongs :)
> Markus
>
> -----Original message-----
> > From: Tom Burton-West <tburt...@umich.edu>
> > Sent: Thursday 3rd April 2014 20:18
> > To: solr-user@lucene.apache.org
> > Subject: Re: tf and very short text fields
> >
> > Hi Markus and Wunder,
> >
> > I'm missing the original context, but I don't think BM25 will solve this
> > particular problem.
> >
> > The k1 parameter sets how quickly the contribution of tf to the score falls
> > off with increasing tf. It would be helpful for making sure really long
> > documents don't get too high a score, but I don't think it would help for
> > very short documents without messing up its original design purpose.
> >
> > For BM25, if you want to turn off length normalization, you set "b" to 0.
> > However, I don't think that will do what you want, since turning off
> > normalization will mean that the score for "new york, new york" will be
> > twice that of the score for "new york" since without normalization the tf
> > in "new york new york" is twice that of "new york".
> >
> > I think the earlier suggestion to "override tfidfsimilarity and emit 1f in
> > tf()" is probably the best way to eliminate using tf counts,
> > assuming that is really what you want.
> >
> > Tom
> >
> > On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >
> > > Thanks! We'll try that out and report back. I keep forgetting that I want
> > > to try BM25, so this is a good excuse.
> > >
> > > wunder
> > >
> > > On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > Also, if i remember correctly, k1 set to zero for bm25 automatically
> > > > omits norms in the calculation. So that's easy to play with without
> > > > reindexing.
> > > >
> > > > Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > Yes, override tfidfsimilarity and emit 1f in tf(). You can also use
> > > > bm25 with k1 set to zero in your schema.
> > > >
> > > > Walter Underwood <wun...@wunderwood.org> wrote:
> > > > And here is another peculiarity of short text fields.
> > > >
> > > > The movie "New York, New York" should not be twice as relevant for the
> > > > query "new york". Is there a way to use a binary term frequency rather
> > > > than a count?
> > > >
> > > > wunder
> > > > --
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > >
> > > --
> > > Walter Underwood
> > > wun...@wunderwood.org