RE: tf and very short text fields
Hi - In this case Walter, iirc, was looking for two things: no normalization and flat TF (1f for tf(float freq) > 0). We know that k1 controls TF saturation, but in BM25Similarity you can see that k1 is multiplied by the encoded norm value, taking b into account as well. So setting k1 to zero effectively disables length normalization and results in flat or binary TF.

Here's an example output for k1 = 0 and k1 = 0.2. Norms are enabled on the field, and the term occurs three times in the field:

28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  1.0 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.0 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength

27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  0.98619986 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.2 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength

With k1 = 0 you can clearly see the final TF norm being 1, despite the term frequency and field length.

Please correct my wrongs :)

Markus
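The tfNorm and score values in Markus's explain output can be reproduced with a few lines of arithmetic. The sketch below assumes the usual BM25 term-frequency factor, freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)); the function and constant names are made up for illustration and are not Lucene APIs. It is only a check of the numbers, not a copy of the BM25Similarity code.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """BM25 tf factor: freq*(k1+1) / (freq + k1*(1 - b + b*len/avg_len))."""
    length_factor = 1.0 - b + b * field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)


# Values copied from the explain output in the message.
BOOST = 6.4
IDF = 4.406719
FREQ = 1.5
B = 0.75
FIELD_LENGTH = 16.0
AVG_FIELD_LENGTH = 8.721312


def score(k1):
    """Product of boost, idf and tfNorm, mirroring the explain tree."""
    tf_norm = bm25_tf_norm(FREQ, k1, B, FIELD_LENGTH, AVG_FIELD_LENGTH)
    return BOOST * IDF * tf_norm


if __name__ == "__main__":
    for k1 in (0.0, 0.2):
        tf_norm = bm25_tf_norm(FREQ, k1, B, FIELD_LENGTH, AVG_FIELD_LENGTH)
        print(f"k1={k1}: tfNorm={tf_norm:.6f} score={score(k1):.4f}")
```

With k1 = 0 the k1 term in the denominator vanishes, so b, fieldLength and avgFieldLength no longer affect the result, which is why the norms drop out without reindexing.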
Re: tf and very short text fields
Hi,

Another simple approach: if you don't use phrase queries or phrase boosting, you can set omitTermFreqAndPositions=true on the field.

Ahmet
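For reference, Ahmet's omitTermFreqAndPositions suggestion is a field (or field type) attribute in schema.xml. A minimal sketch follows; the field name and type are hypothetical. Since this changes what is written to the index, it presumably needs a reindex to take effect, and position-dependent queries such as phrase queries will not work on the field.

```xml
<!-- "title" and "text_general" are placeholder names for illustration -->
<field name="title" type="text_general" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>
```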
Re: tf and very short text fields
Thanks Markus,

I was thinking about normalization and was absolutely wrong about setting k1 to zero. I should have taken a look at the algorithm and walked through setting k1 = 0. (This is easier to do by looking at the formula on Wikipedia, http://en.wikipedia.org/wiki/Okapi_BM25, than by walking through the code.)

When you set k1 to 0 it does just what you said, i.e., it provides binary tf. That part of the formula returns 1 if the term is present and 0 if not, which I think is what Wunder was trying to accomplish.

Sorry about jumping in without double-checking things first.

Tom
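The walk-through Tom describes can be written out from the BM25 formula on the Wikipedia page he links. Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length, and avgdl is the average document length. This is a sketch of the algebra only.

```latex
\[
\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
% Setting k_1 = 0 removes both the saturation term and the length term:
\[
\left.\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}\right|_{k_1=0}
= \frac{f(q_i,D)}{f(q_i,D)} = 1 \quad \text{for } f(q_i,D) > 0
\]
```

A term that does not occur in the document does not match and contributes nothing, so each query term adds either IDF(q_i) or 0 to the score.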
Re: tf and very short text fields
On 4/1/14 2:32 PM, Walter Underwood wrote:

And here is another peculiarity of short text fields. The movie "New York, New York" should not be twice as relevant for the query "new york". Is there a way to use a binary term frequency rather than a count?

wunder
--
Walter Underwood
wun...@wunderwood.org

Walter - you can write a custom scoring function in Java, or use function queries to compose one in Solr query language. I don't see an exists(term) function in the list here, https://cwiki.apache.org/confluence/display/solr/Function+Queries, that would return 0 or 1, but you could write that?

-Mike
Re: tf and very short text fields
I see I missed Markus' earlier responses - somehow the messages didn't get threaded together in my reader. I may have to try BM25 too!
Re: tf and very short text fields
Hi Markus and Wunder,

I'm missing the original context, but I don't think BM25 will solve this particular problem.

The k1 parameter sets how quickly the contribution of tf to the score falls off with increasing tf. It would be helpful for making sure really long documents don't get too high a score, but I don't think it would help for very short documents without messing up its original design purpose.

For BM25, if you want to turn off length normalization, you set b to 0. However, I don't think that will do what you want, since turning off normalization will mean that the score for "new york, new york" will be twice the score for "new york": without normalization, the tf in "new york, new york" is twice that of "new york".

I think the earlier suggestion to override TFIDFSimilarity and emit 1f in tf() is probably the best way to eliminate tf counts, assuming that is really what you want.

Tom
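To see what b = 0 and k1 = 0 each do to the two titles, here is a small sketch using the standard BM25 term-frequency factor. The average length of 3 tokens is invented for illustration, k1 = 1.2 is used as the common default, and the function name is not a Lucene API.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """BM25 tf factor: freq*(k1+1) / (freq + k1*(1 - b + b*len/avg_len))."""
    length_factor = 1.0 - b + b * field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)


AVG_LEN = 3.0  # hypothetical average title length in tokens

# "new york" has 2 tokens and tf=1 per query term;
# "new york, new york" has 4 tokens and tf=2 per query term.

# b = 0: length no longer matters, but tf still does.
short_b0 = bm25_tf_norm(1, 1.2, 0.0, 2, AVG_LEN)
long_b0 = bm25_tf_norm(2, 1.2, 0.0, 4, AVG_LEN)

# k1 = 0: neither length nor tf matters.
short_k0 = bm25_tf_norm(1, 0.0, 0.75, 2, AVG_LEN)
long_k0 = bm25_tf_norm(2, 0.0, 0.75, 4, AVG_LEN)

if __name__ == "__main__":
    print("b=0 :", short_b0, long_b0)
    print("k1=0:", short_k0, long_k0)
```

Under these assumptions b = 0 still scores the repeated title higher (by a saturating factor rather than a strict doubling), while k1 = 0 gives both titles the same tf factor.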
Re: tf and very short text fields
Yes, override TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to zero in your schema.

Walter Underwood wun...@wunderwood.org wrote:

And here is another peculiarity of short text fields. The movie "New York, New York" should not be twice as relevant for the query "new york". Is there a way to use a binary term frequency rather than a count?

wunder
--
Walter Underwood
wun...@wunderwood.org
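The effect of emitting 1f in tf() can be illustrated outside Lucene. The sketch below compares a square-root tf, which is what Lucene's default TF-IDF similarity uses, with the binary variant Markus suggests. It is plain arithmetic with hypothetical function names, not the Java override itself.

```python
import math


def classic_tf(freq):
    """Square-root tf, as in Lucene's default TF-IDF similarity."""
    return math.sqrt(freq)


def binary_tf(freq):
    """Binary tf: 1.0 for any match, 0.0 otherwise (the 'emit 1f in tf()' idea)."""
    return 1.0 if freq > 0 else 0.0


if __name__ == "__main__":
    for freq in (0, 1, 2):
        print(f"freq={freq}: classic={classic_tf(freq):.4f} binary={binary_tf(freq):.1f}")
```

With the square-root tf, a term occurring twice still contributes more than a term occurring once; with the binary tf, both contribute the same, which is the behaviour Walter asked for.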
Re: Re: tf and very short text fields
Also, if I remember correctly, setting k1 to zero for BM25 automatically omits norms from the calculation. So that's easy to play with without reindexing.
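A schema.xml fragment for trying BM25 with k1 set to zero might look like the sketch below. The field type name and the analyzer chain are placeholders; only the similarity element is the point. As far as I know, Solr 4.x also requires the global similarity to be solr.SchemaSimilarityFactory before a per-field-type similarity is honoured, so treat that part as an assumption to verify against your version.

```xml
<!-- global setting assumed necessary for per-field-type similarities -->
<similarity class="solr.SchemaSimilarityFactory"/>

<!-- "text_title" and its analyzer are hypothetical -->
<fieldType name="text_title" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">0.0</float>
    <float name="b">0.75</float>
  </similarity>
</fieldType>
```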
Re: tf and very short text fields
Thanks! We'll try that out and report back. I keep forgetting that I want to try BM25, so this is a good excuse.

wunder
--
Walter Underwood
wun...@wunderwood.org