RE: tf and very short text fields

2014-04-04 Thread Markus Jelsma
Hi - In this case Walter, iirc, was looking for two things: no normalization 
and a flat TF (1f for tf(float freq) > 0). We know that k1 controls TF 
saturation, but in BM25Similarity you can see that k1 is multiplied by the 
encoded norm value, taking b into account as well. So setting k1 to zero 
effectively disables length normalization and results in a flat, or binary, TF. 

Here's example explain output for k1 = 0 and k1 = 0.2. Norms are enabled on the 
field, and the term occurs three times in the field:

28.203003 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  1.0 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.0 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength

27.813797 = score(doc=0,freq=1.5 = phraseFreq=1.5), product of:
  6.4 = boost
  4.406719 = idf(docFreq=1, docCount=122)
  0.98619986 = tfNorm, computed from:
    1.5 = phraseFreq=1.5
    0.2 = parameter k1
    0.75 = parameter b
    8.721312 = avgFieldLength
    16.0 = fieldLength


With k1 = 0 you can clearly see the final tfNorm being 1, regardless of the 
term frequency and field length. Please correct my wrongs :)
Markus
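
The two tfNorm values in the explain output can be reproduced by hand. The sketch below is a minimal Python rendering of the textbook BM25 term-frequency component, which is assumed here to match what BM25Similarity computes. The function name tf_norm is made up for illustration, and the inputs are copied from the explain output. Small differences in the last digits are to be expected, since Lucene works with floats and an encoded norm.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    """Textbook BM25 tf component (assumed to match the explain output)."""
    # Length normalization factor. k1 multiplies it, so k1 = 0 removes it.
    length_norm = 1.0 - b + b * field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# Values taken from the explain output in this message.
freq, b, avg_len, length = 1.5, 0.75, 8.721312, 16.0
boost, idf = 6.4, 4.406719

for k1 in (0.0, 0.2):
    norm = tf_norm(freq, k1, b, length, avg_len)
    print(f"k1={k1}: tfNorm={norm:.6f} score={boost * idf * norm:.4f}")
```

With k1 = 0 the numerator and denominator are both just freq, so the result is 1 for any non-zero frequency and any field length.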

 
 


Re: tf and very short text fields

2014-04-04 Thread Ahmet Arslan
Hi,

Another simple approach is: 
If you don't use phrase queries or phrase boosting, you can set 
omitTermFreqAndPositions=true on the field.

Ahmet





Re: tf and very short text fields

2014-04-04 Thread Tom Burton-West
Thanks Markus,

I was thinking about normalization and was absolutely wrong about setting
k1 to zero. I should have taken a look at the algorithm and walked
through setting k1 = 0. (This is easier to do looking at the formula on
Wikipedia, http://en.wikipedia.org/wiki/Okapi_BM25, than walking through
the code.)
When you set k1 to 0 it does just what you said, i.e., it provides binary
tf. That part of the formula returns 1 if the term is present and 0 if
not, which is, I think, what Wunder was trying to accomplish.

Sorry about jumping in without double checking things first.

Tom
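
For reference, the walk-through described in this message can be written out. The symbols follow the Wikipedia page rather than the Lucene code: f(q, D) is the term frequency, |D| the field length, and avgdl the average field length.

```latex
\[
\mathrm{score}(D, q) = \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
% Setting k_1 = 0 removes the length term, so for any f(q, D) > 0:
\[
\frac{f(q, D)\,(0 + 1)}{f(q, D) + 0} = 1
\]
```

A term that does not occur in the field does not contribute to the score at all, which gives the 0 case.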





Re: tf and very short text fields

2014-04-03 Thread Michael Sokolov

On 4/1/14 2:32 PM, Walter Underwood wrote:

> And here is another peculiarity of short text fields.
>
> The movie "New York, New York" should not be twice as relevant for the
> query "new york". Is there a way to use a binary term frequency rather
> than a count?
>
> wunder
> --
> Walter Underwood
> wun...@wunderwood.org




Walter - you can write a custom scoring function in Java, or use 
function queries to compose one in the Solr query language.  I don't see 
an exists(term) function in the list here 
https://cwiki.apache.org/confluence/display/solr/Function+Queries that 
would return 0 or 1, but you could write that?


-Mike


Re: tf and very short text fields

2014-04-03 Thread Michael Sokolov

I see I missed Markus' earlier responses - somehow the messages didn't 
get threaded together in my reader.  I may have to try BM25 too!


Re: tf and very short text fields

2014-04-03 Thread Tom Burton-West
Hi Markus and Wunder,

I'm missing the original context, but I don't think BM25 will solve this
particular problem.

The k1 parameter sets how quickly the contribution of tf to the score falls
off with increasing tf. It would be helpful for making sure really long
documents don't get too high a score, but I don't think it would help for
very short documents without messing up its original design purpose.

For BM25, if you want to turn off length normalization, you set b to 0.
However, I don't think that will do what you want, since turning off
normalization will mean that the score for "new york, new york" will be
twice that of the score for "new york", since without normalization the tf
in "new york, new york" is twice that of "new york".

I think the earlier suggestion to override TFIDFSimilarity and emit 1f in
tf() is probably the best way to stop using tf counts, assuming that is
really what you want.

Tom
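
To make the b = 0 case concrete, here is a small sketch using the textbook BM25 tf component. It is not code from Lucene: tf_component is an illustrative name, and k1 = 1.2 is an assumed, commonly used default. With b = 0 the field length drops out of the calculation, while the term frequency still influences the result.

```python
def tf_component(freq, k1=1.2, b=0.0, field_length=1.0, avg_field_length=1.0):
    """Textbook BM25 tf component; the defaults are illustrative assumptions."""
    length_norm = 1.0 - b + b * field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * length_norm)


# With b = 0, field length has no effect on the result.
short_field = tf_component(2, field_length=4, avg_field_length=10)
long_field = tf_component(2, field_length=40, avg_field_length=10)
print(short_field == long_field)

# Term frequency still matters: tf = 2 scores higher than tf = 1.
print(tf_component(1), tf_component(2))
```

In this sketch tf = 2 scores higher than tf = 1, but less than twice as high, because k1 saturates the tf contribution.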









Re: tf and very short text fields

2014-04-01 Thread Markus Jelsma
Yes, override TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with 
k1 set to zero in your schema.
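
As a rough illustration of what emitting 1f in tf() changes, the Python sketch below models the idea rather than the Lucene API. classic_tf assumes the default TF-IDF behaviour of taking the square root of the frequency, and binary_tf stands in for the proposed override. Both function names are made up for this example.

```python
import math


def classic_tf(freq):
    """Assumed default TF-IDF behaviour: square root of the term frequency."""
    return math.sqrt(freq)


def binary_tf(freq):
    """Proposed override: 1.0 whenever the term occurs at all, else 0.0."""
    return 1.0 if freq > 0 else 0.0


# A query term occurring once in one title and twice in another.
for freq in (1, 2):
    print(freq, classic_tf(freq), binary_tf(freq))
```

With the binary version a title that repeats the query term gets the same tf contribution as one that contains it once.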


Walter Underwood wun...@wunderwood.org wrote:

> And here is another peculiarity of short text fields.
>
> The movie "New York, New York" should not be twice as relevant for the
> query "new york". Is there a way to use a binary term frequency rather
> than a count?
>
> wunder
> --
> Walter Underwood
> wun...@wunderwood.org





Re: Re: tf and very short text fields

2014-04-01 Thread Markus Jelsma
Also, if I remember correctly, setting k1 to zero for BM25 automatically omits 
norms from the calculation, so that's easy to play with without reindexing.







Re: tf and very short text fields

2014-04-01 Thread Walter Underwood
Thanks! We'll try that out and report back. I keep forgetting that I want to 
try BM25, so this is a good excuse.

wunder


--
Walter Underwood
wun...@wunderwood.org