Re: Scoring changes between 4.10 and 5.5

2016-06-10 Thread Upayavira
Tracked it down to this ticket:

https://issues.apache.org/jira/browse/LUCENE-6590

which changed the implementation of normalize() in
org.apache.lucene.search.similarities.TFIDFSimilarity.

I've asked for comment on that ticket.

Upayavira



Re: Scoring changes between 4.10 and 5.5

2016-06-09 Thread Ahmet Arslan
Hi,

I wondered the same before and failed to decipher TFIDFSimilarity.
Scoring looks like tf*idf*idf to me.

I'd appreciate it if someone could shed some light on this.

Thanks,
Ahmet





Scoring changes between 4.10 and 5.5

2016-06-09 Thread Upayavira
I've just done a very simple, single term query against a 4.10 system
and a 5.5 system, each with much the same data.

The score for the 4.10 system was essentially made up of the field
weight, which is:
   score = tf * idf 

Whereas, in the 5.5 system, there is an additional "query weight", which
is idf * query norm. If query norm is 1, then the final score is now:
   score = query_weight * field_weight
         = (idf * 1) * (tf * idf)
         = tf * idf^2
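
To make the arithmetic above concrete, here is a minimal Python sketch that plugs the numbers from the 5.5 explain output into both formulas. The function names are hypothetical and only mirror the labels in the explain output; this is not Lucene code.

```python
def field_weight(tf, idf, field_norm=1.0):
    """fieldWeight as labelled in the explain output: tf * idf * fieldNorm."""
    return tf * idf * field_norm


def query_weight(idf, query_norm=1.0):
    """queryWeight as labelled in the 5.5 explain output: idf * queryNorm."""
    return idf * query_norm


def score_with_query_weight(tf, idf, query_norm=1.0, field_norm=1.0):
    """Score as reported by the 5.5 system: queryWeight * fieldWeight."""
    return query_weight(idf, query_norm) * field_weight(tf, idf, field_norm)


# Values taken from the 5.5 explain output for the single-term query.
idf_55 = 5.339603
print(score_with_query_weight(tf=1.0, idf=idf_55))  # tf * idf^2
print(field_weight(tf=1.0, idf=idf_55))             # tf * idf only
```

With queryNorm = 1 the first value is idf squared, which is the 28.51136 reported by the 5.5 system. Purely as arithmetic, a queryNorm of 1/idf would cancel the extra factor and leave tf * idf, the shape of the 4.10 score.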

Can anyone explain why this new "query weight" element has appeared in
our scores somewhere between 4.10 and 5.5?

Thanks!

Upayavira

4.10 score 
  "2937439": {
    "match": true,
    "value": 5.5993805,
    "description": "weight(description:obama in 394012) [DefaultSimilarity], result of:",
    "details": [
      {
        "match": true,
        "value": 5.5993805,
        "description": "fieldWeight in 394012, product of:",
        "details": [
          {
            "match": true,
            "value": 1,
            "description": "tf(freq=1.0), with freq of:",
            "details": [
              {
                "match": true,
                "value": 1,
                "description": "termFreq=1.0"
              }
            ]
          },
          {
            "match": true,
            "value": 5.5993805,
            "description": "idf(docFreq=56010, maxDocs=5568765)"
          },
          {
            "match": true,
            "value": 1,
            "description": "fieldNorm(doc=394012)"
          }
        ]
      }
    ]
  }
5.5 score 
  "2502281": {
    "match": true,
    "value": 28.51136,
    "description": "weight(description:obama in 43472) [], result of:",
    "details": [
      {
        "match": true,
        "value": 28.51136,
        "description": "score(doc=43472,freq=1.0), product of:",
        "details": [
          {
            "match": true,
            "value": 5.339603,
            "description": "queryWeight, product of:",
            "details": [
              {
                "match": true,
                "value": 5.339603,
                "description": "idf(docFreq=31905, maxDocs=2446459)"
              },
              {
                "match": true,
                "value": 1.0,
                "description": "queryNorm"
              }
            ]
          },
          {
            "match": true,
            "value": 5.339603,
            "description": "fieldWeight in 43472, product of:",
            "details": [
              {
                "match": true,
                "value": 1.0,
                "description": "tf(freq=1.0), with freq of:",
                "details": [
                  {
                    "match": true,
                    "value": 1.0,
                    "description": "termFreq=1.0"
                  }
                ]
              },
              {
                "match": true,
                "value": 5.339603,
                "description": "idf(docFreq=31905, maxDocs=2446459)"
              },
              {
                "match": true,
                "value": 1.0,
                "description": "fieldNorm(doc=43472)"
              }
            ]
          }
        ]
      }
    ]
  },
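
As a sanity check on the two explain outputs, the idf values can be reproduced from docFreq and maxDocs. The sketch below assumes the classic TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)); that formula is an assumption about what DefaultSimilarity computes, not something stated in this thread, and the function name is hypothetical.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Assumed classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# docFreq and maxDocs copied from the explain outputs.
idf_410 = classic_idf(56010, 5568765)   # 4.10 system
idf_55 = classic_idf(31905, 2446459)    # 5.5 system

print(idf_410, idf_55)
print(idf_55 * idf_55)  # 5.5 score if tf = 1 and queryNorm = 1
```

Under that assumption the two idf values differ only slightly, which is what the different index sizes would suggest, so the gap between 5.5993805 and 28.51136 comes almost entirely from the second idf factor.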