Re: Scoring changes between 4.10 and 5.5
Tracked it down to this ticket:

https://issues.apache.org/jira/browse/LUCENE-6590

which changed the implementation of normalize() in
org.apache.lucene.search.similarities.TFIDFSimilarity.

I've asked for comment on that ticket.

Upayavira

On Fri, 10 Jun 2016, at 01:39 AM, Ahmet Arslan wrote:
> I wondered the same before and failed to decipher TFIDFSimilarity.
> Scoring looks like tf*idf*idf to me.
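For anyone who wants to check the tf*idf*idf reading against the explain values posted in this thread, here is a minimal Python sketch. It assumes the classic TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and a queryNorm and fieldNorm of 1 as shown in the explain output. The function names are made up for illustration and are not Lucene APIs.

```python
import math


def classic_tf(freq: float) -> float:
    # Assumed classic term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed classic inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int,
                 field_norm: float = 1.0) -> float:
    # fieldWeight as shown in the explain output: tf * idf * fieldNorm.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


def query_weight(doc_freq: int, max_docs: int, query_norm: float = 1.0) -> float:
    # queryWeight as shown in the 5.5 explain output: idf * queryNorm.
    return classic_idf(doc_freq, max_docs) * query_norm


# 4.10 explain: the score is the field weight alone (tf * idf).
score_410 = field_weight(1.0, doc_freq=56010, max_docs=5568765)

# 5.5 explain: the score is queryWeight * fieldWeight (tf * idf^2).
score_55 = (query_weight(doc_freq=31905, max_docs=2446459)
            * field_weight(1.0, doc_freq=31905, max_docs=2446459))

print(f"4.10: {score_410:.4f}  (explain reported 5.5993805)")
print(f"5.5:  {score_55:.4f}  (explain reported 28.51136)")
```

Both computed values agree with the reported scores to within float rounding, which is consistent with idf being applied once in 4.10 and twice in 5.5.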
Re: Scoring changes between 4.10 and 5.5
Hi,

I wondered the same before and failed to decipher TFIDFSimilarity.
Scoring looks like tf*idf*idf to me.

I would appreciate it if someone could shed some light on this.

Thanks,
Ahmet

On Friday, June 10, 2016 12:37 AM, Upayavira wrote:
> Can anyone explain why this new "query weight" element has appeared in
> our scores somewhere between 4.10 and 5.5?
Scoring changes between 4.10 and 5.5
I've just done a very simple, single term query against a 4.10 system
and a 5.5 system, each with much the same data.

The score for the 4.10 system was essentially made up of the field
weight, which is:

    score = tf * idf

Whereas, in the 5.5 system, there is an additional "query weight", which
is idf * query norm. If query norm is 1, then the final score is now:

    score = query_weight * field_weight
          = ( idf * 1 ) * (tf * idf)
          = tf * idf^2

Can anyone explain why this new "query weight" element has appeared in
our scores somewhere between 4.10 and 5.5?

Thanks!

Upayavira

4.10 score

    "2937439": {
      "match": true,
      "value": 5.5993805,
      "description": "weight(description:obama in 394012) [DefaultSimilarity], result of:",
      "details": [
        {
          "match": true,
          "value": 5.5993805,
          "description": "fieldWeight in 394012, product of:",
          "details": [
            {
              "match": true,
              "value": 1,
              "description": "tf(freq=1.0), with freq of:",
              "details": [
                {
                  "match": true,
                  "value": 1,
                  "description": "termFreq=1.0"
                }
              ]
            },
            {
              "match": true,
              "value": 5.5993805,
              "description": "idf(docFreq=56010, maxDocs=5568765)"
            },
            {
              "match": true,
              "value": 1,
              "description": "fieldNorm(doc=394012)"
            }
          ]
        }
      ]
    }

5.5 score

    "2502281": {
      "match": true,
      "value": 28.51136,
      "description": "weight(description:obama in 43472) [], result of:",
      "details": [
        {
          "match": true,
          "value": 28.51136,
          "description": "score(doc=43472,freq=1.0), product of:",
          "details": [
            {
              "match": true,
              "value": 5.339603,
              "description": "queryWeight, product of:",
              "details": [
                {
                  "match": true,
                  "value": 5.339603,
                  "description": "idf(docFreq=31905, maxDocs=2446459)"
                },
                {
                  "match": true,
                  "value": 1.0,
                  "description": "queryNorm"
                }
              ]
            },
            {
              "match": true,
              "value": 5.339603,
              "description": "fieldWeight in 43472, product of:",
              "details": [
                {
                  "match": true,
                  "value": 1.0,
                  "description": "tf(freq=1.0), with freq of:",
                  "details": [
                    {
                      "match": true,
                      "value": 1.0,
                      "description": "termFreq=1.0"
                    }
                  ]
                },
                {
                  "match": true,
                  "value": 5.339603,
                  "description": "idf(docFreq=31905, maxDocs=2446459)"
                },
                {
                  "match": true,
                  "value": 1.0,
                  "description": "fieldNorm(doc=43472)"
                }
              ]
            }
          ]
        }
      ]
    },