As an aside, is there a reason that idf is squared in each Term and Phrase match (it is multiplied both into the query component and the field component)? To compensate for this, I'm taking the square root of the idf I really want in my Similarity, which seems strange.
Hi Chuck,
that's a very good question. And you are right, it may be a bug, I am not sure about it. I stumbled over this several times when studying the code in the search package. It's a little bit difficult to explain since the code for score computation is distributed over the Weight and Scorer classes. It seems that the weight of a TermQuery or a PhraseQuery is multiplied by idf twice, first in sumOfSquaredWeights() and then in normalize(). That's what you discovered.
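To make the double multiplication concrete, here is a minimal sketch of the two steps. The class TermWeightSketch is a hypothetical stand-in that I made up for illustration, not the actual Lucene class, and it uses double instead of float for readability.

```java
// Hypothetical stand-in for a term weight; NOT the actual Lucene class.
public class TermWeightSketch {
    private final double idf;
    private final double boost;
    private double queryWeight;
    private double value;

    public TermWeightSketch(double idf, double boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // First step: idf enters the query-side weight.
    public double sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Second step: the normalized query weight is multiplied by idf again.
    public void normalize(double queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf; // value = idf^2 * boost * queryNorm
    }

    public double getValue() {
        return value;
    }
}
```

With idf = 2, boost = 1 and queryNorm = 0.5, the final value is 2 * 2 * 1 * 0.5 = 2, so idf contributes as a square.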
The formula in the Similarity Javadoc does not describe the scoring completely. I will try to write down the formula that exactly describes the current implementation. Then we can start a discussion and people can decide whether this is the intended scoring. (I assume DefaultSimilarity here.)
Let's start with the simple case. A pure TermQuery (one-word query) gets the following score after cancelling down queryNorm(t) and queryBoost(t) (coord is 1 here):
t: TermQuery
d: document
score(t, d) = tf(t in d) * idf(t) * fieldBoost(t.field in d) * lengthFieldNorm(t.field in d)
Note that fieldBoost and lengthFieldNorm are both combined in the norms.
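The cancellation for the single-term case can be checked numerically. This is only a sketch under my assumptions: the class and parameter names are hypothetical, tf and idf are passed in as already computed values, norm stands for fieldBoost * lengthFieldNorm, and the boost is assumed to be positive.

```java
// Hypothetical helper for illustration; not part of Lucene.
public class SingleTermScoreSketch {

    // Walks through the weight computation for a query with exactly one term.
    public static double score(double tf, double idf, double boost, double norm) {
        double queryWeight = idf * boost;
        // With a single term, the sum of squared weights is queryWeight^2,
        // so queryNorm is 1 / (idf * boost) for a positive boost.
        double queryNorm = 1.0 / Math.sqrt(queryWeight * queryWeight);
        // idf * boost * queryNorm cancels to 1, leaving a single idf factor.
        double value = queryWeight * queryNorm * idf;
        return tf * value * norm;
    }
}
```

For example, score(3, 2, 5, 0.5) comes out as 3 * 2 * 0.5 = 3 (up to floating-point rounding), independent of the boost of 5.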
For a BooleanQuery consisting of several TermQueries we get the following (again, we can cancel down queryBoost(q)):
q: BooleanQuery
t: Term and corresponding TermQuery
d: document
score(q, d) = coord(q, d) * queryNorm(q) * SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t) * fieldBoost(t.field in d) * lengthFieldNorm(t.field in d) )
where
coord(q, d) = "fraction of TermQueries occurring in d"
queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t))^2 ) )
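The BooleanQuery formula can be written out as a small sketch. Again the class and method names are hypothetical, the per-term values are passed in as plain arrays, norm[i] stands for fieldBoost * lengthFieldNorm of term i, and I treat a term as occurring in d when its tf is greater than zero.

```java
// Hypothetical helper for illustration; not part of Lucene.
public class BooleanScoreSketch {

    // All arrays have one entry per TermQuery in the BooleanQuery.
    public static double score(double[] tf, double[] idf, double[] boost, double[] norm) {
        int matching = 0;
        double sumOfSquaredWeights = 0.0;
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            double queryWeight = idf[i] * boost[i];
            sumOfSquaredWeights += queryWeight * queryWeight;
            if (tf[i] > 0) {
                matching++;
                // idf appears squared: once from the query side, once from the field side.
                sum += tf[i] * idf[i] * idf[i] * boost[i] * norm[i];
            }
        }
        double coord = (double) matching / tf.length;
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        return coord * queryNorm * sum;
    }
}
```

For two terms with idf 3 and 4, boosts and norms of 1, and only the first term occurring once in d, this gives coord = 0.5, queryNorm = 1/5 and a sum of 9, so the score is 0.9.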
I hope this starts a discussion.
Christoph