As an aside, is there a reason that idf is squared in each Term and
Phrase match (it is multiplied both into the query component and the
field component)?  To compensate for this, I'm taking the square root of
the idf I really want in my Similarity, which seems strange.

Hi Chuck,

That's a very good question. You are right, it may be a bug; I am
not sure about it. I stumbled over this several times when studying the
code in the search package. It is a little difficult to explain since
the code for score computation is distributed over the Weight and Scorer
classes. It seems that the weight of a TermQuery or a PhraseQuery is
multiplied by idf twice, first in sumOfSquaredWeights() and then in
normalize(). That's what you discovered.
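
To make the double multiplication concrete, here is a minimal sketch of the
bookkeeping as I read it. It is a simplified stand-in, not the actual Lucene
source: the class name TermWeightSketch is hypothetical, and idf and boost
are passed in as plain numbers instead of coming from a Similarity.

```java
// Simplified stand-in for the per-term weight bookkeeping (hypothetical class,
// not the Lucene source). idf and boost are supplied by the caller.
final class TermWeightSketch {
    final double idf;
    final double boost;
    double queryWeight; // query-side weight
    double value;       // final per-term weight applied to tf * norm

    TermWeightSketch(double idf, double boost) {
        this.idf = idf;
        this.boost = boost;
    }

    // First idf factor: goes into the query weight.
    double sumOfSquaredWeights() {
        queryWeight = idf * boost;
        return queryWeight * queryWeight;
    }

    // Second idf factor: multiplied in again after normalization.
    void normalize(double queryNorm) {
        queryWeight *= queryNorm;
        value = queryWeight * idf; // = idf^2 * boost * queryNorm
    }
}
```

With queryNorm = 1 the resulting value is idf^2 * boost, which is the
squared idf Chuck noticed.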

The formula in the Similarity Javadoc does not describe the scoring
completely. I will try to write down the formula that exactly describes
the current implementation. Then we can start a discussion and people can
decide whether this is the intended scoring. (I assume DefaultSimilarity here.)

Let's start with the simple case. A pure TermQuery (a one-word query) gets
the following score after cancelling queryNorm(t) and queryBoost(t)
(coord is 1 here):

t: TermQuery
d: document

score(t, d) =
 tf(t in d) * idf(t) * fieldBoost(t.field in d) * lengthFieldNorm(t.field in d)

Note that fieldBoost and lengthFieldNorm are combined into the stored norms.
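
The cancellation in the single-term case can be spelled out as follows. This
is only a worked sketch: it assumes idf(t) and queryBoost(t) are positive, it
uses queryNorm = 1 / SQRT(sum of squared idf * queryBoost) restricted to one
term, and norm(t, d) is my shorthand for
fieldBoost(t.field in d) * lengthFieldNorm(t.field in d).

```latex
\begin{align*}
\mathrm{queryNorm}(t)
  &= \frac{1}{\sqrt{\bigl(\mathrm{idf}(t)\,\mathrm{queryBoost}(t)\bigr)^{2}}}
   = \frac{1}{\mathrm{idf}(t)\,\mathrm{queryBoost}(t)} \\
\mathrm{score}(t, d)
  &= \mathrm{queryNorm}(t)\cdot \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}
     \cdot \mathrm{queryBoost}(t)\cdot \mathrm{norm}(t, d) \\
  &= \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)\cdot \mathrm{norm}(t, d)
\end{align*}
```

So one idf factor and the query boost drop out, leaving a single idf in the
one-word case.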

For a BooleanQuery consisting of several TermQueries we get the following
(again, queryBoost(q) cancels):

q: BooleanQuery
t: Term and corresponding TermQuery
d: document

score(q, d) = coord(q, d) * queryNorm(q) *
 SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t) * fieldBoost(t.field in d)
   * lengthFieldNorm(t.field in d) )

where
coord(q, d) = "fraction of TermQueries occurring in d"
queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t) )^2 ) )
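
For anyone who wants to plug in numbers, the formula above can be sketched in
plain Java. This is an illustration under my own assumptions, not Lucene code:
BooleanScoreSketch and TermStats are hypothetical names, tf, idf, queryBoost
and norm (fieldBoost * lengthFieldNorm) are supplied by the caller, and a term
counts as occurring in d when its tf is greater than 0.

```java
import java.util.List;

// Hypothetical helper that evaluates the BooleanQuery formula on plain numbers.
final class BooleanScoreSketch {

    // Per-term inputs for one document; norm = fieldBoost * lengthFieldNorm.
    static final class TermStats {
        final double tf, idf, queryBoost, norm;

        TermStats(double tf, double idf, double queryBoost, double norm) {
            this.tf = tf;
            this.idf = idf;
            this.queryBoost = queryBoost;
            this.norm = norm;
        }
    }

    // queryNorm(q) = 1 / SQRT( SUM ( (idf * queryBoost)^2 ) )
    static double queryNorm(List<TermStats> terms) {
        double sum = 0.0;
        for (TermStats t : terms) {
            double w = t.idf * t.queryBoost;
            sum += w * w;
        }
        return 1.0 / Math.sqrt(sum);
    }

    // coord(q, d) = fraction of terms with tf > 0 in the document
    static double coord(List<TermStats> terms) {
        int matching = 0;
        for (TermStats t : terms) {
            if (t.tf > 0) {
                matching++;
            }
        }
        return (double) matching / terms.size();
    }

    // score(q, d) = coord * queryNorm * SUM ( tf * idf^2 * queryBoost * norm )
    static double score(List<TermStats> terms) {
        double sum = 0.0;
        for (TermStats t : terms) {
            if (t.tf > 0) {
                sum += t.tf * t.idf * t.idf * t.queryBoost * t.norm;
            }
        }
        return coord(terms) * queryNorm(terms) * sum;
    }
}
```

With a single term in the list this reduces to tf * idf * norm, matching the
one-word case; with several terms the idf^2 stays inside the sum.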

I hope this starts a discussion.

Christoph
