Hello Lucene users, in the past I have asked a number of times about the scoring applied in Lucene 1.2 (which might also still be valid in current Lucene versions). At that time I was interested purely out of curiosity, but now I need it in order to write proper documentation.
At that time, with the kind help of Joaquin Delgado, I found an answer on a higher level in his posting ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200609.mbox/[EMAIL PROTECTED] ), which pointed me to this mailing list contribution ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/[EMAIL PROTECTED] ). According to these sources, the Lucene scoring formula in version 1.2 is:

  score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where

* score(q,d) : score for document d given query q
* sum_t      : sum over all terms t in q
* tf_q       : the square root of the frequency of t in q
* tf_d       : the square root of the frequency of t in d
* idf_t      : log(numDocs/(docFreq_t+1)) + 1.0
* numDocs    : number of documents in the index
* docFreq_t  : number of documents containing t
* norm_q     : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t   : square root of the number of tokens in d in the same field as t
* boost_t    : the user-specified boost for term t
* coord_q_d  : number of terms in both query and document / number of terms in query

The coordination factor gives an AND-like boost to documents that contain, e.g., all three terms of a three-word query over those that contain just two of the words.

This will now allow me to include the scoring formula in the documentation, which will be of great help. For verification, I have attached the formula as a picture generated from LaTeX. Please let me know if you find any mistake or if you think the formula could be simplified (I am not a mathematician...).

For an even deeper understanding, I would like to ask a few further questions. I am not an expert in Information Retrieval, so I hope my questions are not too basic to be embarrassing. I read the paper by Erica Chisholm and Tamara G.
Kolda ( http://citeseer.ist.psu.edu/rd/12896645%2C198082%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/8004/http:zSzzSzwww.ca.sandia.govzSz%7EtgkoldazSzpaperszSzornl-tm-13756.pdf/new-term-weighting-formulas.pdf ) to get a better idea of what kinds of vector space scoring strategies exist, in order to compare Lucene's scoring a bit with the rest of the world. My aim is basically to understand the strategic decisions that were made in Lucene (version 1.2). I have three questions:

1) tf_q and tf_d, i.e. all the term frequencies (TF) in the formula, are square roots, in order to dampen the bias from large term frequencies. Looking through a number of IR papers, it seems that the "normal" way of normalising TF is log. What is the motivation for choosing square root instead? Is there a simple mathematical reason, or is there any empirical evidence that this is the better strategy? Are there any papers that argue for this decision (perhaps with empirical data or otherwise)?

2) In Lucene's scoring algorithm, the query part is normalised with norm_q, which is sqrt(sum_t((tf_q*idf_t)^2)). In standard IR literature, this is referred to as Cosine Normalisation. The SMART system used this normalisation strategy, but only for the documents, not for the queries; queries were not normalised at all. The document terms in Lucene, on the other hand, are only normalised with norm_d_t, which is the square root of the number of tokens in d (which are also terms in my case) in the same field as t. On this I have two sub-questions:

2a) Why does Lucene apply Cosine Normalisation to the query? In a range of different IR system variations (as shown in Chisholm and Kolda's paper in the table on page 8), queries were not normalised at all. Is there a good reason or perhaps any empirical evidence that supports this decision?

2b) What is the motivation for the normalisation imposed on the documents (norm_d_t), which I have not seen before in any other system?
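To make question 1 concrete, here is a small comparison I put together of the two damping functions (the values are purely illustrative, not from Lucene): sqrt(tf) versus the 1 + log(tf) weighting common in the IR literature. Both reduce the influence of large raw term frequencies, but sqrt damps much less aggressively.

```java
public class TfDamping {
    public static void main(String[] args) {
        // sqrt grows considerably faster than 1 + log:
        // e.g. sqrt(100) = 10, while 1 + ln(100) is roughly 5.6,
        // so square-root damping lets high-frequency terms keep more weight.
        for (int tf : new int[]{1, 10, 100, 1000}) {
            System.out.printf("tf=%4d  sqrt=%8.3f  1+log=%8.3f%n",
                    tf, Math.sqrt(tf), 1.0 + Math.log(tf));
        }
    }
}
```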
Again, does anybody have pointers to literature on this?

3) What is the motivation for the additional normalisation by coord_q_d, on top of what is already described above? Again, is there any literature that argues for this normalisation?

Answers to these questions would greatly help me to link this scoring formula with other IR strategies, and would help me to appreciate the value of this great IR library even more. Any answer or partial answer to any of the questions would be greatly appreciated!

Best Regards and thanks in advance!
Karl
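P.S.: To double-check my reading of the formula, I sketched it in Java. This is only my own illustration of the formula as I understand it; the class and parameter names are mine, not Lucene's actual API.

```java
// Illustrative sketch of the Lucene 1.2 scoring formula as I understand it
// (my own names, not Lucene's classes). Terms are indexed 0..n-1; a term
// absent from the document has tfDoc[t] == 0 and contributes nothing.
public class ScoreSketch {

    // idf_t = log(numDocs/(docFreq_t+1)) + 1.0
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // tfQuery[t], tfDoc[t] : raw frequency of term t in the query / document
    // fieldLen[t]          : number of tokens in d in the same field as t
    // boost[t]             : user-specified boost for term t
    static double score(int numDocs, int[] docFreq, int[] tfQuery,
                        int[] tfDoc, int[] fieldLen, double[] boost) {
        int n = tfQuery.length;

        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))  -- cosine normalisation
        double normQ = 0.0;
        for (int t = 0; t < n; t++) {
            double w = Math.sqrt(tfQuery[t]) * idf(numDocs, docFreq[t]);
            normQ += w * w;
        }
        normQ = Math.sqrt(normQ);

        double sum = 0.0;
        int matched = 0;
        for (int t = 0; t < n; t++) {
            if (tfDoc[t] == 0) continue;   // term not in document
            matched++;
            double idfT = idf(numDocs, docFreq[t]);
            double queryPart = Math.sqrt(tfQuery[t]) * idfT / normQ;
            double docPart   = Math.sqrt(tfDoc[t]) * idfT
                               / Math.sqrt(fieldLen[t]);   // norm_d_t
            sum += queryPart * docPart * boost[t];
        }

        // coord_q_d = matched terms / query terms
        double coord = (double) matched / n;
        return sum * coord;
    }

    public static void main(String[] args) {
        // Two-term query against a document containing only the first term.
        double s = score(1000, new int[]{10, 50}, new int[]{1, 1},
                         new int[]{3, 0}, new int[]{100, 100},
                         new double[]{1.0, 1.0});
        System.out.println(s);
    }
}
```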