Hello Lucene users,

In the past, I have asked a number of times about the scoring formula used in 
Lucene 1.2 (which may well still be valid in current Lucene versions). Back 
then I was interested purely out of curiosity, but now I need it in order to 
write proper documentation.

At that time, with the kind help of Joaquin Delgado, I found a high-level 
answer in his posting ( 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200609.mbox/[EMAIL PROTECTED] 
), which pointed me to this mailing list contribution ( 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/[EMAIL PROTECTED] 
).

According to these sources, the Lucene scoring formula in version 1.2 is (a 
small code sketch of the computation follows the definitions below):

score(q,d) = sum_t( (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t) * 
boost_t ) * coord_q_d

where

    * score(q,d) : score for document d given query q
    * sum_t : sum over all terms t in q
    * tf_q : the square root of the frequency of t in q
    * tf_d : the square root of the frequency of t in d
    * idf_t : log(numDocs / (docFreq_t + 1)) + 1.0
    * numDocs : number of documents in index
    * docFreq_t : number of documents containing t
    * norm_q : sqrt(sum_t((tf_q*idf_t)^2))
    * norm_d_t : the square root of the number of tokens in d in the
      same field as t
    * boost_t : the user-specified boost for term t
    * coord_q_d : (number of terms in both query and document) /
      (number of terms in query). This coordination factor gives an
      AND-like boost to documents that contain, e.g., all three terms
      of a three-word query over those that contain just two of the
      words.
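
To double-check my understanding, here is a minimal, self-contained Java 
sketch of the computation as I read the formula above. This is my own 
illustrative code, not Lucene's implementation; all names (QueryTerm, 
fieldTokens, ...) are invented, and the per-term statistics would really come 
from the query and the index:

    import java.util.List;

    public class Lucene12Score {

        // Hypothetical per-term statistics; in Lucene these come from
        // the query and the index, not from a plain data class.
        static class QueryTerm {
            double tfInQuery;   // raw frequency of t in the query
            double tfInDoc;     // raw frequency of t in the document (0 if absent)
            double docFreq;     // number of documents containing t
            double boost;       // user-specified boost for t
            double fieldTokens; // number of tokens in d's field for t

            QueryTerm(double tfInQuery, double tfInDoc, double docFreq,
                      double boost, double fieldTokens) {
                this.tfInQuery = tfInQuery;
                this.tfInDoc = tfInDoc;
                this.docFreq = docFreq;
                this.boost = boost;
                this.fieldTokens = fieldTokens;
            }
        }

        // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
        static double idf(double numDocs, double docFreq) {
            return Math.log(numDocs / (docFreq + 1.0)) + 1.0;
        }

        static double score(List<QueryTerm> terms, double numDocs) {
            // norm_q: cosine normalisation over the query weights
            double normQ = 0.0;
            for (QueryTerm t : terms) {
                double w = Math.sqrt(t.tfInQuery) * idf(numDocs, t.docFreq);
                normQ += w * w;
            }
            normQ = Math.sqrt(normQ);

            double sum = 0.0;
            int overlap = 0; // query terms that occur in the document
            for (QueryTerm t : terms) {
                if (t.tfInDoc == 0) continue;
                overlap++;
                double idfT = idf(numDocs, t.docFreq);
                double queryPart = Math.sqrt(t.tfInQuery) * idfT / normQ;
                double docPart = Math.sqrt(t.tfInDoc) * idfT
                        / Math.sqrt(t.fieldTokens); // norm_d_t
                sum += queryPart * docPart * t.boost;
            }
            double coord = (double) overlap / terms.size(); // coord_q_d
            return sum * coord;
        }
    }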


This will now allow me to include the scoring formula in the documentation, 
which will be of great help. For verification, I have attached the formula as 
a picture generated from LaTeX. Please let me know if you find any mistake or 
if you think the formula could be simplified (I am not a mathematician...).

For an even deeper understanding, I would like to ask a few further questions. 
I am not an expert in Information Retrieval, so I hope my questions are not 
embarrassingly basic. I read the paper by Erica Chisholm and Tamara G. Kolda 
(http://citeseer.ist.psu.edu/rd/12896645%2C198082%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/8004/http:zSzzSzwww.ca.sandia.govzSz%7EtgkoldazSzpaperszSzornl-tm-13756.pdf/new-term-weighting-formulas.pdf)
 to get a better idea of what kinds of vector space scoring strategies exist, 
in order to compare Lucene's scoring a bit with the rest of the world. My aim 
is basically to understand the strategic decisions that were made in Lucene 
(version 1.2). I have three questions:

1) tf_q and tf_d, i.e. all the term frequencies (TF) in the formula, use 
square roots in order to dampen the bias from large term frequencies. Looking 
through a number of IR papers, it seems that the "normal" way of damping TF is 
a logarithm. What is the motivation for choosing the square root instead? Is 
there a simple mathematical reason, or is there empirical evidence that this 
is the better strategy? Are there any papers that argue for this decision 
(perhaps with empirical data or otherwise)?
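
For my own intuition I compared the two damping curves numerically; this is 
just a throwaway sketch, and the 1 + ln(tf) variant is only one common 
logarithmic weighting from the literature, not anything Lucene itself uses:

    public class TfDamping {
        public static void main(String[] args) {
            // Square-root damping (Lucene 1.2) vs. a common log weighting.
            for (int tf : new int[] { 1, 2, 4, 16, 256 }) {
                double bySqrt = Math.sqrt(tf);
                double byLog = 1.0 + Math.log(tf); // e.g. SMART-style 1+ln(tf)
                System.out.printf("tf=%3d  sqrt=%6.2f  1+ln=%6.2f%n",
                        tf, bySqrt, byLog);
            }
            // At tf=256: sqrt gives 16.0, 1+ln gives ~6.5, i.e. the square
            // root dampens large frequencies considerably less.
        }
    }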

2) In Lucene's scoring algorithm, the query part is normalised with norm_q, 
which is sqrt(sum_t((tf_q*idf_t)^2)). In standard IR literature this is 
referred to as cosine normalisation. The SMART system used this strategy as 
well, but only for documents; queries were not normalised at all. The document 
terms in Lucene, on the other hand, are only normalised with norm_d_t, i.e. 
the square root of the number of tokens in d (which are also terms in my case) 
in the same field as t. On this I have two sub-questions (a small sketch 
contrasting the two normalisations follows after 2b):

2a) Why does Lucene normalise the query with cosine normalisation? In a range 
of different IR system variations (as shown in the table on page 8 of Chisholm 
and Kolda's paper), queries were not normalised at all. Is there a good reason 
or perhaps any empirical evidence that supports this decision?

2b) What is the motivation for the normalisation imposed on the documents 
(norm_d_t), which I have not seen in any other system? Again, does anybody 
have pointers to literature on this?
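
To make sure I am comparing the right quantities, here is how I picture the 
two normalisations side by side (again my own sketch, not Lucene code; tfQ 
holds the raw query term frequencies):

    public class Norms {
        // norm_q: cosine normalisation, i.e. the Euclidean length of the
        // query vector of sqrt(tf_q) * idf_t weights.
        static double normQ(double[] tfQ, double[] idf) {
            double sumSquares = 0.0;
            for (int i = 0; i < tfQ.length; i++) {
                double w = Math.sqrt(tfQ[i]) * idf[i];
                sumSquares += w * w;
            }
            return Math.sqrt(sumSquares);
        }

        // norm_d_t: depends only on the length (in tokens) of the field
        // of d that t belongs to, not on the term weights themselves.
        static double normDt(int tokensInField) {
            return Math.sqrt(tokensInField);
        }
    }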

3) What is the motivation for the additional normalisation by coord_q_d, on 
top of what is already described above? Again, is there any literature that 
argues for this normalisation?
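
As I understand it, coord_q_d is simply the fraction of query terms that 
actually occur in the document, e.g.:

    public class Coord {
        // coord_q_d = (query terms found in the document) / (query terms);
        // a document matching 2 of 3 query terms gets 2/3, a full match 1.0.
        static double coord(int overlap, int queryTerms) {
            return (double) overlap / queryTerms;
        }

        public static void main(String[] args) {
            System.out.println(coord(2, 3)); // 0.666...
            System.out.println(coord(3, 3)); // 1.0
        }
    }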

Answers to these questions would greatly help me to relate this scoring 
formula to other IR strategies, and would help me to appreciate the value of 
this great IR library even more.

Any answer, or even a partial answer, to any of these questions would be 
greatly appreciated!

Best Regards and thanks in advance!
Karl

