Hi all,

We have found what we believe is a bug in Lucene's ranking function.
Most ranking functions use a non-linear method to saturate the
contribution of term frequencies.
This is because the information gained from observing a term for the
first time is greater than the information gained from seeing the same
term again. The non-linear method can be as simple as a logarithmic or
square-root function, or a more complex parameter-based approach such
as BM25's k1 parameter. S. Robertson (2004),
http://portal.acm.org/citation.cfm?id=1031181, has described the
dangers of combining scores from different document fields and the
most typical errors made when ranking functions are modified to
take document structure into account.
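
To make the saturation idea concrete, here is a minimal sketch of the
three kinds of function mentioned above. The function names are
hypothetical, the BM25 form leaves out document-length normalization,
and k1 = 1.2 is only a commonly used default, not a value taken from
Lucene.

```python
import math

def sqrt_saturation(tf):
    """Square-root damping of the raw term frequency."""
    return math.sqrt(tf)

def log_saturation(tf):
    """Logarithmic damping; 1 + tf so that tf = 0 contributes 0."""
    return math.log(1.0 + tf)

def bm25_saturation(tf, k1=1.2):
    """BM25-style term-frequency component without length normalization.

    The value approaches k1 + 1 as tf grows, so it is bounded.
    """
    return tf * (k1 + 1.0) / (tf + k1)

# In all three, the first occurrence of a term adds more to the score
# than the second one does.
for f in (sqrt_saturation, log_saturation, bm25_saturation):
    first_gain = f(1) - f(0)
    second_gain = f(2) - f(1)
    print(f.__name__, first_gain > second_gain)
```

Each line printed should end in True, which is the property the
paragraph above describes.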

To rank structured documents, Lucene combines the scores from the
individual document fields. The score of a structured document is
computed as a linear combination of the scores of each of its
fields.

Lucene's ranking function uses the square root of the term frequency
as its non-linear saturation method. However, the linear combination
of per-field scores that Lucene uses to compute the score of the whole
document breaks the saturation effect, because the fields' boost
factors are applied after the non-linear function has already been
applied. The consequence is that a document matching a single query
term in several fields can score much higher than a document matching
several query terms in only one field. This is not a good way to
compute relevance, and it tends to hurt ranking performance
dramatically.
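
A toy illustration of the effect, as a sketch only: the field names,
boost values, documents and function names below are all invented for
the example, and idf, length norms and every other scoring factor are
left out. It compares saturating inside each field and then summing
with boosts (the scheme described above) against summing the boosted
frequencies first and saturating once per term.

```python
import math

# Hypothetical field boosts, chosen only for illustration.
BOOSTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def score_per_field(doc, query_terms):
    """Saturate inside each field, then combine linearly with boosts."""
    total = 0.0
    for field, term_freqs in doc.items():
        for term in query_terms:
            total += BOOSTS[field] * math.sqrt(term_freqs.get(term, 0))
    return total

def score_combined_first(doc, query_terms):
    """Sum boosted frequencies across fields, then saturate once per term."""
    total = 0.0
    for term in query_terms:
        weighted_tf = sum(BOOSTS[field] * term_freqs.get(term, 0)
                          for field, term_freqs in doc.items())
        total += math.sqrt(weighted_tf)
    return total

QUERY = ["lucene", "ranking"]

# DOC_A matches one query term, once in each of three fields.
DOC_A = {"title": {"lucene": 1}, "anchor": {"lucene": 1}, "body": {"lucene": 1}}
# DOC_B matches both query terms, but only in the body.
DOC_B = {"title": {}, "anchor": {}, "body": {"lucene": 1, "ranking": 1}}

# Per-field saturation: A = 3 + 2 + 1 = 6.0, B = 1 + 1 = 2.0
print(score_per_field(DOC_A, QUERY), score_per_field(DOC_B, QUERY))
# Combine first: A = sqrt(6), roughly 2.45, B = 1 + 1 = 2.0
print(score_combined_first(DOC_A, QUERY), score_combined_first(DOC_B, QUERY))
```

In this toy setup the single-term document scores three times higher
than the two-term document under per-field saturation, while the gap
shrinks to a factor of about 1.2 when the frequencies are combined
before saturating. A bounded function such as the BM25 k1 form would
limit the single-term document's contribution even further.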

We have written a paper that describes this problem and reports some
experiments showing its effect on Lucene's performance:
http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf

Would it be possible to fix this problem so that Lucene works
properly for structured documents?

Thank you very much in advance,

jose

-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/

