Hi all,

We have realized that there is a bug in Lucene's ranking function. Most ranking functions use a non-linear method to saturate the contribution of term frequencies. This is because the information gained on observing a term for the first time is greater than the information gained on each subsequent occurrence of the same term. The non-linear method can be as simple as a logarithm or a square root, or a more complex parameter-based approach such as the k1 parameter in BM25. S. Robertson (2004), http://portal.acm.org/citation.cfm?id=1031181, has described the dangers of combining scores from different document fields and the most typical errors made when ranking functions are modified to take document structure into account.

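As a rough illustration of the saturation idea and of the field-combination danger mentioned above, here is a minimal Python sketch. The function names, the two toy documents, the unit boosts, and the omission of IDF and length normalization are all assumptions made for illustration; this is not Lucene's actual scoring code.

```python
import math

def sqrt_tf(tf):
    """Square-root saturation of a term frequency."""
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2):
    """BM25-style saturation without length normalization; bounded by k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)

def per_field_sum(field_tfs, boosts, sat):
    """Saturate each field separately, then combine linearly with boosts."""
    return sum(boosts[f] * sat(tf) for f, tf in field_tfs.items())

def combine_then_saturate(field_tfs, boosts, sat):
    """Combine boosted frequencies across fields first, then saturate once."""
    return sat(sum(boosts[f] * tf for f, tf in field_tfs.items()))

def score(doc, query_terms, boosts, combine, sat=sqrt_tf):
    """Sum the per-term scores; IDF is ignored (treated as 1) in this toy."""
    return sum(combine(doc.get(t, {}), boosts, sat) for t in query_terms)

# Hypothetical toy data: term -> field -> term frequency.
boosts = {"title": 1.0, "body": 1.0, "anchor": 1.0}
query = ["a", "b"]
doc_x = {"a": {"title": 1, "body": 1, "anchor": 1}}  # one term, three fields
doc_y = {"a": {"body": 1}, "b": {"body": 1}}         # two terms, one field

x_linear = score(doc_x, query, boosts, per_field_sum)
y_linear = score(doc_y, query, boosts, per_field_sum)
x_joint = score(doc_x, query, boosts, combine_then_saturate)
y_joint = score(doc_y, query, boosts, combine_then_saturate)
```

Under these assumptions, the per-field sum ranks doc_x (3.0) above doc_y (2.0), even though doc_x matches only one of the two query terms. Saturating once per term after combining the field frequencies ranks doc_y (2.0) above doc_x (about 1.73).
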
To rank structured documents, Lucene combines the scores from the individual document fields. The score of a structured document is computed as a linear combination of its per-field scores. Lucene's ranking function uses the square root of the term frequency as its non-linear saturation method, but the linear combination of per-field scores breaks the saturation effect, because the field boost factors are applied after the non-linear function has already been applied within each field. The consequence is that a document matching a single query term in several fields can score much higher than a document matching several query terms in a single field. This is not a good way to compute relevance, and it tends to hurt the performance of the ranking function dramatically.

We have written a paper that describes this problem and reports some experiments showing its effect on Lucene's performance:

http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf

Would it be possible to fix this problem so that Lucene works properly for structured documents?

Thank you very much in advance,

jose

--
Jose R. Pérez-Agüera
Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: [email protected]
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/
