problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi all, we have realized that there is a bug in Lucene's ranking function. Most ranking functions use a non-linear method to saturate the computation of term frequencies. This is because the information gained on observing a term the first time is greater than the information gained on subs
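The saturation idea described in this message can be sketched in a few lines. This is only an illustration, not Lucene's code: the BM25-style form tf / (tf + k1) and the value k1 = 1.2 are assumptions chosen for the example, and `saturate` is a hypothetical helper name.

```python
def saturate(tf: float, k1: float = 1.2) -> float:
    """BM25-style saturation (assumed form): rises with tf but stays below 1."""
    return tf / (tf + k1)


# Extra score contributed by each additional occurrence of the same term.
gains = [saturate(n) - saturate(n - 1) for n in range(1, 6)]

if __name__ == "__main__":
    for n, gain in enumerate(gains, start=1):
        print(f"occurrence {n}: +{gain:.3f}")
```

Each further occurrence adds less than the one before it, which is the "first observation carries the most information" behaviour the message refers to.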

Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
José, you might want to watch LUCENE-2392. In this issue, we are proposing to add flexibility to the scoring mechanism, including:
* controlling scoring on a per-field basis
* the ability to compute and use aggregate statistics (average field length, total TF across all docs)
* fine-grai

Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert, thank you very much for your quick response. I have a couple of questions: did you read the papers that I mentioned in my e-mail? Do you think that Lucene's ranking function could have this problem? My concern is not about how to implement different kinds of ranking functions for Lucene, I

Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
2010/5/5 José Ramón Pérez Agüera
> Hi Robert,
>
> thank you very much for your quick response. I have a couple of questions:
>
> did you read the papers that I mentioned in my e-mail?
Yes.
> Do you think that Lucene's ranking function could have this problem?
I know it does.
> My concern i

Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert, the problem is not the linear combination of fields; the problem is applying the per-field boost factor after the term frequency saturation function and then making the linear combination of fields. Every system that implements BM25F, including Terrier, takes care of that, because if you do
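The ordering issue raised here can be made concrete with a toy calculation. The sketch below is an assumption-laden illustration, not Lucene's or Terrier's implementation: it ignores IDF and length normalization, uses an assumed BM25-style saturation with k1 = 1.2, and the field names, boosts, and function names are all hypothetical.

```python
def saturate(tf, k1=1.2):
    """Assumed BM25-style saturation; the result stays below 1."""
    return tf / (tf + k1)


def linear_combination(field_tfs, boosts):
    """Saturate each field separately, then boost and sum across fields."""
    return sum(boosts[f] * saturate(tf) for f, tf in field_tfs.items())


def bm25f_style(field_tfs, boosts):
    """Boost and sum raw frequencies across fields, then saturate once."""
    return saturate(sum(boosts[f] * tf for f, tf in field_tfs.items()))


def score(doc, per_term):
    """Sum the per-term scores; doc holds one {field: tf} dict per query term."""
    return sum(per_term(field_tfs, boosts) for field_tfs in doc)


# Hypothetical fields with equal boosts.
boosts = {"title": 1.0, "body": 1.0, "anchor": 1.0}

# Document A: a single query term, once in each of three fields.
doc_a = [{"title": 1, "body": 1, "anchor": 1}]
# Document B: two different query terms, each once in one field only.
doc_b = [{"body": 1}, {"body": 1}]

if __name__ == "__main__":
    for name, fn in [("linear", linear_combination), ("bm25f-style", bm25f_style)]:
        print(name, "A:", round(score(doc_a, fn), 3), "B:", round(score(doc_b, fn), 3))
```

Under these assumptions, saturating per field lets document A outscore document B, because the per-term contribution is no longer bounded; saturating once after combining the fields keeps each term's contribution below 1, so document B ranks first.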

Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
2010/5/5 José Ramón Pérez Agüera
> Hi Robert,
>
> the problem is not the linear combination of fields; the problem is
> applying the per-field boost factor after the term frequency saturation
> function and then making the linear combination of fields. Every system
> that implements BM25F, including

Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert, I will be very happy to see this problem fixed :-) I cannot imagine what reasons people have to use software with bugs; I guess that other bugs in Lucene are removed. Anyway, if you are finally going to fix the problem, this is good news :-) Thank you very much for your time. jose O

Re: problem in Lucene's ranking function

2010-05-05 Thread Yonik Seeley
2010/5/5 José Ramón Pérez Agüera:
[...]
> The consequence is that a document
> matching a single query term over several fields could score much
> higher than a document matching several query terms in one field only,
One partial workaround that people use is DisjunctionMaxQuery (used by "dismax"
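To see why a max-over-fields combination can soften the effect quoted above, here is a small arithmetic sketch. It does not call Lucene; it only mimics the documented DisjunctionMaxQuery rule (best subquery score plus a tie-breaker fraction of the remaining ones). The saturation form, k1 = 1.2, the field names, and the function names are assumptions made for the example.

```python
def saturate(tf, k1=1.2):
    """Assumed BM25-style saturation; the result stays below 1."""
    return tf / (tf + k1)


def dismax(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


def score(doc, tie_breaker=0.0):
    """Sum over query terms of the dismax over that term's per-field scores."""
    return sum(
        dismax([saturate(tf) for tf in field_tfs.values()], tie_breaker)
        for field_tfs in doc
    )


# Document A: a single query term, once in each of three hypothetical fields.
doc_a = [{"title": 1, "body": 1, "anchor": 1}]
# Document B: two different query terms, each once in one field only.
doc_b = [{"body": 1}, {"body": 1}]

if __name__ == "__main__":
    print("A:", round(score(doc_a), 3), "B:", round(score(doc_b), 3))
```

With a tie breaker of 0, repeated matches of one term across fields no longer add up, so document B ranks above document A in this toy setup; with a tie breaker of 1 the rule falls back to the plain sum over fields.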

Re: problem in Lucene's ranking function

2010-05-06 Thread José Ramón Pérez Agüera
Thank you very much for your answer, but even if we try to solve the problem at the boolean layer, the problem remains in the ranking function; therefore the quality of the ranking would be very low, since the term frequency function is not computed properly. jose On Wed, May 5, 2010 at 4:11 PM, Yonik Seele