2010/5/5 José Ramón Pérez Agüera <jose.agu...@gmail.com> > Hi Robert, > > thank you very much for your quick response, I have a couple of questions, > > did you read the papers that I mention in my e-mail? >
Yes. > do you think that Lucene ranking function could have this problem? > > I know it does. > My concern is not about how to implement different kind of ranking > functions for Lucene, I know that you are doing a very nice work to > implement a very flexible ranking framework for Lucene, my concern is > about a bug, which is independent of the ranking function that you are > using and which appears whether some kind of saturation function is > used in combination with a linear combination of fields for structured > documents. > I think we might disagree here though. Must 'the combining of scores from different fields' must be hardcoded to one simple solution, or should it be something that you can control yourself? For example, it appears Terrier implements something different for this problem, not the paper you referenced but a different technique?: http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html But I don't quite understand all the subleties involved... it seems in this other paper there is still a linear combination, but you introduce additional per-field parameters. The thing that makes me nervous about "hardcoding/changing" the way that scores are combined across fields is that Lucene presents some strange peculiarities, most notably the ability to use different scoring models for different fields. This in fact already exists today, if you "omitTF" for one field but not for another, you are using a different scoring model for the two fields. > Maybe I'm wrong, but if the linear combination of fields remains in > lucene ranking function core, Lucene is never going to work properly > to compute the score for structured documents. > I wouldn't say never, maybe we will not get there in the first go, but hopefully at least you will be able to do the things i mentioned above, such as using different similarities for different fields, including ones that are not supported today. > > I know how to solve the problem, and we have our own implementation of > BM25F for Lucene which performance is much better that standard > Lucene's ranking function, but I think that would be useful for other > Lucene users to know what is the problem to deal with structured > documents, and how to fix this problem for the next version, > independently what ranking function is finally implemented for Lucene. > > It would be great if you could help us on that issue (I know the patch is a bit out of date), to try to fix the scoring APIs, including perhaps thinking about how to improve search across multiple fields for structured documents. In my opinion, I would like to see the situation evolve away from "which ranking function is implemented for Lucene" instead to having a variety of built-in functions you can choose from. So, I would rather it be more like Analyzers, where we have a variety of high-quality implementations available, and you can make your own if you must, but there is no real default. -- Robert Muir rcm...@gmail.com