Hi Robert, the problem is not the linear combination of fields, the problem is to apply the boost factor per field after the term frequency saturation function and then make the linear combination of fields. Every system that implement BM25F, including terrier, take care of that, because if you don't do it you have a bug in your ranking function and not just a different ranking function.
It is very easy solve the problem in the current Lucene ranking function, just move the boost factor per field inside the term frequency square root, that's all. If you implement this little change, Lucene ranking fucntion will work properly with structured documents and all your other concerns about allowing users to implement different ranking functions for different situations will be not affected by this change. I really appreciate your work to improve Lucene ranking function, and your time to response this emails :-) best jose On Wed, May 5, 2010 at 2:12 PM, Robert Muir <rcm...@gmail.com> wrote: > 2010/5/5 José Ramón Pérez Agüera <jose.agu...@gmail.com> > >> Hi Robert, >> >> thank you very much for your quick response, I have a couple of questions, >> >> did you read the papers that I mention in my e-mail? >> > > Yes. > > >> do you think that Lucene ranking function could have this problem? >> >> > I know it does. > > >> My concern is not about how to implement different kind of ranking >> functions for Lucene, I know that you are doing a very nice work to >> implement a very flexible ranking framework for Lucene, my concern is >> about a bug, which is independent of the ranking function that you are >> using and which appears whether some kind of saturation function is >> used in combination with a linear combination of fields for structured >> documents. >> > > I think we might disagree here though. Must 'the combining of scores from > different fields' must be hardcoded to one simple solution, or should it be > something that you can control yourself? > > For example, it appears Terrier implements something different for this > problem, not the paper you referenced but a different technique?: > http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html > But > I don't quite understand all the subleties involved... it seems in this > other paper there is still a linear combination, but you introduce > additional per-field parameters. > > The thing that makes me nervous about "hardcoding/changing" the way that > scores are combined across fields is that Lucene presents some strange > peculiarities, most notably the ability to use different scoring models for > different fields. This in fact already exists today, if you "omitTF" for one > field but not for another, you are using a different scoring model for the > two fields. > > >> Maybe I'm wrong, but if the linear combination of fields remains in >> lucene ranking function core, Lucene is never going to work properly >> to compute the score for structured documents. >> > > I wouldn't say never, maybe we will not get there in the first go, but > hopefully at least you will be able to do the things i mentioned above, such > as using different similarities for different fields, including ones that > are not supported today. > > >> >> I know how to solve the problem, and we have our own implementation of >> BM25F for Lucene which performance is much better that standard >> Lucene's ranking function, but I think that would be useful for other >> Lucene users to know what is the problem to deal with structured >> documents, and how to fix this problem for the next version, >> independently what ranking function is finally implemented for Lucene. >> >> > It would be great if you could help us on that issue (I know the patch is a > bit out of date), to try to fix the scoring APIs, including perhaps thinking > about how to improve search across multiple fields for structured documents. > > In my opinion, I would like to see the situation evolve away from "which > ranking function is implemented for Lucene" instead to having a variety of > built-in functions you can choose from. > > So, I would rather it be more like Analyzers, where we have a variety of > high-quality implementations available, and you can make your own if you > must, but there is no real default. > > -- > Robert Muir > rcm...@gmail.com > -- Jose R. Pérez-Agüera Clinical Assistant Professor Metadata Research Center School of Information and Library Science University of North Carolina at Chapel Hill email: jagu...@email.unc.edu Web page: http://www.unc.edu/~jaguera/ MRC website: http://ils.unc.edu/mrc/ --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org