Thanks, everyone, and especially Tom for the detailed explanation of this topic. Of course, in academic work results need to be interpreted carefully. What I care about is the end user's point of view: will using BM25 give better ranking than Lucene's original VSM + Boolean model? How significant will the difference be? I'd like to hear about the community's experience.
Floyd

2012/11/16 Tom Burton-West <tburt...@umich.edu>

> Hello Floyd,
>
> There is a ton of research literature out there comparing BM25 to vector
> space. But you have to be careful interpreting it.
>
> BM25 originally beat the SMART vector space model in the early TRECs
> because it did better tf and length normalization. Pivoted Document
> Length normalization was invented to get the vector space model to catch up
> to BM25. (Just Google for Singhal length normalization. Amit Singhal,
> now chief of Google Search, did his doctoral thesis on this and it is
> available. Similarly, Stephen Robertson, now at Microsoft Research,
> published a ton of studies of BM25.)
>
> The default Solr/Lucene similarity class doesn't provide the length or tf
> normalization tuning params that BM25 does. There is the sweetspot
> similarity, but that doesn't quite work the same way that the BM25
> normalizations do.
>
> Document length normalization needs and parameter tuning all depend on
> your data. So if you are reading a comparison, you need to determine:
> 1) When comparing recall/precision etc. between vector space and BM25, did
> the experimenter tune both the vector space and the BM25 parameters?
> 2) Are the documents (and queries) they are using in the test similar in
> length characteristics to your documents and queries?
>
> We are planning to do some testing in the next few months for our use case,
> which is 10 million books where we index the entire book. These are
> extremely long documents compared to most IR research.
> I'd love to hear about actual (non-research) production implementations
> that have tested the new ranking models available in Solr.
>
> Tom
>
> On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu <floyd...@gmail.com> wrote:
>
> > Hi there,
> > Can anybody kindly tell me how to set up Solr to use BM25?
> > By the way, is there any experiment or research that compares BM25 and
> > the classical VSM model in recall/precision?
> >
> > Thanks in advance.
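
To make the tf and length normalization point above concrete, here is a minimal Python sketch that compares the per-term weight of BM25 with the classic Lucene-style weight. It is an illustration under stated assumptions, not Lucene's actual code: idf, boosts, and Lucene's lossy norm encoding are left out, the function names are made up for this example, and k1=1.2 and b=0.75 are just the commonly used starting values.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (idf omitted).

    k1 controls how quickly repeated occurrences saturate;
    b controls how strongly long documents are penalized (0 = not at all).
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def classic_term_weight(tf, doc_len):
    """Classic Lucene-style weight: sqrt(tf) * 1/sqrt(doc_len).

    idf, boosts, and norm quantization are omitted; there are no tuning knobs.
    """
    return math.sqrt(tf) / math.sqrt(doc_len)


if __name__ == "__main__":
    avg_len = 1000  # hypothetical average document length, in terms
    for doc_len in (100, 1000, 100000):  # short doc, average doc, whole book
        for tf in (1, 5, 50):
            print(
                f"doc_len={doc_len:>6} tf={tf:>2} "
                f"bm25={bm25_term_weight(tf, doc_len, avg_len):.4f} "
                f"classic={classic_term_weight(tf, doc_len):.4f}"
            )
```

The BM25 weight can never exceed k1 + 1 however often a term repeats, and setting b to 0 switches length normalization off entirely. The classic weight keeps growing with sqrt(tf) and always divides by sqrt(doc_len). How much this changes the final ranking still depends on your documents and queries, which is why testing on your own data matters.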