Hi Floyd,

I don't think there is a general answer to that question. You would have to test it with your own corpus/index and your own queries. If you have those and can run two indices, one using BM25 and the other using VSM (or anything else you want to compare), you would want to do some A/B testing and compare various metrics that indicate which search is better. Have a look at the picture at http://blog.sematext.com/2012/01/06/relevance-tuning-and-competitive-advantage-via-search-analytics/ to see what I mean.
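As a rough illustration of comparing two indices on one metric, here is a minimal Python sketch. It assumes you have collected, for a set of test queries, the ranked document ids returned by each index plus some notion of which documents are relevant (explicit judgments, or clicks treated as judgments). The query ids, document ids, and function names are all made up for the example; precision at k is only one of many metrics you might track.

```python
from statistics import mean


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def mean_precision_at_k(run, judgments, k):
    """Average precision@k over every query that has judgments."""
    return mean(precision_at_k(run[q], judgments[q], k) for q in judgments)


# Hypothetical data: relevant documents per query, and the ranked
# results each index returned for the same queries.
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}
bm25_run = {"q1": ["d1", "d3", "d9"], "q2": ["d7", "d2", "d4"]}
vsm_run = {"q1": ["d9", "d1", "d3"], "q2": ["d2", "d4", "d7"]}

for name, run in (("BM25", bm25_run), ("VSM", vsm_run)):
    print(name, mean_precision_at_k(run, judgments, k=2))
```

With real traffic you would want many more queries than this, and probably several metrics side by side, before concluding that one ranking is better.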
Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Fri, Nov 16, 2012 at 12:28 AM, Floyd Wu <floyd...@gmail.com> wrote:

> Thanks everyone, especially Tom, for the detailed explanation of this
> topic.
> Of course, in academia we need to interpret results carefully. What I care
> about is the end user's point of view: will using BM25 result in better
> ranking than Lucene's original VSM+Boolean model? How significant will the
> difference be?
> I'd like to see some sharing from the community.
>
> Floyd
>
>
> 2012/11/16 Tom Burton-West <tburt...@umich.edu>
>
> > Hello Floyd,
> >
> > There is a ton of research literature out there comparing BM25 to vector
> > space. But you have to be careful interpreting it.
> >
> > BM25 originally beat the SMART vector space model in the early TRECs
> > because it did better tf and length normalization. Pivoted Document
> > Length normalization was invented to get the vector space model to catch
> > up to BM25. (Just Google for Singhal length normalization. Amit Singhal,
> > now chief of Google Search, did his doctoral thesis on this and it is
> > available. Similarly, Stephen Robertson, now at Microsoft Research,
> > published a ton of studies of BM25.)
> >
> > The default Solr/Lucene similarity class doesn't provide the length or tf
> > normalization tuning params that BM25 does. There is the sweetspot
> > similarity, but that doesn't quite work the same way that the BM25
> > normalizations do.
> >
> > Document length normalization needs and parameter tuning all depend on
> > your data. So if you are reading a comparison, you need to determine:
> > 1) When comparing recall/precision etc. between vector space and BM25,
> > did the experimenter tune both the vector space and the BM25 parameters?
> > 2) Are the documents (and queries) they are using in the test similar in
> > length characteristics to your documents and queries?
> >
> > We are planning to do some testing in the next few months for our use
> > case, which is 10 million books where we index the entire book. These are
> > extremely long documents compared to most IR research.
> > I'd love to hear about actual (non-research) production implementations
> > that have tested the new ranking models available in Solr.
> >
> > Tom
> >
> >
> > On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu <floyd...@gmail.com> wrote:
> >
> > > Hi there,
> > > Can anybody kindly tell me how to set up Solr to use BM25?
> > > By the way, is there any experiment or research that compares BM25
> > > and the classical VSM model on recall/precision?
> > >
> > > Thanks in advance.
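
For reference, the tf and length normalization discussed in the thread can be sketched in a few lines of Python. This is a common textbook form of the BM25 per-term score, not a copy of Lucene's implementation, which differs in details. The function name and the example numbers are made up; `k1` controls how quickly term frequency saturates and `b` controls how strongly document length is normalized, and the defaults shown (1.2 and 0.75) are just commonly used starting values.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term in one document (textbook BM25)."""
    # Inverse document frequency; the "1 +" keeps the value non-negative.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: 1.0 for an average-length document,
    # larger for longer documents, and always 1.0 when b == 0.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: the fraction approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Hypothetical collection: 10,000 docs, average length 1,000 tokens,
# and a term that occurs in 50 documents.
short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=1000, n_docs=10000, doc_freq=50)
long_doc = bm25_term_score(tf=3, doc_len=10000, avg_doc_len=1000, n_docs=10000, doc_freq=50)
print(short_doc, long_doc)
```

With the same term frequency, the much longer document gets the lower score, and setting `b=0.0` removes that effect entirely. That sensitivity is why the right value of `b` depends so heavily on how long your documents are.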