Re: BM25 model for solr 4?
Hi Floyd,

I don't think there is a general answer to that question. You would have to test it with your own corpus/index and your own queries. If you have those and can run two indices, one using BM25 and the other using VSM or whatever else you want to compare, you would want to do some A/B testing and compare various metrics that indicate which search is better.

Have a look at the picture on http://blog.sematext.com/2012/01/06/relevance-tuning-and-competitive-advantage-via-search-analytics/ to see what I mean.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Fri, Nov 16, 2012 at 12:28 AM, Floyd Wu wrote:
> Thanks everyone, especially Tom, for the detailed explanation of this topic.
> Of course, in academic work we need to interpret results carefully. What I
> care about is the end user's point of view: will using BM25 give better
> ranking than Lucene's original VSM+Boolean model? How significant will the
> difference be?
> I'd like to hear some experiences from the community.
>
> Floyd
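As a rough illustration of the kind of metric comparison suggested above, here is a minimal sketch that scores two result sets (say, one from a BM25 index and one from a VSM index) against the same relevance judgments. The choice of precision at k and mean reciprocal rank is an assumption, as are all function names and the data layout; in a live A/B test the "relevant" sets would more likely be derived from clicks or other user signals.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def compare_runs(run_a, run_b, relevant, k=10):
    """Average both metrics over every judged query for two systems.

    run_a / run_b: dict mapping query -> ranked list of document ids
    relevant:      dict mapping query -> set of relevant document ids
    """
    if not relevant:
        raise ValueError("need at least one judged query")

    def summarize(run):
        queries = sorted(relevant)
        p = [precision_at_k(run.get(q, []), relevant[q], k) for q in queries]
        rr = [reciprocal_rank(run.get(q, []), relevant[q]) for q in queries]
        return {"P@k": sum(p) / len(p), "MRR": sum(rr) / len(rr)}

    return {"A": summarize(run_a), "B": summarize(run_b)}
```

Both systems must be run on the same queries for the averages to be comparable; a query missing from one run counts as an empty result list.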
Re: BM25 model for solr 4?
Thanks everyone, especially Tom, for the detailed explanation of this topic.

Of course, in academic work we need to interpret results carefully. What I care about is the end user's point of view: will using BM25 give better ranking than Lucene's original VSM+Boolean model? How significant will the difference be?

I'd like to hear some experiences from the community.

Floyd

2012/11/16 Tom Burton-West
> Hello Floyd,
>
> There is a ton of research literature out there comparing BM25 to vector
> space. But you have to be careful interpreting it.
Re: BM25 model for solr 4?
Hello Floyd,

There is a ton of research literature out there comparing BM25 to vector space. But you have to be careful interpreting it.

BM25 originally beat the SMART vector space model in the early TRECs because it did better tf and length normalization. Pivoted Document Length normalization was invented to get the vector space model to catch up to BM25. (Just Google for Singhal length normalization. Amit Singhal, now chief of Google Search, did his doctoral thesis on this and it is available. Similarly, Stephen Robertson, now at Microsoft Research, published a ton of studies of BM25.)

The default Solr/Lucene similarity class doesn't provide the length or tf normalization tuning params that BM25 does. There is the sweetspot similarity, but that doesn't quite work the same way that the BM25 normalizations do.

Document length normalization needs and parameter tuning both depend on your data. So if you are reading a comparison, you need to determine:

1) When comparing recall/precision etc. between vector space and BM25, did the experimenter tune both the vector space and the BM25 parameters?
2) Are the documents (and queries) used in the test similar in length characteristics to your documents and queries?

We are planning to do some testing in the next few months for our use case, which is 10 million books where we index the entire book. These are extremely long documents compared to most IR research.

I'd love to hear about actual (non-research) production implementations that have tested the new ranking models available in Solr.

Tom

On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu wrote:
> Hi there,
> Can anybody kindly tell me how to set up Solr to use BM25?
> By the way, are there any experiments or research comparing BM25 and the
> classical VSM model in terms of recall/precision?
>
> Thanks in advance.
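To make the tf and length normalization parameters mentioned above concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single query term. It is an illustration only, not Lucene's actual code: the function name, argument names, default values (k1 = 1.2, b = 0.75) and the particular non-negative idf variant are assumptions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Score contribution of one query term in one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document, in terms
    avg_doc_len -- average document length in the collection
    doc_freq    -- number of documents containing the term
    num_docs    -- total number of documents
    k1          -- tf saturation: larger values let repeated terms keep adding
    b           -- length normalization strength: 0 = off, 1 = full
    """
    # Inverse document frequency; the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average get a factor > 1, which lowers the score.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The tf component saturates: it can never exceed (k1 + 1).
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
```

With b = 0 the document length drops out entirely, and with very long documents (such as whole books) the choice of b can dominate the ranking, which is why tuning it on your own data matters.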
Re: BM25 model for solr 4?
There is a good book: http://nlp.stanford.edu/IR-book/

See the chapter on Okapi BM25: http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html

15.11.2012 06:16, Floyd Wu wrote:
> Hi there,
> Can anybody kindly tell me how to set up Solr to use BM25?
> By the way, are there any experiments or research comparing BM25 and the
> classical VSM model in terms of recall/precision?
>
> Thanks in advance.
Re: BM25 model for solr 4?
See http://wiki.apache.org/solr/SchemaXml#Similarity and use class="solr.BM25SimilarityFactory".

The similarity factories have javadocs that document their parameters:
http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/search/similarities/package-summary.html

I don't know of any comparisons between the available choices. I'd love to see one.

~ David Smiley
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
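For readers following the wiki link above, a minimal sketch of what the schema.xml entry might look like. The class name is the one given in this thread; the k1 and b parameter names and the values shown (commonly cited BM25 starting points) are assumptions to verify against the factory javadocs for your Solr version.

```xml
<!-- schema.xml: a <similarity> declared at the top level of <schema>
     applies to all fields -->
<similarity class="solr.BM25SimilarityFactory">
  <!-- k1: term-frequency saturation (assumed starting value) -->
  <float name="k1">1.2</float>
  <!-- b: strength of document length normalization (assumed starting value) -->
  <float name="b">0.75</float>
</similarity>
```

Because the similarity can influence what is stored at index time, re-indexing after changing it is the cautious choice.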
BM25 model for solr 4?
Hi there,

Can anybody kindly tell me how to set up Solr to use BM25?
By the way, are there any experiments or research comparing BM25 and the classical VSM model in terms of recall/precision?

Thanks in advance.