
Re: BM25 model for solr 4?

2012-11-16 Thread Otis Gospodnetic

Hi Floyd,

I don't think there is a general answer to that question. You would have
to test it with your corpus/index and your queries. If you have that and
if you can have 2 indices, one using BM25 and the other using VSM or
anything else you want to compare, you would want to do some A/B testing
and compare various metrics that indicate which search is better. Have a
look at the picture on
http://blog.sematext.com/2012/01/06/relevance-tuning-and-competitive-advantage-via-search-analytics/
to see what I mean.
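
As a rough illustration of that kind of side-by-side comparison, here is a minimal sketch that scores two result lists per query against relevance judgments. Everything in it is an assumption made for illustration: the query ids, document ids, judgments, the choice of precision@k and mean reciprocal rank as metrics, and the helper names. Real A/B testing would more likely use click or conversion data from live traffic.

```python
from statistics import mean


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def summarize(runs, judgments, k=3):
    """Average both metrics over all judged queries for one index."""
    return {
        "p_at_k": mean(precision_at_k(runs[q], judgments[q], k) for q in judgments),
        "mrr": mean(reciprocal_rank(runs[q], judgments[q]) for q in judgments),
    }


# Hypothetical data: relevance judgments plus the ranked ids each index returned.
judgments = {"q1": {"d1", "d4"}, "q2": {"d7"}}
bm25_runs = {"q1": ["d1", "d4", "d9"], "q2": ["d3", "d7", "d8"]}
vsm_runs = {"q1": ["d9", "d1", "d2"], "q2": ["d7", "d3", "d8"]}

bm25_summary = summarize(bm25_runs, judgments)
vsm_summary = summarize(vsm_runs, judgments)
print("BM25:", bm25_summary)
print("VSM: ", vsm_summary)
```

With only two made-up queries the numbers mean nothing; the point is that both indices are measured on the same queries with the same metrics.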

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html


Re: BM25 model for solr 4?

2012-11-15 Thread Tom Burton-West

Hello Floyd,

There is a ton of research literature out there comparing BM25 to the vector
space model, but you have to be careful interpreting it.

BM25 originally beat the SMART vector space model in the early TRECs
because it did better tf and length normalization. Pivoted document
length normalization was invented to get the vector space model to catch up
to BM25. (Just Google for "Singhal length normalization". Amit Singhal,
now chief of Google Search, did his doctoral thesis on this and it is
available. Similarly, Stephen Robertson, now at Microsoft Research,
published a ton of studies of BM25.)

The default Solr/Lucene similarity class doesn't provide the length or tf
normalization tuning params that BM25 does. There is SweetSpotSimilarity,
but that doesn't quite work the same way that the BM25 normalizations do.
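
To make the two tuning parameters concrete, here is a minimal sketch of the textbook Okapi BM25 weight for a single query term. It is not Lucene's implementation: the function name, the smoothed IDF variant, the defaults k1=1.2 and b=0.75 (commonly cited starting values), and all the collection statistics are assumptions for illustration. k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly documents longer than average are penalized.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 weight of one query term in one document (illustrative only)."""
    # One common smoothed IDF variant; the +1 inside the log keeps it positive.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: equals 1.0 for an average-length document.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # tf saturation: the fraction approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Hypothetical collection: 1000 documents, the term occurs in 10 of them.
saturation = [bm25_term_score(tf, 100, 100, 10, 1000) for tf in (1, 10, 100)]

# Same term frequency in an average-length document and in one ten times longer.
short_doc = bm25_term_score(5, 100, 100, 10, 1000)
long_doc = bm25_term_score(5, 1000, 100, 10, 1000)
# With b=0 length normalization is switched off, so length no longer matters.
long_doc_no_norm = bm25_term_score(5, 1000, 100, 10, 1000, b=0.0)

print("tf = 1, 10, 100:", saturation)
print("short vs long document:", short_doc, long_doc)
print("long document with b=0:", long_doc_no_norm)
```

For a collection of very long documents, such as whole books, b is the parameter whose effect is worth checking first, since it decides how much the length penalty shown above applies.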

Document length normalization needs and parameter tuning all depend on
your data. So if you are reading a comparison, you need to determine:
1) When comparing recall/precision etc. between vector space and BM25, did
the experimenter tune both the vector space and the BM25 parameters?
2) Are the documents (and queries) used in the test similar in
length characteristics to your documents and queries?

We are planning to do some testing in the next few months for our use case,
which is 10 million books where we index the entire book. These are
extremely long documents compared to most IR research.
I'd love to hear about actual (non-research) production implementations
that have tested the new ranking models available in Solr.

Tom



On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu floyd...@gmail.com wrote:

 Hi there,
 Can anybody kindly tell me how to set up Solr to use BM25?
 By the way, is there any experiment or research that compares BM25 and the
 classical VSM model in terms of recall/precision?

 Thanks in advance.

Re: BM25 model for solr 4?

2012-11-15 Thread Floyd Wu

Thanks everyone, especially Tom, for the detailed explanation of this
topic.
Of course, in academia we need to interpret results carefully. What I care
about is the end user's point of view: will using BM25 result in better
ranking than Lucene's original VSM+Boolean model? How significant will the
difference be?
I'd like to hear about the community's experience.

Floyd



Re: BM25 model for solr 4?

2012-11-14 Thread David Smiley (@MITRE.org)

See http://wiki.apache.org/solr/SchemaXml#Similarity

<similarity class="solr.BM25SimilarityFactory"/>

The factories for these have javadocs that document the parameters:
http://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/search/similarities/package-summary.html

I don't know about comparisons between the choices available.  I'd love to
see one.

~ David Smiley



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--

Re: BM25 model for solr 4?

2012-11-14 Thread Сергей Бирюков


There is a good book: http://nlp.stanford.edu/IR-book/

See the chapter on Okapi BM25:
http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html




