Re: Include BM25 in Lucene?

2006-10-19 Thread Vic Bancroft

Chuck Williams wrote:


Vic Bancroft wrote on 10/17/2006 02:44 AM:
 


In some of my group's usage of Lucene over large document collections,
we have split the documents across several machines.  This has led to
a concern about whether the inverse document frequency was appropriate,
since the score seems to be dependent on the partitioning of documents
over indexing hosts.  We have not formulated an experiment to
determine whether it seriously affects our results, though it has been
discussed.
   

What version of Lucene are you using?  

The current systems are based on 1.9.1, though I suspect we should clean 
up the deprecation warnings and move to 2.0.0.



Are you using ParallelMultiSearcher to manage the distributed indexes or
have you implemented your own mechanism?

We had started with the ParallelMultiSearcher, but did not see 
appropriate scalability with high numbers of concurrent requests.  The 
bottleneck was on the reduce side, folding results back together.  The 
first cut mechanism we implemented allows for a configurable 
distribution of front end processors and is extremely efficient at the 
cost of (over) simplification.
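
As an illustration of the reduce step described above (folding per-host results back into one ranked list), here is a minimal sketch in Java. It is not the mechanism Vic's group built and not ParallelMultiSearcher's code; the class name TopKMergeSketch, the Hit type and the mergeTopK method are hypothetical names chosen for this example. It assumes each host returns hits whose scores are directly comparable, which only holds if every host scores with the same collection statistics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge per-host hit lists into a single top-k list.
final class TopKMergeSketch {

    // Minimal stand-in for a search hit (document id plus score).
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Returns the k highest-scoring hits across all hosts, best first.
    static List<Hit> mergeTopK(List<List<Hit>> perHostHits, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on score: the weakest retained hit sits at the head
        // and is the first to be evicted when a better hit arrives.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
        for (List<Hit> hostHits : perHostHits) {
            for (Hit hit : hostHits) {
                if (heap.size() < k) {
                    heap.add(hit);
                } else if (hit.score > heap.peek().score) {
                    heap.poll();
                    heap.add(hit);
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort((a, b) -> Float.compare(b.score, a.score));
        return merged;
    }
}
```

The heap never holds more than k hits, so the merge costs O(n log k) for n incoming hits. With many concurrent requests a single merging node still sees every host's results, which is consistent with the reduce side becoming the bottleneck.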


Perhaps it is time to investigate the hadoop path . . .

There was a bug a couple years ago, in the 1.4.3 version as I recall, where 
ParallelMultiSearcher was not computing df's appropriately, but that has been
fixed for a long time now.  The df's are the sum of the df's from each 
distributed index and thus are independent of the partitioning.
 

Interesting.  We randomly spray the documents across the leaf node 
indexers and rely on the tendency of large numbers of documents to smooth 
out the probability distributions.   Hence my interest in participating 
in an effort to implement and evaluate the impact of using a different 
method, such as BM25 or perhaps even some DFR approach [1].


more,
l8r,
v

--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie

[1] http://ir.dcs.gla.ac.uk/terrier/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Include BM25 in Lucene?

2006-10-17 Thread J.Zhu
Hi, All,

I am an enthusiastic user of Lucene and it is very helpful to my
projects at hand. As probabilistic models such as BM25 are very popular
among research communities now, do you have any plan to incorporate some
of them in a future Lucene release? I believe that will make Lucene even
more popular.

Jianhan




Re: Include BM25 in Lucene?

2006-10-17 Thread Grant Ingersoll

Hi Jianhan,

I am not aware, however, of anyone working on a BM25 implementation.   
We are a volunteer project, though, so we are always open to  
contributions!


-Grant


On Oct 17, 2006, at 5:50 AM, J.Zhu wrote:


Hi, All,

I am an enthusiastic user of Lucene and it is very helpful to my
projects at hand. As probabilistic models such as BM25 are very popular
among research communities now, do you have any plan to incorporate some
of them in a future Lucene release? I believe that will make Lucene even
more popular.

Jianhan




--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org







RE: Include BM25 in Lucene?

2006-10-17 Thread J.Zhu
Hi, Grant,

If I would like to contribute, what should I do? I am not a good Java
developer myself though. Can I work with someone also interested?

Jianhan 

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: 17 October 2006 11:56
To: java-dev@lucene.apache.org
Subject: Re: Include BM25 in Lucene?

Hi Jianhan,

I am not aware, however, of anyone working on a BM25 implementation.   
We are a volunteer project, though, so we are always open to
contributions!

-Grant


On Oct 17, 2006, at 5:50 AM, J.Zhu wrote:

 Hi, All,

 I am an enthusiastic user of Lucene and it is very helpful to my
 projects at hand. As probabilistic models such as BM25 are very
 popular among research communities now, do you have any plan to
 incorporate some of them in a future Lucene release? I believe that
 will make Lucene even more popular.

 Jianhan




Re: Include BM25 in Lucene?

2006-10-17 Thread Vic Bancroft

J.Zhu wrote:


If I would like to contribute, what should I do? I am not a good Java
developer myself though. Can I work with someone also interested?
 

In some of my group's usage of Lucene over large document collections, 
we have split the documents across several machines.  This has led to a 
concern about whether the inverse document frequency was appropriate, since 
the score seems to be dependent on the partitioning of documents over 
indexing hosts.  We have not formulated an experiment to determine whether it 
seriously affects our results, though it has been discussed.


If someone could elaborate on how BM25 or some DFR algorithm would differ 
from what (TF/IDF) is implemented in Lucene, I would be willing to help 
translate that into Java as an indexing/searching option . . .


more,
l8r,
v


--
The future is here. It's just not evenly distributed yet.
-- William Gibson, quoted by Whitfield Diffie





RE: Include BM25 in Lucene?

2006-10-17 Thread J.Zhu
Hi, Vic,

Unfortunately, BM25 uses IDF as well, so splitting documents across
machines will also affect it. How about storing the document frequencies
as global statistics that the searchers on these machines share?

The BM25 equation is stated clearly in Robertson's paper "Simple,
proven approaches to text retrieval"
(http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-356.pdf) as follows.

  CW(i,j) = [ CFW(i) * TF(i,j) * (K1 + 1) ]
            / [ K1 * ((1 - b) + b * NDL(j)) + TF(i,j) ]

where CFW(i) is the collection frequency weight of term i, TF(i,j) is
the frequency of term i in document j, NDL(j) is the normalized length
of document j, and K1 and b are tuning constants. The details are in
the paper.
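
To make the equation concrete, here is a minimal Java sketch of the combined weight. It is an illustration only, not Lucene code and not a proposed patch; the class and method names are hypothetical. Two definitions are assumptions taken from my reading of the paper rather than from this thread: CFW(i) = log N - log n(i), with N documents in the collection and n(i) of them containing term i, and NDL(j) = length of document j divided by the average document length. The constants K1 = 2.0 and b = 0.75 in main are illustrative values, not recommendations from this thread.

```java
// Hypothetical sketch of the BM25 combined weight CW(i,j) quoted above.
final class Bm25Sketch {

    // Assumed definition: CFW(i) = log N - log n(i).
    // numDocs = N, docFreq = n(i); both must be positive.
    public static double cfw(long numDocs, long docFreq) {
        return Math.log((double) numDocs) - Math.log((double) docFreq);
    }

    // Assumed definition: NDL(j) = DL(j) / average DL over the collection.
    public static double ndl(double docLength, double avgDocLength) {
        return docLength / avgDocLength;
    }

    // CW(i,j) = CFW(i) * TF(i,j) * (K1 + 1)
    //           / (K1 * ((1 - b) + b * NDL(j)) + TF(i,j))
    public static double combinedWeight(double cfw, double tf, double ndl,
                                        double k1, double b) {
        double numerator = cfw * tf * (k1 + 1.0);
        double denominator = k1 * ((1.0 - b) + b * ndl) + tf;
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double k1 = 2.0;   // illustrative tuning constant
        double b = 0.75;   // illustrative tuning constant
        double weight = combinedWeight(cfw(1000, 10), 3.0, ndl(120, 100), k1, b);
        System.out.println("CW = " + weight);
    }
}
```

Two properties are easy to check from the code: with b = 0 the document length drops out entirely, and for a document of average length (NDL = 1) a single occurrence of the term yields exactly CFW(i). Unlike Lucene's TF/IDF, the term frequency contribution saturates as TF grows, because TF appears in both numerator and denominator.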

Univ. of Amsterdam has provided a downloadable version of a language
modelling version of Lucene. Their language model is not BM25 but is
quite similar in nature. The version is at:
http://ilps.science.uva.nl/Resources/#lm-lucen

I have worked with their version a bit. They have created new classes,
including TermQueryLanguageModel, TermScorerLanguageModel,
IndexSearcherLanguageModel and LanguageModelIndexReader. I think their
work could serve as a basis.

Jianhan

-Original Message-
From: Vic Bancroft [mailto:[EMAIL PROTECTED] 
Sent: 17 October 2006 13:44
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Include BM25 in Lucene?

J.Zhu wrote:

If I would like to contribute, what should I do? I am not a good Java 
developer myself though. Can I work with someone also interested?
  

In some of my group's usage of Lucene over large document collections,
we have split the documents across several machines.  This has led to a
concern about whether the inverse document frequency was appropriate, since
the score seems to be dependent on the partitioning of documents over
indexing hosts.  We have not formulated an experiment to determine whether it
seriously affects our results, though it has been discussed.

If someone could elaborate on how BM25 or some DFR algorithm would differ
from what (TF/IDF) is implemented in Lucene, I would be willing to help
translate that into Java as an indexing/searching option . . .

more,
l8r,
v


--
The future is here. It's just not evenly distributed yet.
 -- William Gibson, quoted by Whitfield Diffie





Re: Include BM25 in Lucene?

2006-10-17 Thread Chuck Williams
Vic Bancroft wrote on 10/17/2006 02:44 AM:
 In some of my group's usage of Lucene over large document collections,
 we have split the documents across several machines.  This has led to
 a concern about whether the inverse document frequency was appropriate,
 since the score seems to be dependent on the partitioning of documents
 over indexing hosts.  We have not formulated an experiment to
 determine whether it seriously affects our results, though it has been
 discussed.

What version of Lucene are you using?  Are you using
ParallelMultiSearcher to manage the distributed indexes or have you
implemented your own mechanism?  There was a bug a couple years ago, in
the 1.4.3 version as I recall, where ParallelMultiSearcher was not
computing df's appropriately, but that has been fixed for a long time
now.  The df's are the sum of the df's from each distributed index and
thus are independent of the partitioning.
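
The point about summed df's can be illustrated with a small Java sketch. This is not ParallelMultiSearcher's code; the class GlobalDfSketch and its methods are hypothetical, and each index is represented simply as a map from term to document frequency. The idf formula is modeled on the one in Lucene's DefaultSimilarity, ln(numDocs / (docFreq + 1)) + 1, and should be read as an assumption of this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: global df is the sum of per-index df's.
final class GlobalDfSketch {

    // Adds up the document frequency of every term across all indexes.
    public static Map<String, Integer> mergeDocFreqs(List<Map<String, Integer>> perIndexDocFreqs) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> indexDocFreqs : perIndexDocFreqs) {
            for (Map.Entry<String, Integer> entry : indexDocFreqs.entrySet()) {
                global.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return global;
    }

    // idf modeled on DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }
}
```

Because addition does not care how the documents were grouped, any two partitionings of the same collection give the same merged df's and therefore the same idf. An idf computed from one index's local df and local document count, by contrast, does vary with the partitioning, which is the effect Vic was concerned about.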

Chuck

