Thanks Jun!

On Fri, May 29, 2009 at 2:49 PM, Jun Rao <jun...@almaden.ibm.com> wrote:
> Reply inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> jun...@almaden.ibm.com
>
>
> Tenaali Ram <tenaali...@gmail.com> wrote on 05/28/2009 03:18:53 PM:
>
> > Hi,
> >
> > I am trying to understand the code of the index package to build a
> > distributed Lucene index. I have some very basic questions and would
> > really appreciate it if someone could help me understand this code:
> >
> > 1) If I already have a Lucene index (divided into shards), should I
> > upload these shards into HDFS and provide their location, or will the
> > code pick these shards up from the local file system?
>
> Yes, you need to put the old index into HDFS first.
>
> >
> > 2) How does the code add a document to the Lucene index? I can see
> > there is an index selection policy. Assuming the round-robin policy is
> > chosen, how does the code add a document to the Lucene index? This is
> > related to the first question: is the index where the new document is
> > to be added in HDFS or in the local file system? I read in the README
> > that the index is first created on the local file system, then copied
> > back to HDFS. Can someone please point me to the code that is doing
> > this?
> >
>
> See contrib.index.example.
>
> > 3) After the map reduce job finishes, where are the final indexes? In
> > HDFS?
>
> They will be in HDFS.
>
> >
> > 4) Correct me if I am wrong: the code builds multiple indexes, where
> > each index is an instance of a Lucene index holding a disjoint subset
> > of documents from the corpus. So, if I have to search for a term, I
> > have to search each index and then merge the results. If this is
> > correct, then how is the IDF of a term, which is a global statistic,
> > computed and updated in each index? I mean each index can compute the
> > IDF with respect to the subset of documents that it has, but cannot
> > compute the global IDF of a term (since it knows nothing about other
> > indexes, which might have the same term in other documents).
> >
>
> This package only deals with index builds. The shards are disjoint and
> it's up to the index server to calculate the ranks. For distributed
> TF/IDF support, you may want to look into Katta.
>
> > Thanks,
> > -T
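
For anyone following the IDF question in point 4, here is a minimal sketch of how a search front end could combine per-shard statistics into a global IDF. It is not code from the index package or from Katta. The ShardStats class, the function names, and the sample counts are all hypothetical, and the smoothed IDF formula is just one common choice, not necessarily what a given index server uses. The only thing it relies on from the thread is that the shards are disjoint, so document counts and document frequencies can simply be summed.

```python
import math
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard statistics an index server could report."""
    num_docs: int   # number of documents in this shard
    doc_freq: dict  # term -> number of documents in this shard containing it


def idf(num_docs, doc_freq):
    # One common smoothed IDF form (an assumption, chosen for illustration).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(term, shards):
    # Shards hold disjoint document sets, so the counts add up directly.
    total_docs = sum(s.num_docs for s in shards)
    total_df = sum(s.doc_freq.get(term, 0) for s in shards)
    return idf(total_docs, total_df)


# Made-up numbers: the term is rare in one shard and common in the other.
shards = [
    ShardStats(num_docs=1000, doc_freq={"hadoop": 9}),
    ShardStats(num_docs=1000, doc_freq={"hadoop": 189}),
]

local = [idf(s.num_docs, s.doc_freq.get("hadoop", 0)) for s in shards]
merged = global_idf("hadoop", shards)

print("per-shard IDF:", local)
print("global IDF:   ", merged)
```

Each shard on its own gives a different IDF for the same term, which is why scores from separate shards are not directly comparable until the statistics are merged this way (or by something equivalent on the index server side).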