Thanks Jun!

On Fri, May 29, 2009 at 2:49 PM, Jun Rao <jun...@almaden.ibm.com> wrote:

> Reply inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> jun...@almaden.ibm.com
>
>
> Tenaali Ram <tenaali...@gmail.com> wrote on 05/28/2009 03:18:53 PM:
>
> > Hi,
> >
> > I am trying to understand the code of index package to build a distributed
> > Lucene index. I have some very basic questions and would really appreciate
> > if someone can help me understand this code-
> >
> > 1) If I already have Lucene index (divided into shards), should I upload
> > these indexes into HDFS and provide its location or the code will pick these
> > shards from local file system ?
>
> Yes, you need to put the old index into HDFS first.
>
> >
> > 2) How is the code adding a document in the lucene index, I can see there is
> > a index selection policy. Assuming round robin policy is chosen, how is the
> > code adding a document in the lucene index? This is related to first
> > question - is the index where the new document is to be added in HDFS or in
> > local file system. I read in the README that the index is first created on
> > local file system, then copied back to HDFS. Can someone please point me to
> > the code that is doing this.
> >
>
> See contrib.index.example.
>
> > 3) After the map reduce job finishes, where are the final indexes ? In HDFS ?
>
> They will be in HDFS.
>
> >
> > 4) Correct me if I am wrong- the code builds multiple indexes, where each
> > index is an instance of Lucene Index having a disjoint subset of documents
> > from the corpus. So, if I have to search a term, I have to search each index
> > and then merge the result. If this is correct, then how is the IDF of a term
> > which is a global statistic computed and updated in each index ? I mean each
> > index can compute the IDF wrt. to the subset of documents that it has, but
> > can not compute the global IDF of a term (since it knows nothing about other
> > indexes, which might have the same term in other documents).
> >
>
> This package only deals with index builds. The shards are disjoint and it's
> up to the index server to calculate the ranks. For distributed TF/IDF
> support, you may want to look into Katta.
>
> > Thanks,
> > -T
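
For reference, here is a minimal sketch of the global IDF point from question 4. It is not code from the index package or from Katta. It only illustrates that, because the shards hold disjoint documents, per-shard document counts and document frequencies can be summed before the IDF is computed. The function name, the shape of the input, and the smoothed formula 1 + ln(N / (df + 1)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def global_idf(shard_stats):
    """Combine per-shard statistics into one global IDF table.

    shard_stats is a hypothetical list of (num_docs, {term: doc_freq})
    tuples, one tuple per shard. Shards are assumed to be disjoint.
    """
    # Total corpus size is the sum of the shard sizes.
    total_docs = sum(num_docs for num_docs, _ in shard_stats)

    # Disjoint shards mean document frequencies simply add up.
    doc_freq = Counter()
    for _, shard_df in shard_stats:
        doc_freq.update(shard_df)

    # Assumed smoothed IDF: 1 + ln(N / (df + 1)).
    return {
        term: 1.0 + math.log(total_docs / (df + 1))
        for term, df in doc_freq.items()
    }


# Two made-up shards: "hadoop" is common in the first and rare in the second.
shards = [
    (100, {"hadoop": 10, "lucene": 50}),
    (300, {"hadoop": 5}),
]
idf = global_idf(shards)
```

A search front end would have to collect these statistics from every shard before scoring, which is the part the index build package leaves to the index server.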
