Question: index package in contrib (lucene index)

2009-05-29 Thread Tenaali Ram
Anyone?
Any help understanding this package would be appreciated.

Thanks,
T

Re: Question: index package in contrib (lucene index)

2009-05-29 Thread Jun Rao
Reply inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

jun...@almaden.ibm.com


Tenaali Ram tenaali...@gmail.com wrote on 05/28/2009 03:18:53 PM:

 Hi,

 I am trying to understand the code of the index package in order to build a
 distributed Lucene index. I have some very basic questions and would really
 appreciate it if someone could help me understand this code:

 1) If I already have a Lucene index (divided into shards), should I upload
 these shards to HDFS and provide their location, or will the code pick them
 up from the local file system?

Yes, you need to put the old index into HDFS first.


 2) How does the code add a document to the Lucene index? I can see there is
 an index selection policy. Assuming the round-robin policy is chosen, how
 does the code add a document to the index? This is related to the first
 question: is the index to which the new document is added in HDFS or on the
 local file system? I read in the README that the index is first created on
 the local file system and then copied back to HDFS. Can someone please point
 me to the code that does this?


See contrib.index.example.
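
For readers unfamiliar with the idea, a round-robin policy simply cycles
through the shards as documents arrive. The following is a minimal Python
sketch of that idea only. It is not the contrib/index implementation, and the
function name assign_round_robin and the shard names are made up for
illustration:

```python
from itertools import cycle


def assign_round_robin(doc_ids, shard_names):
    """Assign each document to a shard by cycling through the shards in order.

    Hypothetical helper for illustration; not part of contrib/index.
    """
    if not shard_names:
        raise ValueError("need at least one shard")
    assignment = {name: [] for name in shard_names}
    # zip stops when doc_ids is exhausted; cycle repeats the shard list forever.
    for doc_id, shard in zip(doc_ids, cycle(shard_names)):
        assignment[shard].append(doc_id)
    return assignment


if __name__ == "__main__":
    docs = ["d0", "d1", "d2", "d3", "d4"]
    print(assign_round_robin(docs, ["shard-0", "shard-1"]))
    # prints {'shard-0': ['d0', 'd2', 'd4'], 'shard-1': ['d1', 'd3']}
```

The point of such a policy is that every shard ends up with roughly the same
number of documents, regardless of their content.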

 3) After the MapReduce job finishes, where are the final indexes? In HDFS?

They will be in HDFS.


 4) Correct me if I am wrong: the code builds multiple indexes, where each
 index is a Lucene index instance holding a disjoint subset of the documents
 in the corpus. So, to search for a term, I have to search each index and
 then merge the results. If this is correct, how is the IDF of a term, which
 is a global statistic, computed and updated in each index? Each index can
 compute the IDF with respect to the subset of documents that it has, but it
 cannot compute the global IDF of a term (since it knows nothing about the
 other indexes, which might contain the same term in other documents).


This package only deals with index builds. The shards are disjoint and it's
up to the index server to calculate the ranks. For distributed TF/IDF
support, you may want to look into Katta.
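
To make the point about global statistics concrete: because the shards are
disjoint, a serving layer could in principle add up the per-shard document
counts and document frequencies and compute one IDF from the totals. Below is
a minimal Python sketch, assuming each shard can report those two numbers. The
function global_idf, the shard_stats layout, and the smoothed formula
1 + ln(N / (df + 1)) are illustrative choices, not something this package or
Katta is confirmed to provide:

```python
import math


def global_idf(term, shard_stats):
    """Compute one IDF for `term` from per-shard statistics.

    shard_stats is a list of (num_docs, {term: doc_freq}) pairs, one per
    shard. This layout is an assumption made for the sketch.
    """
    total_docs = sum(num_docs for num_docs, _ in shard_stats)
    if total_docs == 0:
        raise ValueError("no documents in any shard")
    # Shards are disjoint, so document frequencies can simply be summed.
    total_df = sum(doc_freqs.get(term, 0) for _, doc_freqs in shard_stats)
    # Smoothed IDF; the +1 avoids division by zero for unseen terms.
    return 1.0 + math.log(total_docs / (total_df + 1))


if __name__ == "__main__":
    stats = [(100, {"hadoop": 1}), (100, {"hadoop": 49})]
    for shard in stats:
        print("local IDF: ", global_idf("hadoop", [shard]))
    print("global IDF:", global_idf("hadoop", stats))
```

With these made-up numbers the two local values come out at roughly 4.91 and
1.69, while the global value is roughly 2.37, which is why scores computed
independently per shard are not directly comparable when results are merged.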

 Thanks,
 -T

Re: Question: index package in contrib (lucene index)

2009-05-29 Thread Tenaali Ram
Thanks Jun!


Question: index package in contrib (lucene index)

2009-05-28 Thread Tenaali Ram
Hi,

I am trying to understand the code of the index package in order to build a
distributed Lucene index. I have some very basic questions and would really
appreciate it if someone could help me understand this code:

1) If I already have a Lucene index (divided into shards), should I upload
these shards to HDFS and provide their location, or will the code pick them
up from the local file system?

2) How does the code add a document to the Lucene index? I can see there is
an index selection policy. Assuming the round-robin policy is chosen, how
does the code add a document to the index? This is related to the first
question: is the index to which the new document is added in HDFS or on the
local file system? I read in the README that the index is first created on
the local file system and then copied back to HDFS. Can someone please point
me to the code that does this?

3) After the MapReduce job finishes, where are the final indexes? In HDFS?

4) Correct me if I am wrong: the code builds multiple indexes, where each
index is a Lucene index instance holding a disjoint subset of the documents
in the corpus. So, to search for a term, I have to search each index and
then merge the results. If this is correct, how is the IDF of a term, which
is a global statistic, computed and updated in each index? Each index can
compute the IDF with respect to the subset of documents that it has, but it
cannot compute the global IDF of a term (since it knows nothing about the
other indexes, which might contain the same term in other documents).

Thanks,
-T