Question: index package in contrib (lucene index)
Anyone? Any help in understanding this package would be appreciated.

Thanks,
T

On Thu, May 28, 2009 at 3:18 PM, Tenaali Ram tenaali...@gmail.com wrote:

> Hi,
>
> I am trying to understand the code of the index package in order to build a distributed Lucene index. I have some very basic questions and would really appreciate it if someone could help me understand this code:
>
> 1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS and provide their location, or will the code pick up the shards from the local file system?
>
> 2) How does the code add a document to the Lucene index? I can see there is an index selection policy. Assuming the round-robin policy is chosen, how does the code add a document to the index? This is related to the first question: is the index to which the new document is added in HDFS or on the local file system? I read in the README that the index is first created on the local file system and then copied back to HDFS. Can someone please point me to the code that does this?
>
> 3) After the map-reduce job finishes, where are the final indexes? In HDFS?
>
> 4) Correct me if I am wrong: the code builds multiple indexes, where each index is a Lucene index holding a disjoint subset of the documents in the corpus. So if I have to search for a term, I have to search every index and then merge the results. If that is correct, how is the IDF of a term, which is a global statistic, computed and updated in each index? Each index can compute the IDF with respect to the subset of documents it holds, but it cannot compute the global IDF of a term, since it knows nothing about the other indexes, which might contain the same term in other documents.
>
> Thanks,
> -T
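The round-robin selection asked about in question 2 can be sketched in a few lines. This is a minimal stand-in, not the actual contrib/index API; the class and method names below are invented for illustration. The idea is simply that each incoming document is routed to the next shard in turn, so the disjoint shards grow at roughly the same rate.

```java
import java.util.List;

// Hypothetical sketch of a round-robin shard selection policy (names are
// made up, not the real contrib/index classes). Each call routes the next
// document to the next shard in rotation.
public class RoundRobinShardPolicy {
    private final List<String> shardNames; // e.g. the shard index paths
    private int next = 0;

    public RoundRobinShardPolicy(List<String> shardNames) {
        this.shardNames = shardNames;
    }

    // Returns the shard that should receive the next inserted document.
    public synchronized String chooseShardForInsert() {
        String shard = shardNames.get(next);
        next = (next + 1) % shardNames.size();
        return shard;
    }
}
```

With two shards, successive inserts alternate shard-0, shard-1, shard-0, and so on.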
Re: Question: index package in contrib (lucene index)
Replies inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA 95120-6099
jun...@almaden.ibm.com

Tenaali Ram tenaali...@gmail.com wrote on 05/28/2009 03:18:53 PM:

> 1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS and provide their location, or will the code pick up the shards from the local file system?

Yes, you need to put the old index into HDFS first.

> 2) How does the code add a document to the Lucene index? I can see there is an index selection policy. Assuming the round-robin policy is chosen, how does the code add a document to the index? This is related to the first question: is the index to which the new document is added in HDFS or on the local file system? I read in the README that the index is first created on the local file system and then copied back to HDFS. Can someone please point me to the code that does this?

See contrib.index.example.

> 3) After the map-reduce job finishes, where are the final indexes? In HDFS?

They will be in HDFS.

> 4) Correct me if I am wrong: the code builds multiple indexes, where each index is a Lucene index holding a disjoint subset of the documents in the corpus. So if I have to search for a term, I have to search every index and then merge the results. If that is correct, how is the IDF of a term, which is a global statistic, computed and updated in each index?

This package only deals with index builds. The shards are disjoint, and it is up to the index server to calculate the ranks. For distributed TF/IDF support, you may want to look into Katta.
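The "build locally, then copy back to HDFS" pattern that the README describes can be illustrated as follows. This is a sketch, not the actual contrib/index code: the real package writes each shard with Lucene's IndexWriter on the task's local disk and then copies it into HDFS via Hadoop's FileSystem API (e.g. copyFromLocalFile); here plain java.nio stands in for both the index build and the distributed file system so the sketch runs anywhere, and the class, method, and file names are invented.

```java
import java.io.IOException;
import java.nio.file.*;

// Illustration of the build-locally-then-publish pattern. Step 1 fakes the
// local index build (in reality Lucene writes segment files here); step 2
// copies the finished shard into the shared location (a stand-in for the
// copy into HDFS).
public class LocalBuildThenPublish {
    public static Path buildAndPublish(Path localWorkDir, Path dfsShardDir)
            throws IOException {
        // 1) "Build" the shard on local disk.
        Files.createDirectories(localWorkDir);
        Path segment = localWorkDir.resolve("segments_1");
        Files.writeString(segment, "fake segment data");

        // 2) Copy the finished shard into the shared (distributed) location.
        Files.createDirectories(dfsShardDir);
        Path published = dfsShardDir.resolve(segment.getFileName());
        Files.copy(segment, published, StandardCopyOption.REPLACE_EXISTING);
        return published;
    }
}
```

Building on local disk and publishing only the finished shard avoids doing many small random writes against the distributed file system, which HDFS handles poorly.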
Re: Question: index package in contrib (lucene index)
Thanks Jun!

On Fri, May 29, 2009 at 2:49 PM, Jun Rao jun...@almaden.ibm.com wrote:

> Replies inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
> jun...@almaden.ibm.com
>
> Tenaali Ram tenaali...@gmail.com wrote on 05/28/2009 03:18:53 PM:
>
> > 1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS and provide their location, or will the code pick up the shards from the local file system?
>
> Yes, you need to put the old index into HDFS first.
>
> > 2) How does the code add a document to the Lucene index? I read in the README that the index is first created on the local file system and then copied back to HDFS. Can someone please point me to the code that does this?
>
> See contrib.index.example.
>
> > 3) After the map-reduce job finishes, where are the final indexes? In HDFS?
>
> They will be in HDFS.
>
> > 4) Correct me if I am wrong: the code builds multiple indexes, where each index is a Lucene index holding a disjoint subset of the documents in the corpus. If that is correct, how is the IDF of a term, which is a global statistic, computed and updated in each index?
>
> This package only deals with index builds. The shards are disjoint, and it is up to the index server to calculate the ranks. For distributed TF/IDF support, you may want to look into Katta.
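Since the shards are disjoint and ranking is left to the index server, a search tier could compute a global IDF by summing the per-shard statistics. The sketch below is not part of contrib/index; the merging step is assumed to live in whatever serves the shards. It applies classic Lucene's idf formula, idf = 1 + ln(N / (df + 1)), to the global sums of each shard's document frequency df_i and document count N_i.

```java
// Sketch: combine per-shard statistics into a global IDF. Each shard reports
// how many of its documents contain the term (docFreq) and how many documents
// it holds in total (docCount); because the shards are disjoint, the global
// statistics are just the sums.
public class GlobalIdf {
    public static double idf(long[] shardDocFreqs, long[] shardDocCounts) {
        long df = 0, n = 0;
        for (long d : shardDocFreqs) df += d;   // total docs containing the term
        for (long c : shardDocCounts) n += c;   // total docs across all shards
        return 1.0 + Math.log((double) n / (df + 1)); // classic Lucene idf
    }
}
```

In practice a search tier gathers these counts from every shard before scoring, which is exactly the kind of global bookkeeping Katta-style systems add on top of disjoint shards.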
Question: index package in contrib (lucene index)
Hi,

I am trying to understand the code of the index package in order to build a distributed Lucene index. I have some very basic questions and would really appreciate it if someone could help me understand this code:

1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS and provide their location, or will the code pick up the shards from the local file system?

2) How does the code add a document to the Lucene index? I can see there is an index selection policy. Assuming the round-robin policy is chosen, how does the code add a document to the index? This is related to the first question: is the index to which the new document is added in HDFS or on the local file system? I read in the README that the index is first created on the local file system and then copied back to HDFS. Can someone please point me to the code that does this?

3) After the map-reduce job finishes, where are the final indexes? In HDFS?

4) Correct me if I am wrong: the code builds multiple indexes, where each index is a Lucene index holding a disjoint subset of the documents in the corpus. So if I have to search for a term, I have to search every index and then merge the results. If that is correct, how is the IDF of a term, which is a global statistic, computed and updated in each index? Each index can compute the IDF with respect to the subset of documents it holds, but it cannot compute the global IDF of a term, since it knows nothing about the other indexes, which might contain the same term in other documents.

Thanks,
-T