Hi, I am trying to understand the code of the index package in order to build a distributed Lucene index. I have some very basic questions and would really appreciate it if someone could help me understand this code:
1) If I already have a Lucene index (divided into shards), should I upload these shards to HDFS and provide their location, or will the code pick them up from the local file system?

2) How does the code add a document to the Lucene index? I can see there is an index selection policy. Assuming the round-robin policy is chosen, how is a document added to the index? This is related to the first question: is the index that receives the new document in HDFS or on the local file system? I read in the README that the index is first created on the local file system and then copied back to HDFS. Can someone please point me to the code that does this?

3) After the map/reduce job finishes, where are the final indexes? In HDFS?

4) Correct me if I am wrong: the code builds multiple indexes, where each index is a separate Lucene index holding a disjoint subset of the documents in the corpus. So, to search for a term, I have to search each index and then merge the results. If this is correct, how is the IDF of a term, which is a global statistic, computed and updated in each index? Each index can compute the IDF with respect to the subset of documents it holds, but it cannot compute the global IDF of a term, since it knows nothing about the other indexes, which might contain the same term in other documents.

Thanks,
-T
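
P.S. To make question 4 concrete, here is a small sketch of what I mean. It does not use the Lucene API or anything from the index package; the class name, the method names, and the per-shard counts are all made up for illustration, and the textbook formula idf = log(N / df) stands in for whatever formula Lucene actually uses. Each shard on its own can only evaluate the formula with its local counts, while the global value needs the document frequencies and document counts summed over all shards.

```java
public class ShardIdfSketch {

    // Textbook IDF, log(N / df). This is an assumption for illustration,
    // not necessarily the exact formula Lucene applies when scoring.
    static double idf(long docFreq, long numDocs) {
        if (docFreq <= 0 || numDocs <= 0) {
            return 0.0; // term absent or empty index: no meaningful IDF
        }
        return Math.log((double) numDocs / docFreq);
    }

    // Global IDF: sum the per-shard document frequencies and document
    // counts first, then apply the same formula once.
    static double globalIdf(long[] shardDocFreq, long[] shardNumDocs) {
        long totalDocFreq = 0;
        long totalNumDocs = 0;
        for (int i = 0; i < shardDocFreq.length; i++) {
            totalDocFreq += shardDocFreq[i];
            totalNumDocs += shardNumDocs[i];
        }
        return idf(totalDocFreq, totalNumDocs);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one term in two shards of 100 documents each.
        long[] docFreq = {1, 40};
        long[] numDocs = {100, 100};

        for (int i = 0; i < docFreq.length; i++) {
            System.out.println("shard " + i + " local idf = " + idf(docFreq[i], numDocs[i]));
        }
        System.out.println("global idf = " + globalIdf(docFreq, numDocs));
    }
}
```

With counts like these, the term looks rare in the first shard and common in the second, so scores computed with local IDF would not be directly comparable when the per-shard results are merged. That is the situation I am asking about.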