Re: Distributed Lucene - from hadoop contrib
On 8/12/08, Deepika Khera [EMAIL PROTECTED] wrote:
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index documents ii) providing search in a distributed fashion, to be all in one box.

Ideally, yes. However, while it's good to use map/reduce when batch-building an index, there is no consensus on whether it is a good idea to serve an index from HDFS. This is because of the poor performance of random reads in HDFS.

On 8/14/08, Anoop Bhatti [EMAIL PROTECTED] wrote:
> I'd like to know if I'm heading down the right path, so my questions are:
> * Has anyone tried searching a distributed Lucene index using a method like this before? It seems too easy. Are there any gotchas that I should look out for as I scale up to more nodes and a larger index?
> * Do you think that going ahead with this approach, which consists of 1) creating a Lucene index using the hadoop.contrib.index code (thanks, Ning!) and 2) leaving that index in-place on HDFS and searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance concern described above. More importantly, as your index scales out, there will be multiple shards, and with them come the challenges of load balancing, fault tolerance, etc.

> * What is the status of the Bailey project? It seems to be working on the same type of problem. Should I wait until that project comes out with code?

There is no timeline for Bailey right now.

Ning
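To make the multi-shard concern concrete, here is a minimal sketch of scatter-gather search with replica failover. It is an illustration only and is not code from hadoop.contrib.index, Katta, Bailey, or Lucene: the names `Replica`, `search_shard`, `search_all`, and `make_replica` are hypothetical, each replica is modeled as a plain Python callable, and scores are assumed to be comparable across shards (which a real deployment would have to arrange, since each shard computes scores from its own term statistics).

```python
import heapq
from typing import Callable, Dict, List, Tuple

# A hit is (score, doc_id). A replica is any callable that takes a query and k
# and returns that shard's local top-k hits. Both are hypothetical stand-ins
# for a real searcher process.
Hit = Tuple[float, str]
Replica = Callable[[str, int], List[Hit]]


def search_shard(replicas: List[Replica], query: str, k: int) -> List[Hit]:
    """Try each replica of one shard in order; return the first answer."""
    last_error = None
    for replica in replicas:
        try:
            return replica(query, k)
        except Exception as err:  # e.g. a dead or overloaded node
            last_error = err
    raise RuntimeError("all replicas of a shard failed") from last_error


def search_all(shards: Dict[str, List[Replica]], query: str, k: int) -> List[Hit]:
    """Query every shard, then merge the per-shard lists into a global top-k."""
    per_shard = [search_shard(replicas, query, k) for replicas in shards.values()]
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(k, all_hits)


def make_replica(hits: List[Hit], healthy: bool = True) -> Replica:
    """Build a fake replica that serves a fixed hit list (for illustration)."""
    def replica(query: str, k: int) -> List[Hit]:
        if not healthy:
            raise ConnectionError("node down")
        return heapq.nlargest(k, hits)
    return replica


shards = {
    # The first replica of shard-0 is down, so the second one answers.
    "shard-0": [make_replica([], healthy=False),
                make_replica([(0.9, "a"), (0.2, "b")])],
    "shard-1": [make_replica([(0.7, "c"), (0.4, "d")])],
}
print(search_all(shards, "hadoop", 3))
```

Each shard only needs to return its own top k, because no document outside a shard's local top k can appear in the global top k. Load balancing is not shown; in this sketch it would amount to varying the order in which replicas are tried from query to query.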
RE: Distributed Lucene - from hadoop contrib
Thank you for your response.

I was imagining the 2 concepts of i) using hadoop.contrib.index to index documents ii) providing search in a distributed fashion, to be all in one box.

So basically, hadoop.contrib.index is used to create Lucene indexes in a distributed fashion (by creating shards, each shard being a Lucene instance). And then I can use Katta or any other Distributed Lucene application to serve Lucene indexes distributed over many servers.

Deepika

-----Original Message-----
From: Ning Li [mailto:[EMAIL PROTECTED]
Sent: Friday, August 08, 2008 7:08 AM
To: core-user@hadoop.apache.org
Subject: Re: Distributed Lucene - from hadoop contrib

> 1) Katta and Distributed Lucene are different projects though, right? Both being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different last time I checked. I pointed out the Katta project because you can find the code for Distributed Lucene there.

> 2) So, I should be able to use the hadoop.contrib.index with HDFS. Though, it would be much better if it is integrated with Distributed Lucene or the Katta project as these are designed keeping the structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce to build Lucene instances. It does not contain a component that serves queries. If that's not sufficient for you, you can check out the designs of Katta and Distributed Index and see which one suits your use better.

Ning
Re: Distributed Lucene - from hadoop contrib
> 1) Katta and Distributed Lucene are different projects though, right? Both being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different last time I checked. I pointed out the Katta project because you can find the code for Distributed Lucene there.

> 2) So, I should be able to use the hadoop.contrib.index with HDFS. Though, it would be much better if it is integrated with Distributed Lucene or the Katta project as these are designed keeping the structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce to build Lucene instances. It does not contain a component that serves queries. If that's not sufficient for you, you can check out the designs of Katta and Distributed Index and see which one suits your use better.

Ning
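As a rough illustration of what "uses map/reduce to build Lucene instances" can mean, the sketch below runs a toy map, shuffle, and reduce in a single process. Everything in it is an assumption made for illustration: the names `map_doc`, `reduce_shard`, `build_shards`, and `NUM_SHARDS` are hypothetical, routing by a hash of the document id is just one plausible policy, and a term-to-document-ids dictionary stands in for a real Lucene instance. The README in the package remains the authority on how hadoop.contrib.index actually divides the work.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

NUM_SHARDS = 2  # hypothetical; this sketch uses one reducer per shard


def map_doc(doc_id: str, text: str) -> Tuple[int, Tuple[str, str]]:
    """Map step: route each document to a shard by hashing its id."""
    shard = zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS
    return shard, (doc_id, text)


def reduce_shard(docs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Reduce step: build one toy inverted index (term -> doc ids) for a shard."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def build_shards(corpus: Dict[str, str]) -> Dict[int, Dict[str, Set[str]]]:
    """Drive map, shuffle (group records by shard), and reduce in-process."""
    grouped: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for doc_id, text in corpus.items():
        shard, record = map_doc(doc_id, text)
        grouped[shard].append(record)
    return {shard: reduce_shard(docs) for shard, docs in grouped.items()}


corpus = {"d1": "hadoop lucene", "d2": "lucene index", "d3": "hadoop index"}
for shard, index in sorted(build_shards(corpus).items()):
    print(shard, {term: sorted(ids) for term, ids in sorted(index.items())})
```

Note that the output of this step is only a set of independent shard indexes. Nothing here answers queries, which matches the point above that a separate serving component is needed.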
RE: Distributed Lucene - from hadoop contrib
Hey guys,

I would appreciate any feedback on this.

Deepika

-----Original Message-----
From: Deepika Khera [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 06, 2008 5:39 PM
To: core-user@hadoop.apache.org
Subject: Distributed Lucene - from hadoop contrib

Hi,

I am planning to use distributed Lucene from hadoop.contrib.index for indexing. Has anyone used this or tested it? Any issues or comments?

I see that the design described is different from HDFS (Namenode is stateless, stores no information regarding blocks for files, etc.). Does anyone know how hard it will be to set up this kind of system, or is there something that can be reused?

A reference link - http://wiki.apache.org/hadoop/DistributedLucene

Thanks,
Deepika
Re: Distributed Lucene - from hadoop contrib
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index are two different things. For information on hadoop.contrib.index, see the README file in the package. I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene at http://katta.wiki.sourceforge.net/.

Ning

On 8/7/08, Deepika Khera [EMAIL PROTECTED] wrote:
> Hey guys,
>
> I would appreciate any feedback on this.
>
> Deepika
>
> -----Original Message-----
> From: Deepika Khera [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 06, 2008 5:39 PM
> To: core-user@hadoop.apache.org
> Subject: Distributed Lucene - from hadoop contrib
>
> Hi,
>
> I am planning to use distributed Lucene from hadoop.contrib.index for indexing. Has anyone used this or tested it? Any issues or comments?
>
> I see that the design described is different from HDFS (Namenode is stateless, stores no information regarding blocks for files, etc.). Does anyone know how hard it will be to set up this kind of system, or is there something that can be reused?
>
> A reference link - http://wiki.apache.org/hadoop/DistributedLucene
>
> Thanks,
> Deepika