Re: Distributed Lucene - from hadoop contrib

2008-08-18 Thread Ning Li
On 8/12/08, Deepika Khera [EMAIL PROTECTED] wrote:
 I was imagining the 2 concepts of i) using hadoop.contrib.index to index
 documents ii) providing search in a distributed fashion, to be all in
 one box.

Ideally, yes. However, while map/reduce is a good fit for batch-building
an index, there is no consensus on whether it is a good idea to serve an
index directly from HDFS, because random reads in HDFS perform poorly.


On 8/14/08, Anoop Bhatti [EMAIL PROTECTED] wrote:
 I'd like to know if I'm heading down the right path, so my questions are:
 * Has anyone tried searching a distributed Lucene index using a method
 like this before?  It seems too easy.  Are there any gotchas that I
 should look out for as I scale up to more nodes and a larger index?
 * Do you think that going ahead with this approach, which consists of
 1) creating a Lucene index using the  hadoop.contrib.index code
 (thanks, Ning!) and 2) leaving that index in-place on hdfs and
 searching over it using the client code below, is a good approach?

Yes, the code works on a single index shard. There is the performance
concern described above. More importantly, as your index scales out,
there will be multiple shards, and with them the challenges of load
balancing, fault tolerance, etc.
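
To make the multi-shard concern concrete, here is a minimal sketch of
what a search client would have to do once there is more than one shard:
send the query to every shard, fail over between replicas of a shard,
and merge the per-shard results. Everything in it is hypothetical and
for illustration only (the Replica callables, the shard names, the toy
scores); it does not use the Lucene or Hadoop APIs, and it queries the
shards one after another rather than in parallel.

```python
import heapq
import random
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]                    # (score, doc_id)
Replica = Callable[[str, int], List[Hit]]  # search(query, k) -> top hits


def search_shard(replicas: List[Replica], query: str, k: int) -> List[Hit]:
    """Try the replicas of one shard in random order, failing over on error."""
    order = list(replicas)
    random.shuffle(order)  # crude load balancing across replicas
    for replica in order:
        try:
            return replica(query, k)
        except ConnectionError:
            continue  # this replica is down, try the next one
    raise RuntimeError("no live replica for shard")


def search_all(shards: Dict[str, List[Replica]], query: str, k: int) -> List[Hit]:
    """Send the query to every shard, then merge the per-shard top-k lists."""
    partial: List[Hit] = []
    for replicas in shards.values():
        partial.extend(search_shard(replicas, query, k))
    return heapq.nlargest(k, partial)


def make_replica(hits: List[Hit]) -> Replica:
    """Toy replica that ignores the query and returns its best canned hits."""
    def search(query: str, k: int) -> List[Hit]:
        return heapq.nlargest(k, hits)
    return search


def dead_replica(query: str, k: int) -> List[Hit]:
    """Toy replica that simulates an unreachable node."""
    raise ConnectionError("node unreachable")


shards = {
    "shard-0": [dead_replica, make_replica([(0.9, "doc-a"), (0.2, "doc-b")])],
    "shard-1": [make_replica([(0.7, "doc-c"), (0.4, "doc-d")])],
}
top = search_all(shards, "lucene", 2)
```

Even in this toy form the client needs to know which replicas exist for
each shard and what to do when all of them are down, which is the kind
of bookkeeping a dedicated serving layer would have to handle.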


 * What is the status of the bailey project?  It seems to be working on
 the same type of problem. Should I wait until that project comes out
 with code?

There is no timeline for Bailey right now.

Ning


RE: Distributed Lucene - from hadoop contrib

2008-08-12 Thread Deepika Khera
Thank you for your response. 

I was imagining the 2 concepts of i) using hadoop.contrib.index to index
documents ii) providing search in a distributed fashion, to be all in
one box. 

So basically, hadoop.contrib.index is used to create Lucene indexes in
a distributed fashion (by creating shards, each shard being a Lucene
instance). And then I can use Katta or any other Distributed Lucene
application to serve the Lucene indexes distributed over many servers.
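
The build half of that split can be sketched roughly as follows. This
is an illustration under stated assumptions, not the actual
hadoop.contrib.index code: hash partitioning by document id is assumed
here (the thread does not say how documents are assigned to shards), a
plain in-memory inverted index stands in for a Lucene instance, and the
names shard_for, map_phase and reduce_phase are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

Doc = Tuple[str, str]              # (doc_id, text)
ShardIndex = Dict[str, Set[str]]   # term -> ids of documents containing it


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable hash, so the same document always lands in the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def map_phase(docs: Dict[str, str], num_shards: int) -> Iterator[Tuple[int, Doc]]:
    """Map step: emit (shard number, document) pairs."""
    for doc_id, text in docs.items():
        yield shard_for(doc_id, num_shards), (doc_id, text)


def reduce_phase(pairs: Iterable[Tuple[int, Doc]]) -> Dict[int, ShardIndex]:
    """Reduce step: build one inverted index per shard."""
    shards: Dict[int, ShardIndex] = defaultdict(lambda: defaultdict(set))
    for shard, (doc_id, text) in pairs:
        for term in text.lower().split():
            shards[shard][term].add(doc_id)
    return shards


docs = {
    "doc-1": "Distributed Lucene on Hadoop",
    "doc-2": "Katta serves Lucene shards",
    "doc-3": "HDFS random reads",
}
shards = reduce_phase(map_phase(docs, 2))
```

The output is a set of independent per-shard indexes and nothing that
answers queries, which is why a separate serving layer is still needed.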

Deepika 


-Original Message-
From: Ning Li [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 7:08 AM
To: core-user@hadoop.apache.org
Subject: Re: Distributed Lucene - from hadoop contrib

 1) Katta and Distributed Lucene are different projects though, right?
 Both being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

 2) So, I should be able to use the hadoop.contrib.index with HDFS.
 Though, it would be much better if it is integrated with Distributed
 Lucene or the Katta project as these are designed keeping the
 structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Lucene and see which one suits your
use case better.

Ning


Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
 1) Katta and Distributed Lucene are different projects though, right? Both
 being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

 2) So, I should be able to use the hadoop.contrib.index with HDFS.
 Though, it would be much better if it is integrated with Distributed
 Lucene or the Katta project as these are designed keeping the
 structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Lucene and see which one suits your
use case better.

Ning


RE: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Deepika Khera
Hey guys,

I would appreciate any feedback on this.

Deepika

-Original Message-
From: Deepika Khera [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 06, 2008 5:39 PM
To: core-user@hadoop.apache.org
Subject: Distributed Lucene - from hadoop contrib

Hi,

I am planning to use distributed Lucene from hadoop.contrib.index for
indexing. Has anyone used this or tested it? Any issues or comments?

I see that the design described is different from HDFS (Namenode is
stateless, stores no information regarding blocks for files, etc.). Does
anyone know how hard it would be to set up this kind of system, or
whether there is something that can be reused?

A reference link:

http://wiki.apache.org/hadoop/DistributedLucene

Thanks,
Deepika



Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene
and hadoop.contrib.index are two different things.

For information on hadoop.contrib.index, see the README file in the package.

I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene
at http://katta.wiki.sourceforge.net/.

Ning

