Re: Lucene-based Distributed Index Leveraging Hadoop

Andrzej Bialecki Thu, 07 Feb 2008 11:54:22 -0800

Doug Cutting wrote:

Ning,
I am also interested in starting a new project in this area. The approach I have in mind is slightly different, but hopefully we can come to some agreement and collaborate.


I'm interested in this too.

My current thinking is that the Solr search API is the appropriate model. Solr's facets are an important feature that requires low-level support to be practical. Thus a useful distributed search system should support facets from the outset, rather than attempt to graft them on later. In particular, I believe this requirement mandates disjoint shards.

I agree - shards should be disjoint also because if we eventually want to manage multiple replicas of each shard across the cluster (for reliability and performance) then overlapping documents would complicate both the query dispatching process and the merging of partial result sets.
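
To illustrate why disjoint shards keep merging simple, here is a minimal Python sketch of combining partial result sets. It is only an illustration, not code from any existing project: the function name merge_shard_results and the result layout (a "hits" list of (score, doc_id) pairs plus a "facets" map) are made up for this example, and it assumes that scores are comparable across shards, which a real system would have to arrange (e.g. by sharing term statistics).

```python
import heapq
from collections import Counter


def merge_shard_results(shard_results, k):
    """Merge partial results from disjoint shards (hypothetical layout).

    Each element of shard_results is a dict with:
      "hits":   list of (score, doc_id) pairs from one shard
      "facets": dict mapping facet value -> count within that shard
    """
    # Every document lives in exactly one shard, so no de-duplication
    # is needed: the global top-k is simply the k best hits overall.
    all_hits = (hit for result in shard_results for hit in result["hits"])
    top_hits = heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

    # For the same reason, per-shard facet counts can be added directly.
    # With overlapping shards a document could be counted more than once.
    facets = Counter()
    for result in shard_results:
        facets.update(result["facets"])

    return top_hits, dict(facets)


shard_a = {"hits": [(0.9, "a1"), (0.4, "a2")], "facets": {"pdf": 2, "html": 1}}
shard_b = {"hits": [(0.7, "b1")], "facets": {"pdf": 1}}
top, facet_counts = merge_shard_results([shard_a, shard_b], k=2)
```

If shards overlapped, both steps would need document-level bookkeeping to avoid returning or counting the same document twice.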

My primary difference with your proposal is that I would like to support online indexing. Documents could be inserted and removed directly, and shards would synchronize changes amongst replicas, with an "eventual consistency" model. Indexes would not be stored in HDFS, but directly on the local disk of each node. Hadoop would perhaps not play a role. In many ways this would resemble CouchDB, but with explicit support for sharding and failover from the outset.

It's true that searching over HDFS is slow - but I'd hate to lose all other HDFS benefits and have to start from scratch ... I wonder what would be the performance of FsDirectory over an HDFS index that is "pinned" to a local disk, i.e. a full local replica is available, with block size of each index file equal to the file size.

A particular client should be able to provide a consistent read/write view by bonding to particular replicas of a shard. Thus a user who makes a modification should be able to generally see that modification in results immediately, while other users, talking to different replicas, may not see it until synchronization is complete.

This requires that we use versioning, and that we have a "shard manager" that knows the latest versions of each shard among the whole active set - or that clients discover this dynamically by querying the shard servers every now and then.
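
A rough Python sketch of the bookkeeping such a shard manager might do is below. Everything in it is hypothetical (the class name ShardManager, its methods, and the use of a single increasing integer as the shard version); it only shows the idea of tracking per-replica versions and letting a client stay bonded to one replica.

```python
class ShardManager:
    """Tracks the last version reported by each replica of each shard."""

    def __init__(self):
        # shard_id -> {replica_name: last reported version}
        self._versions = {}

    def report(self, shard_id, replica, version):
        # Replicas (or polling clients) report the version they hold.
        # Versions are assumed to only move forward.
        replicas = self._versions.setdefault(shard_id, {})
        replicas[replica] = max(version, replicas.get(replica, 0))

    def latest_version(self, shard_id):
        # Highest version seen on any replica, or None for an unknown shard.
        return max(self._versions.get(shard_id, {}).values(), default=None)

    def current_replicas(self, shard_id):
        # Replicas that have already caught up to the latest version.
        replicas = self._versions.get(shard_id, {})
        latest = self.latest_version(shard_id)
        return sorted(r for r, v in replicas.items() if v == latest)

    def pick_replica(self, shard_id, bonded=None):
        # A bonded client keeps talking to the replica it wrote to, so it
        # sees its own changes; other clients get an up-to-date replica.
        if bonded in self._versions.get(shard_id, {}):
            return bonded
        current = self.current_replicas(shard_id)
        return current[0] if current else None


manager = ShardManager()
manager.report("shard-0", "nodeA", 3)
manager.report("shard-0", "nodeB", 5)
manager.report("shard-0", "nodeC", 5)
```

The alternative of clients discovering versions dynamically would amount to each client calling something like report() itself with what the shard servers tell it, instead of relying on a central instance.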


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
