Doug Cutting wrote:
Ning,

I am also interested in starting a new project in this area. The approach I have in mind is slightly different, but hopefully we can come to some agreement and collaborate.

I'm interested in this too.

My current thinking is that the Solr search API is the appropriate model. Solr's facets are an important feature that requires low-level support to be practical. Thus a useful distributed search system should support facets from the outset, rather than attempt to graft them on later. In particular, I believe this requirement mandates disjoint shards.

I agree. Shards should also be disjoint because, if we eventually want to manage multiple replicas of each shard across the cluster (for reliability and performance), overlapping documents would complicate both query dispatching and the merging of partial result sets.
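
To illustrate why disjoint shards keep the merge step simple, here is a minimal sketch in Python. The function name `merge_shard_results` and the shape of the partial results (a list of `(score, doc_id)` hits plus a dict of facet counts per shard) are assumptions made for this example, not the Solr API or any existing implementation. With disjoint shards, no document can appear in two partial results, so hits need no de-duplication and facet counts can simply be summed.

```python
import heapq
from collections import Counter


def merge_shard_results(partials, k):
    """Merge per-shard partial results into one global result.

    partials: list of dicts, one per shard, each with
        "hits":   list of (score, doc_id) tuples
        "facets": dict mapping facet value -> count
    k: number of top hits to return

    Assumes shards are disjoint, so a doc_id never appears in two shards.
    """
    # Top-k over all hits; no de-duplication is needed for disjoint shards.
    all_hits = (hit for partial in partials for hit in partial["hits"])
    top_hits = heapq.nlargest(k, all_hits)

    # Each document was counted by exactly one shard, so counts just add up.
    facets = Counter()
    for partial in partials:
        facets.update(partial["facets"])

    return top_hits, dict(facets)


shard_a = {"hits": [(0.9, "a1"), (0.4, "a2")], "facets": {"pdf": 2, "html": 1}}
shard_b = {"hits": [(0.7, "b1"), (0.5, "b2")], "facets": {"pdf": 1}}
hits, facets = merge_shard_results([shard_a, shard_b], k=3)
```

If shards overlapped, both steps would need extra work: the hit list would have to be de-duplicated by document, and the facet counts would overcount any document held by more than one shard.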


My primary difference with your proposal is that I would like to support online indexing. Documents could be inserted and removed directly, and shards would synchronize changes amongst replicas, with an "eventual consistency" model. Indexes would not be stored in HDFS, but directly on the local disk of each node. Hadoop would perhaps not play a role. In many ways this would resemble CouchDB, but with explicit support for sharding and failover from the outset.

It's true that searching over HDFS is slow, but I'd hate to lose all the other HDFS benefits and have to start from scratch. I wonder what the performance of FsDirectory would be over an HDFS index that is "pinned" to a local disk, i.e. a full local replica is available, with the block size of each index file equal to the file size.


A particular client should be able to provide a consistent read/write view by bonding to particular replicas of a shard. Thus a user who makes a modification should generally be able to see that modification in results immediately, while other users, talking to different replicas, may not see it until synchronization is complete.

This requires versioning, and a "shard manager" that knows the latest version of each shard across the whole active set. Alternatively, clients could discover this dynamically by querying the shard servers every now and then.
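
As a rough sketch of the versioning idea, the Python below models a shard manager that records the version each replica reports and lets a client find replicas that have caught up to its own last write. The class `ShardManager`, its method names, and the use of a single monotonically increasing integer version per shard are all assumptions made for illustration; nothing here is an existing API.

```python
class ShardManager:
    """Tracks the latest known version of every replica of every shard.

    Hypothetical sketch: versions are assumed to be integers that only
    ever increase as changes are applied to a replica.
    """

    def __init__(self):
        # shard_id -> {replica_name: latest reported version}
        self._versions = {}

    def report(self, shard_id, replica, version):
        """Record a version reported by a replica; stale reports are ignored."""
        replicas = self._versions.setdefault(shard_id, {})
        replicas[replica] = max(version, replicas.get(replica, 0))

    def latest_version(self, shard_id):
        """Highest version seen on any replica of the shard."""
        return max(self._versions[shard_id].values())

    def replicas_at_least(self, shard_id, min_version):
        """Replicas that already contain every change up to min_version."""
        replicas = self._versions[shard_id]
        return sorted(r for r, v in replicas.items() if v >= min_version)


manager = ShardManager()
manager.report("shard0", "nodeA", 3)
manager.report("shard0", "nodeB", 5)

# A client whose last write produced version 5 should only read from
# replicas that have reached version 5, so that it sees its own change.
candidates = manager.replicas_at_least("shard0", 5)
```

The same bookkeeping could live on the client side instead, with each client polling the shard servers for their versions; the sketch only shows the central variant.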

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
