Doug Cutting wrote:
Ning,
I am also interested in starting a new project in this area. The
approach I have in mind is slightly different, but hopefully we can come
to some agreement and collaborate.
I'm interested in this too.
My current thinking is that the Solr search API is the appropriate
model. Solr's facets are an important feature that require low-level
support to be practical. Thus a useful distributed search system should
support facets from the outset, rather than attempt to graft them on
later. In particular, I believe this requirement mandates disjoint shards.
I agree - shards should be disjoint also because if we eventually want
to manage multiple replicas of each shard across the cluster (for
reliability and performance) then overlapping documents would complicate
both the query dispatching process and the merging of partial result sets.
My primary difference with your proposal is that I would like to support
online indexing. Documents could be inserted and removed directly, and
shards would synchronize changes amongst replicas, with an "eventual
consistency" model. Indexes would not be stored in HDFS, but directly
on the local disk of each node. Hadoop would perhaps not play a role.
In many ways this would resemble CouchDB, but with explicit support for
sharding and failover from the outset.
It's true that searching over HDFS is slow - but I'd hate to lose all
other HDFS benefits and have to start from scratch ... I wonder what
would be the performance of FsDirectory over an HDFS index that is
"pinned" to a local disk, i.e. a full local replica is available, with
block size of each index file equal to the file size.
A particular client should be able to provide a consistent read/write
view by bonding to particular replicas of a shard. Thus a user who
makes a modification should be able to generally see that modification
in results immediately, while other users, talking to different
replicas, may not see it until synchronization is complete.
This requires that we use versioning, and that we have a "shard manager"
that knows the latest versions of each shard among the whole active set
- or that clients discover this dynamically by querying the shard
servers every now and then.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com