Hi,
My 0.02 PLN on the subject ...
Terminology
-----------
First, the terminology: reading your emails I get the feeling that my head
is about to explode. We have to agree on the vocabulary, otherwise we
have no hope of reaching any consensus. I propose the following
vocabulary, which has been in use and is generally understood:
* (global) search index: a complete collection of all indexed documents.
From a conceptual point of view, this is our complete search space.
* index shard: a non-overlapping part of the search index. All shards in
the system together form the complete search space of the search index.
E.g., starting from one big index I could divide it into multiple
shards using MultiPassIndexSplitter, and if I then combined all the shards
again, using IndexMerger, I should obtain the original complete search
index (modulo changed Lucene docids, which doesn't matter). I strongly
believe in micro-sharding, because small shards are much easier to handle
and replicate. Also, since we control the shards, we don't have to deal
with overlapping shards, which are the curse of P2P search.
* partitioning: a method whereby we can determine the target shard ID
based on a doc ID.
* search node: an application that provides search and update to one or
more shards.
* search host: a machine that may run 1 or more search nodes.
* Shard Manager: a component that keeps track of allocation of shards to
nodes (plus more, see below).
Now, to translate this into Solr-speak: depending on the details of the
design, and the evolution of Solr, one search node could be one Solr
instance that manages one shard per core. Let's forget here about the
current distributed search component, and the current replication - they
could be useful in this design as a raw transport mechanism, but someone
else would be calling the shots (see below).
Architecture
------------
Replication and load balancing are problems with many existing
solutions, and this one in particular reminds me strongly of Hadoop
HDFS. In fact, early in the development of Hadoop [1] I wondered
whether we could reuse HDFS to manage Lucene indexes instead of opaque
blocks of fixed size. It turned out to be infeasible, but the
Namenode/Datanode model still looks useful in our case, too.
I believe there are many useful lessons lurking in
Hadoop/HBase/Zookeeper that we could reuse in our design. The following
is just a straightforward port of the Namenode/Datanode concept.
Let's imagine a component called ShardManager that is responsible for
the following (a minimal interface sketch follows this list):
* keeping the list of shard IDs that together form the complete search index,
* keeping, for each shard ID, the list of search nodes that serve this shard,
* issuing replication requests,
* maintaining the partitioning function (see below), so that updates are
directed to the correct shards,
* maintaining a heartbeat to check for dead nodes,
* providing search clients with a list of nodes to query in order to
obtain all results from the search index.
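
To make this concrete, here is a minimal sketch, in Java, of what the
ShardManager contract could look like. All names are mine and purely
illustrative, not an existing Solr or Lucene API:

import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the ShardManager contract described above.
public interface ShardManager {

  // The complete set of shard IDs that together form the global search index.
  Set<String> listShardIds();

  // For each shard ID, the search nodes currently serving a replica of it.
  Map<String, List<String>> getShardToNodes();

  // Called by a search node at startup (or after a commit) to report the
  // shards it holds locally, together with their versions.
  void reportShards(String nodeId, Map<String, Long> shardVersions);

  // Ask a node to pull a copy of a shard from one of its current holders;
  // issued when a replica is stale or the replication count is too low.
  void requestReplication(String targetNodeId, String shardId);

  // The partitioning function: map a document ID to its target shard.
  String shardForDoc(String docId);

  // Nodes a search front-end should query to cover the whole index.
  List<String> nodesForFullCoverage();
}
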
Whenever a new search node comes up, it reports its local shard IDs
(versioned) to the ShardManager. Based on these reports from the
currently active nodes, the ShardManager builds the mapping of shards
to nodes, and requests replication if some shards are too old or if the
replication count is too low, allocating those shards to selected nodes
(based on a policy of some kind).
I believe most of the above functionality could be facilitated by
Zookeeper, including the election of the node that runs the ShardManager.
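
As a rough illustration of how Zookeeper could carry the node
registration and heartbeat, here is a sketch using ephemeral znodes. The
znode layout and class names are assumptions of mine, not an agreed
design; parent znodes such as /search/nodes are assumed to be pre-created:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// A search node announces its local shards under an ephemeral znode. If the
// node dies, its session expires and the znode disappears, which doubles as
// the heartbeat / dead-node detection.
public class NodeRegistration implements Watcher {

  private final ZooKeeper zk;

  public NodeRegistration(String zkHosts) throws Exception {
    this.zk = new ZooKeeper(zkHosts, 30000, this);
  }

  public void register(String nodeId, String... shardIds)
      throws KeeperException, InterruptedException {
    // Illustrative layout: /search/nodes/<nodeId> holds the list of local
    // shard IDs (versions omitted here for brevity).
    String path = "/search/nodes/" + nodeId;
    byte[] data = String.join(",", shardIds).getBytes(StandardCharsets.UTF_8);
    zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  @Override
  public void process(WatchedEvent event) {
    // The elected ShardManager would watch /search/nodes and react to nodes
    // appearing or disappearing; that part is omitted here.
  }
}

Leader election for the ShardManager itself can use the same primitive
(the well-known Zookeeper recipe based on ephemeral sequential znodes).
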
Updates
-------
We need a partitioning scheme that splits documents more or less evenly
among shards, and at the same time allows us to split or merge
unbalanced shards. The simplest function that we could imagine is the
following:
hash(docId) % numShards
though this has the disadvantage that any larger update will affect
multiple shards, thus creating an avalanche of replication requests ...
so a sequential model would probably be better, where ranges of docIds
are assigned to shards.
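
For completeness, the hash scheme as runnable Java (names are mine); the
range-based alternative, including how splits are recorded, is sketched
after the next paragraph:

// Hypothetical sketch of the hash partitioning scheme discussed above.
public class HashPartitioner {

  // Spreads documents evenly, but a bulk update touches (and therefore
  // re-replicates) nearly every shard.
  public static int shardFor(String docId, int numShards) {
    // Math.floorMod avoids negative results for negative hash codes.
    return Math.floorMod(docId.hashCode(), numShards);
  }
}
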
Now, if any particular shard becomes too unbalanced, e.g. too large, it
could be further split into two halves, and the ShardManager would have
to record this exception. This is very similar to a region split in
HBase, or a page split in B-tree databases. Conversely, shards that are
too small could be joined. This is the icing on the cake, so we can
leave it for later.
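
A range-based routing table that can also record splits might look like
the sketch below. This is a guess at one possible representation (the
lower bound of each docId range mapped to its shard), not a worked-out
design:

import java.util.TreeMap;

// Hypothetical sketch: range partitioning with split recording, analogous
// to an HBase region split.
public class RangeRoutingTable {

  // Maps the inclusive lower bound of each docId range to the shard ID that
  // owns [lowerBound, nextLowerBound).
  private final TreeMap<String, String> lowerBoundToShard = new TreeMap<>();

  public RangeRoutingTable(String initialShardId) {
    // Initially a single shard owns the whole docId space.
    lowerBoundToShard.put("", initialShardId);
  }

  public String shardFor(String docId) {
    // The greatest lower bound not exceeding the docId owns it.
    return lowerBoundToShard.floorEntry(docId).getValue();
  }

  // Record a split: docIds at or above splitKey move to a new shard, while
  // everything below stays with the original one. Merging two adjacent
  // ranges would simply remove the boundary entry.
  public void split(String splitKey, String newShardId) {
    lowerBoundToShard.put(splitKey, newShardId);
  }
}
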
After a commit, a node contacts the ShardManager to report the new
version of the shard. The ShardManager then issues replication requests
to the other nodes that hold a replica of this shard.
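
In code, the ShardManager-side reaction could be as simple as the
following sketch (again, all names are mine; the transport for the actual
copy is left out):

import java.util.List;
import java.util.Map;

// Hypothetical sketch: when a node reports a newer shard version, every
// other holder of that shard is asked to pull the new version.
public class ReplicationTrigger {

  // Current shard-to-nodes assignment maintained by the ShardManager.
  private final Map<String, List<String>> replicaNodes;

  public ReplicationTrigger(Map<String, List<String>> replicaNodes) {
    this.replicaNodes = replicaNodes;
  }

  public void onShardVersionReported(String reportingNodeId, String shardId) {
    for (String nodeId : replicaNodes.getOrDefault(shardId, List.of())) {
      if (!nodeId.equals(reportingNodeId)) {
        requestReplication(nodeId, shardId); // ask the stale replica to pull
      }
    }
  }

  private void requestReplication(String nodeId, String shardId) {
    // Transport omitted; in Solr terms the existing replication mechanism
    // could serve as the raw copy channel, as suggested earlier.
  }
}
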
Search
------
There should be a component, sometimes referred to as a query integrator
(or search front-end), that is the entry and exit point for user search
requests. On receiving a search request, this component gets from the
ShardManager a list of randomly selected nodes to contact (a list
covering all shards that form the global index), sends them the query,
integrates the partial results (under a configurable policy for
timeouts/early termination), and sends the assembled results back to
the user.
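
A minimal sketch of the scatter-gather step, assuming a per-shard client
interface of my own invention, could look like this; the timeout passed
to invokeAll is where the configurable timeout / early-termination policy
would plug in:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the query integrator's scatter-gather step.
// SearchNodeClient and PartialResult are assumed interfaces, not existing
// Solr classes; merging/re-ranking of partial results is omitted.
public class QueryIntegrator {

  private final ExecutorService pool = Executors.newCachedThreadPool();

  public List<PartialResult> search(String query,
                                    List<SearchNodeClient> nodes,
                                    long timeoutMs) throws InterruptedException {
    // One sub-query per node; the node list (one randomly selected replica
    // per shard) would come from the ShardManager / Zookeeper.
    List<Callable<PartialResult>> tasks = new ArrayList<>();
    for (SearchNodeClient node : nodes) {
      tasks.add(() -> node.query(query));
    }
    // invokeAll enforces the timeout: sub-queries that do not finish in
    // time are cancelled, which gives us early termination.
    List<Future<PartialResult>> futures =
        pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
    List<PartialResult> partials = new ArrayList<>();
    for (Future<PartialResult> f : futures) {
      if (!f.isCancelled()) {
        try {
          partials.add(f.get());
        } catch (Exception e) {
          // A failed node simply drops out of the result set in this sketch.
        }
      }
    }
    return partials;
  }

  // Minimal assumed interfaces, purely for this sketch.
  public interface PartialResult {}
  public interface SearchNodeClient { PartialResult query(String q); }
}

Because the timeout applies to the whole batch of sub-queries, a slow
shard delays the response by at most that configured amount.
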
Again, somewhere in the background the knowledge of who to contact
should be handled by Zookeeper.
That's it for now, off the top of my head ...
-----------
[1]
http://www.mail-archive.com/nutch-develop...@lists.sourceforge.net/msg02273.html
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com