Hi,

My 0.02 PLN on the subject ...

Terminology
-----------
First the terminology: reading your emails, I have a feeling that my head is about to explode. We have to agree on the vocabulary; otherwise we have no hope of reaching any consensus. I propose the following terms, which have been in use for a while and are generally understood:

* (global) search index: a complete collection of all indexed documents. From a conceptual point of view, this is our complete search space.

* index shard: a non-overlapping part of the search index. Together, all shards in the system form the complete search space of the search index. E.g., starting from one big index, I could divide it into multiple shards using MultiPassIndexSplitter, and if I then combined all the shards again using IndexMerger, I should obtain the original complete search index (modulo changed Lucene docids, which doesn't matter). I strongly believe in micro-sharding, because small shards are much easier to handle and replicate. Also, since we control the shards, we don't have to deal with overlapping shards, which is the curse of P2P search.

* partitioning: a method whereby we can determine the target shard ID based on a doc ID.

* search node: an application that provides search and update to one or more shards.

* search host: a machine that may run 1 or more search nodes.

* Shard Manager: a component that keeps track of allocation of shards to nodes (plus more, see below).

Now, to translate this into Solr-speak: depending on the details of the design and the evolution of Solr, one search node could be one Solr instance that manages one shard per core. Let's forget for now about the current distributed search component and the current replication - they could be useful in this design as a raw transport mechanism, but someone else would be calling the shots (see below).

Architecture
------------
Replication and load balancing is a problem with many existing solutions, and this one in particular reminds me strongly of Hadoop HDFS. In fact, early on during the development of Hadoop [1] I wondered whether we could reuse HDFS to manage Lucene indexes instead of opaque blocks of fixed size. It turned out to be infeasible, but the Namenode/Datanode model still looks useful in our case, too.

I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper that we could reuse in our design. The following is just a straightforward port of the Namenode/Datanode concept.

Let's imagine a component called ShardManager that is responsible for the following (a rough sketch in code follows the list):

* maintaining the list of shard IDs that together form the complete search index,
* for each shard ID, maintaining the list of search nodes that serve this shard,
* issuing replication requests,
* maintaining the partitioning function (see below), so that updates are directed to the correct shards,
* maintaining a heartbeat to check for dead nodes,
* providing search clients with a list of nodes to query in order to obtain all results from the search index.
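
To make this concrete, here is a minimal Java sketch of such a component. All names are made up for illustration, nodes are identified simply by "host:port" strings, and shard versions are plain longs:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch only -- not an existing Solr/Lucene API.
    public interface ShardManager {

      /** Shard IDs that together form the complete search index. */
      Set<String> listShards();

      /** Nodes currently serving a given shard (its replicas). */
      List<String> nodesForShard(String shardId);

      /** Called by a search node to report its local shards and their versions. */
      void reportShards(String node, Map<String, Long> shardVersions);

      /** Ask 'target' to pull a fresh replica of a shard from 'source'. */
      void requestReplication(String shardId, String source, String target);

      /** The partitioning function: unique document ID -> target shard ID. */
      String shardForDoc(String docId);

      /** A set of nodes that together cover every shard, for a search client. */
      List<String> nodesToQuery();
    }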

Whenever a new search node comes up, it reports its local shard IDs (versioned) to the ShardManager. Based on these reports from the currently active nodes, the ShardManager builds the mapping of shards to nodes, and requests replication if some shards are too old or if their replication count is too low, allocating these shards to selected nodes (based on a policy of some kind).
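
As a rough sketch of that decision logic - assuming a fixed desired replication factor and a naive "pick any other live node" allocation policy (a real policy would consider load, racks, etc.); again, purely illustrative code:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative only: how the ShardManager could react to a node report,
    // replicating shards that are stale or under-replicated.
    public class ReplicationPlanner {

      private final int replicationFactor;  // desired replica count
      // shard -> newest version seen so far
      private final Map<String, Long> latestVersion = new HashMap<String, Long>();
      // shard -> nodes known to hold that newest version
      private final Map<String, Set<String>> upToDateNodes = new HashMap<String, Set<String>>();

      public ReplicationPlanner(int replicationFactor) {
        this.replicationFactor = replicationFactor;
      }

      /** Called whenever a node reports its local shards and their versions. */
      public void onReport(String node, Map<String, Long> shardVersions, List<String> liveNodes) {
        for (Map.Entry<String, Long> e : shardVersions.entrySet()) {
          String shard = e.getKey();
          long version = e.getValue();
          Long latest = latestVersion.get(shard);
          if (latest == null || version > latest) {
            // Newer version of the shard: everything we knew before is now stale.
            latestVersion.put(shard, version);
            upToDateNodes.put(shard, new HashSet<String>());
          }
          Set<String> replicas = upToDateNodes.get(shard);
          if (version == latestVersion.get(shard).longValue()) {
            replicas.add(node);
          }
          // Too few up-to-date replicas? Ask other live nodes to pull the shard
          // from one of the nodes that do hold the latest version.
          if (!replicas.isEmpty()) {
            String source = replicas.iterator().next();
            for (String candidate : liveNodes) {
              if (replicas.size() >= replicationFactor) break;
              if (!replicas.contains(candidate)) {
                requestReplication(shard, source, candidate);
                replicas.add(candidate);  // optimistic; confirmed by its next report
              }
            }
          }
        }
      }

      /** Placeholder: in a real system this would be an RPC to the target node. */
      protected void requestReplication(String shard, String source, String target) {
        System.out.println("replicate " + shard + " from " + source + " to " + target);
      }
    }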

I believe most of the above functionality could be facilitated by Zookeeper, including the election of the node that runs the ShardManager.
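
For the election part, the standard ZooKeeper recipe based on ephemeral sequential znodes would do. A minimal sketch (the paths are made up, the parent znodes are assumed to exist already, and a complete recipe would also watch the predecessor znode so that a new leader takes over when the current one dies):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative only: elect the node that runs the ShardManager.
    public class ShardManagerElection {

      /** Returns true if this node won the election. */
      public static boolean runForLeader(ZooKeeper zk, String myHostPort) throws Exception {
        // Each candidate creates an ephemeral, sequentially numbered znode;
        // it disappears automatically when the candidate's session dies.
        String me = zk.create("/shardmanager/election/candidate-",
            myHostPort.getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate with the lowest sequence number runs the ShardManager.
        List<String> candidates = zk.getChildren("/shardmanager/election", false);
        Collections.sort(candidates);
        return me.endsWith(candidates.get(0));
      }
    }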

Updates
-------
We need a partitioning scheme that splits documents more or less evenly among shards and at the same time allows us to split or merge unbalanced shards. The simplest function that we could imagine is the following:

        hash(docId) % numShards

though this has the disadvantage that any larger update will affect multiple shards, thus creating an avalanche of replication requests ... so a sequential model would probably be better, where ranges of docIds are assigned to shards.
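
Both variants in a nutshell, as illustrative (non-Solr) code; the range partitioner keeps a sorted array of lower bounds, one per shard:

    // Illustrative only -- two simple partitioning functions.
    public class Partitioners {

      /** hash(docId) % numShards: spreads any update across all shards. */
      public static int hashShard(String docId, int numShards) {
        // mask the sign bit so the modulo is never negative
        return (docId.hashCode() & 0x7fffffff) % numShards;
      }

      /** Range partitioning: consecutive doc IDs land in the same shard, so a
          bulk update touches only a few shards. 'lowerBounds' holds the sorted
          lower bound of each shard's ID range, lowerBounds[0] being the minimum. */
      public static int rangeShard(long docId, long[] lowerBounds) {
        int shard = 0;
        for (int i = 1; i < lowerBounds.length; i++) {
          if (docId >= lowerBounds[i]) {
            shard = i;
          } else {
            break;
          }
        }
        return shard;
      }
    }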

Now, if any particular shard becomes too unbalanced, e.g. too large, it could be further split in two halves, and the ShardManager would have to record this exception. This is a process very similar to a region split in HBase, or a page split in b-tree databases. Conversely, shards that are too small could be merged. This is the icing on the cake, so we can leave it for later.
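
On the partitioning side, such a split amounts to recording one extra boundary (the shard itself would of course have to be physically split as well, e.g. with MultiPassIndexSplitter). A tiny sketch, reusing the lowerBounds array from the previous snippet and again purely illustrative:

    import java.util.Arrays;

    // Illustrative only: split an over-sized range shard [low, high) into
    // [low, mid) and [mid, high), like an HBase region split.
    public class RangeSplit {

      /** Inserts the midpoint of shard i's range into the sorted bounds array. */
      public static long[] split(long[] lowerBounds, int i, long upperBound) {
        long low = lowerBounds[i];
        long high = (i + 1 < lowerBounds.length) ? lowerBounds[i + 1] : upperBound;
        long mid = low + (high - low) / 2;

        long[] result = Arrays.copyOf(lowerBounds, lowerBounds.length + 1);
        // shift the bounds after position i one slot to the right
        System.arraycopy(result, i + 1, result, i + 2, lowerBounds.length - (i + 1));
        result[i + 1] = mid;
        return result;
      }
    }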

After a commit, a node contacts the ShardManager to report the new version of the shard. The ShardManager then issues replication requests to the other nodes that hold a replica of this shard.

Search
------
There should be a component, sometimes referred to as a query integrator (or search front-end), that is the entry and exit point for user search requests. On receiving a search request, this component gets from the ShardManager a list of randomly selected nodes to contact (the list covering all shards that form the global index), sends out the query, integrates the partial results (under a configurable policy for timeouts/early termination), and sends the assembled results back to the user.
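
A bare-bones scatter/gather skeleton of that component, with the per-node transport call and the result type left as placeholders, just to show where the timeout/early-termination policy would plug in:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: fan a query out to one node per shard group, then
    // merge whatever comes back within the time budget.
    public class QueryIntegrator {

      private final ExecutorService pool = Executors.newCachedThreadPool();

      public List<String> search(final String query, List<String> nodesToQuery,
                                 long timeoutMs) throws InterruptedException {
        // One sub-query per node; each node covers one or more shards.
        List<Callable<List<String>>> tasks = new ArrayList<Callable<List<String>>>();
        for (final String node : nodesToQuery) {
          tasks.add(new Callable<List<String>>() {
            public List<String> call() throws Exception {
              return queryNode(node, query);  // placeholder transport call
            }
          });
        }

        // Sub-queries that miss the deadline are cancelled and their results
        // dropped -- i.e. we return partial results rather than wait forever.
        List<String> merged = new ArrayList<String>();
        for (Future<List<String>> f : pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS)) {
          try {
            merged.addAll(f.get());
          } catch (Exception timedOutOrFailed) {
            // skip this node; the policy could also retry another replica
          }
        }
        return merged;  // would normally be re-sorted by score
      }

      /** Placeholder for the actual per-node search call. */
      protected List<String> queryNode(String node, String query) throws Exception {
        return new ArrayList<String>();
      }
    }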

Again, somewhere in the background the knowledge of who to contact should be handled by Zookeeper.

That's it for now, off the top of my head ...

-----------

[1] http://www.mail-archive.com/nutch-develop...@lists.sourceforge.net/msg02273.html

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
