On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single client's ability to simultaneously pull and sort data from all the other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just re-reducing results. The natural "shape" of that computation is a tree, with only the final root node at the top being the bottleneck, but now it only has to maintain connections to, and merge the sorted values from, far fewer nodes.
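As a rough illustration of the tree-shaped merge described above, here is a minimal Python sketch (not CouchDB code; the per-node results and the fanout value are made up for the example). Each "level" of the tree merges a small group of already-sorted streams, so no single node ever has to pull from every other node at once:

```python
import heapq

def merge_level(sorted_runs, fanout=2):
    """Merge sorted per-node result lists in groups of `fanout`,
    producing one merged list per group (one level of the tree)."""
    return [list(heapq.merge(*sorted_runs[i:i + fanout]))
            for i in range(0, len(sorted_runs), fanout)]

def tree_merge(sorted_runs, fanout=2):
    """Repeat until a single globally sorted result remains. At each
    level, any one merging node only combines `fanout` streams, which
    is what keeps the root from becoming the bottleneck."""
    while len(sorted_runs) > 1:
        sorted_runs = merge_level(sorted_runs, fanout)
    return sorted_runs[0]

# Four hypothetical nodes, each returning view rows already sorted by key:
node_results = [[1, 5, 9], [2, 6], [3, 7, 8], [4]]
print(tree_merge(node_results))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```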
-Damien
That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability.
I see partitioning and clustering as 2 different things. Partitioning is data partitioning, spreading the data out across nodes, no node having the complete database. Clustering is nodes having the same, or nearly the same, data (they might be behind on replicating changes, but otherwise they have the same data).
Partitioning would primarily increase write performance (updates happening concurrently on many nodes) and the size of the data set. Partitioning helps with client read scalability, but only for document reads, not view queries. Partitioning alone could reduce reliability, depending on how tolerant you are to missing portions of the database.
Clustering would primarily address database reliability (failover) and client read scalability for both docs and views. Clustering doesn't help much with write performance because even if you spread out the update load, the replication as the cluster syncs up means every node gets the update anyway. It might be useful for dealing with update spikes, where you get a bunch of updates at once and can wait for the replication delay to get everyone synced back up.
For a really big, really reliable database, I'd have clusters of partitions, where the database is partitioned N ways, with each partition having at least M identical cluster members. Increase N for larger databases and update load, M for higher availability and read load.
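To make the N × M layout above concrete, here is a small Python sketch (purely illustrative; the hash-based routing scheme and node names are assumptions, not anything CouchDB specifies). The database is split into N partitions, each held by M identical replicas, for N * M nodes total; a document's id hashes to exactly one partition, and any of that partition's M members can serve it:

```python
import hashlib

N = 4  # partitions: increase for larger databases and update load
M = 3  # identical cluster members per partition: increase for availability and read load

def partition_for(doc_id, n=N):
    """Hash the doc id to pick the owning partition (illustrative scheme)."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % n

# nodes[p] is the list of M replicas that all hold partition p's data
nodes = [[f"node-{p}-{m}" for m in range(M)] for p in range(N)]

p = partition_for("user:1234")
print(f"doc 'user:1234' -> partition {p}, any of replicas {nodes[p]}")
print(f"total nodes: {N * M}")
```

A write for a doc goes only to its partition's M members (concurrent writes spread across partitions), while a clustered read can hit whichever replica is up, which is the reliability side of the trade.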
-Damien