Re: Partitioned Clusters

2009-02-21 Thread Ben Browning
On Fri, Feb 20, 2009 at 7:34 PM, Chris Anderson  wrote:
> I think so. I think that there could be proxy overlap / redundancy
> across all levels of the tree, and also in the case of a flat tree.
>
> As long as the proxies agree on how to hash from URLs to nodes it
> should just work.

I've been thinking about how to address the issue of allowing
different configurations for different needs. I think if all we do is
tell a proxy node who its children are, how to map IDs to those
children, and allow a proxy to also be a node, we can handle almost
any configuration.

Examples:
* All Peers - 2 nodes in the system, A & B. A is configured so odd IDs
map to A, even IDs map to B. B is configured with the same ID ranges.
You can load-balance across nodes A & B and take advantage of
increased write throughput. This is probably the simplest clustering
scenario for people that don't have enough traffic to fully utilize a
standalone proxy node.

* 1 or more proxies, multiple nodes - The proxies are all configured
identically to map document IDs among nodes A-J. Nodes A-J know
nothing about each other or their parents. In this scenario you can
very easily add proxy nodes as needed to handle the increased load
when aggregating results from more nodes.

* Tree structure - The top-level proxies are configured to map
document IDs to nodes. These nodes may in fact be other proxies which
are then configured to map to their nodes. Except for multiple levels
of proxies, this is the same as the above scenario.
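
To make the ID-to-child mapping concrete, here's a minimal sketch of the kind
of routing table a proxy (or a peer node, in the first example) might hold.
The node names, URLs, and the choice of MD5 are illustrative only - the point
is just that every proxy applies the same deterministic function:

    import hashlib

    # Illustrative routing table: children and the URLs they answer on.
    # In the "all peers" example, nodes A and B would each hold this same table.
    CHILDREN = {
        "node-a": "http://node-a:5984",
        "node-b": "http://node-b:5984",
    }

    def child_for_docid(docid):
        # Hash the docid and map it onto one of the configured children.
        h = int(hashlib.md5(docid.encode("utf-8")).hexdigest(), 16)
        names = sorted(CHILDREN)          # stable ordering shared by all proxies
        return CHILDREN[names[h % len(names)]]

    # Any proxy configured with the same table routes a given doc to the same child:
    print(child_for_docid("some-doc-id"))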

Does it sound reasonable to expect a proxy to be aware of its children
but not vice-versa? In an actual implementation I see the list of
children and their mappings being stored in a document so that it
could be updated while running to add/remove children.

Adding a child in this scenario would involve choosing an ID range,
replicating the relevant data from the other children, and updating
this mapping. This would depend on partial replication to replicate
only the data needed for the new child. I don't see this as something
that's too complex - the only issue I see is that you'll probably need to
replicate data at least twice: once before the proxy mapping is
updated, and once after to catch any final data that was written to the
other children since the first replication. This also assumes you've
chosen a consistent hashing algorithm so that the data on all nodes
doesn't have to change when adding a single new node.
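
For the consistent-hashing piece, a rough sketch of a hash ring, where a newly
added child only takes over keys from its immediate neighbor instead of forcing
every doc to move (node names and the single-point-per-node ring are
simplifications; real rings usually add virtual nodes per server):

    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    class Ring:
        # Toy consistent-hash ring: each node owns the arc up to its point.
        def __init__(self, nodes=()):
            self._points = []                       # sorted (hash, node) pairs
            for n in nodes:
                self.add(n)

        def add(self, node):
            bisect.insort(self._points, (_hash(node), node))

        def node_for(self, docid):
            keys = [h for h, _ in self._points]
            i = bisect.bisect(keys, _hash(docid)) % len(self._points)
            return self._points[i][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    docs = ["doc-%d" % i for i in range(1000)]
    before = {d: ring.node_for(d) for d in docs}
    ring.add("node-d")                              # the new child joins
    moved = sum(1 for d in docs if ring.node_for(d) != before[d])
    print("docs that move to the new child:", moved, "of", len(docs))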

Removing a child node would be the opposite process. I could foresee
us coming up with a tool to automate most if not all of this process,
possibly only requiring the user to start the new CouchDB server, fill
in some values in Futon for ID mappings, and press a button.


Sound reasonable?

- Ben


Re: Partitioned Clusters

2009-02-20 Thread Chris Anderson
On Fri, Feb 20, 2009 at 4:15 PM, Mike Malone  wrote:
> Hi, I don't think I've commented on this list before so let me briefly
> introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer
> (primarily web dev) and have some experience working with large clustered
> databases. I worked for Pownce.com, but moved to Six Apart when they
> acquired Pownce in November 2008.
>
> I like the idea of a tree-structure since it's simple to understand and
> implement, but I think there may be cases where having multiple top-level
> proxies may make sense.

I think so. I think that there could be proxy overlap / redundancy
across all levels of the tree, and also in the case of a flat tree.

As long as the proxies agree on how to hash from URLs to nodes it
should just work.

-- 
Chris Anderson
http://jchris.mfdz.com


Re: Partitioned Clusters

2009-02-20 Thread Mike Malone
Hi, I don't think I've commented on this list before so let me briefly
introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer
(primarily web dev) and have some experience working with large clustered
databases. I worked for Pownce.com, but moved to Six Apart when they
acquired Pownce in November 2008.

I like the idea of a tree-structure since it's simple to understand and
implement, but I think there may be cases where having multiple top-level
proxies may make sense. As Damien pointed out, the top-level proxies will
need to re-reduce / merge the documents from each partition, which may
become a bottleneck; he also pointed out that a tree structure would help to
mitigate this problem by moving some of the work to sub-nodes. But couldn't
you also add additional top-level proxies (with clients randomly choosing
one to communicate with) to increase capacity without requiring a tree
structure? This would also remove the top-level proxy as a single point of
failure for the system.

Mike

On Fri, Feb 20, 2009 at 2:55 PM, Chris Anderson  wrote:

> On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz  wrote:
> >
> > On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
> >
> >>> Trees would be overkill except for very large clusters.
> >>>
> >>> With CouchDB map views, you need to combine results from every node in a
> >>> big merge sort. If you combine all results at a single node, the single
> >>> client's ability to simultaneously pull data and sort data from all other
> >>> nodes may become the bottleneck. So to parallelize, you have multiple
> >>> nodes doing a merge sort of sub-nodes, then sending those results to
> >>> another node to be combined further, etc. The same with the reduce views,
> >>> but instead of a merge sort it's just rereducing results. The natural
> >>> "shape" of that computation is a tree, with only the final root node at
> >>> the top being the bottleneck, but now it has to maintain connections and
> >>> merge the sort values from far fewer nodes.
> >>>
> >>> -Damien
> >>
> >> That makes sense and it clarifies one of my questions about this topic. Is
> >> the goal of partitioned clustering to increase performance for very large
> >> data sets, or to increase reliability? It would seem from this answer that
> >> the goal is to increase query performance by distributing the query
> >> processing, and not to increase reliability.
> >
> > I see partitioning and clustering as 2 different things. Partitioning is
> > data partitioning, spreading the data out across nodes, no node having the
> > complete database. Clustering is nodes having the same, or nearly the same,
> > data (they might be behind on replicating changes, but otherwise they have
> > the same data).
> >
> > Partitioning would primarily increase write performance (updates happening
> > concurrently on many nodes) and the size of the data set. Partitioning
> > helps with client read scalability, but only for document reads, not view
> > queries. Partitioning alone could reduce reliability, depending how
> > tolerant you are to missing portions of the database.
> >
> > Clustering would primarily address database reliability (failover) and
> > client read scalability for docs and views. Clustering doesn't help much
> > with write performance because even if you spread out the update load, the
> > replication as the cluster syncs up means every node gets the update
> > anyway. It might be useful to deal with update spikes, where you get a
> > bunch of updates at once and can wait for the replication delay to get
> > everyone synced back up.
> >
> > For a really big, really reliable database, I'd have clusters of
> > partitions, where the database is partitioned N ways, and each partition
> > has at least M identical cluster members. Increase N for larger databases
> > and update load, M for higher availability and read load.
> >
>
> Thanks for the clarification.
>
> Can you say anything about how you see rebalancing working?
>
>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>


Re: Partitioned Clusters

2009-02-20 Thread Chris Anderson
On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz  wrote:
>
> On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
>
>>> Trees would be overkill except for very large clusters.
>>>
>>> With CouchDB map views, you need to combine results from every node in a
>>> big merge sort. If you combine all results at a single node, the single
>>> client's ability to simultaneously pull data and sort data from all other
>>> nodes may become the bottleneck. So to parallelize, you have multiple nodes
>>> doing a merge sort of sub-nodes, then sending those results to another node
>>> to be combined further, etc. The same with the reduce views, but instead of
>>> a merge sort it's just rereducing results. The natural "shape" of that
>>> computation is a tree, with only the final root node at the top being the
>>> bottleneck, but now it has to maintain connections and merge the sort
>>> values from far fewer nodes.
>>>
>>> -Damien
>>
>> That makes sense and it clarifies one of my questions about this topic. Is
>> the goal of partitioned clustering to increase performance for very large
>> data sets, or to increase reliability? It would seem from this answer that
>> the goal is to increase query performance by distributing the query
>> processing, and not to increase reliability.
>
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same,
> data (they might be behind on replicating changes, but otherwise they have
> the same data).
>
> Partitioning would primarily increase write performance (updates happening
> concurrently on many nodes) and the size of the data set. Partitioning helps
> with client read scalability, but only for document reads, not view queries.
> Partitioning alone could reduce reliability, depending how tolerant you are
> to missing portions of the database.
>
> Clustering would primarily address database reliability (failover) and
> client read scalability for docs and views. Clustering doesn't help much
> with write performance because even if you spread out the update load, the
> replication as the cluster syncs up means every node gets the update anyway.
> It might be useful to deal with update spikes, where you get a bunch of
> updates at once and can wait for the replication delay to get everyone
> synced back up.
>
> For a really big, really reliable database, I'd have clusters of partitions,
> where the database is partitioned N ways, and each partition has at least M
> identical cluster members. Increase N for larger databases and update load,
> M for higher availability and read load.
>

Thanks for the clarification.

Can you say anything about how you see rebalancing working?



-- 
Chris Anderson
http://jchris.mfdz.com


Re: Partitioned Clusters

2009-02-20 Thread Chris Anderson
On Fri, Feb 20, 2009 at 1:37 PM, Stefan Karpinski
 wrote:
>
>  That makes sense and it clarifies one of my questions about this topic. Is
> the goal of partitioned clustering to increase performance for very large
> data sets, or to increase reliability? It would seem from this answer that
> the goal is to increase query performance by distributing the query
> processing, and not to increase reliability.


Data redundancy is taken care of orthogonally to partitioning. Each
node will be able to handle maintaining N hot-failover backups.
Whether the database is hosted on a single large node or partitioned
among many small ones, the redundancy story is the same.

Partitioning becomes useful when either the total update rate is
greater than the hard-disk throughput on a single node, or the stored
capacity is better managed by multiple disks.
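
As a back-of-the-envelope illustration of the update-rate case (every number
below is made up, just to show the shape of the calculation):

    # Assumed, illustrative figures - not measurements:
    node_write_throughput = 50 * 1024 * 1024   # ~50 MB/s sustained per node
    bytes_per_update      = 8 * 1024           # doc + index churn per update
    target_updates_per_s  = 40000

    per_node_updates = node_write_throughput / bytes_per_update   # ~6500/s
    partitions = -(-target_updates_per_s // per_node_updates)     # ceiling
    print(int(partitions))   # 7 partitions under these assumptions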

By spreading write load across nodes you can achieve greater
throughput. The view queries must be sent to every node, so having
docs partitioned also allows views to be calculated in parallel. It
will be interesting to see if it makes sense to partition small
databases across hundreds of nodes in the interest of performance.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: Partitioned Clusters

2009-02-20 Thread Damien Katz


On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:

>> Trees would be overkill except for very large clusters.
>>
>> With CouchDB map views, you need to combine results from every node in a
>> big merge sort. If you combine all results at a single node, the single
>> client's ability to simultaneously pull data and sort data from all other
>> nodes may become the bottleneck. So to parallelize, you have multiple nodes
>> doing a merge sort of sub-nodes, then sending those results to another node
>> to be combined further, etc. The same with the reduce views, but instead of
>> a merge sort it's just rereducing results. The natural "shape" of that
>> computation is a tree, with only the final root node at the top being the
>> bottleneck, but now it has to maintain connections and merge the sort
>> values from far fewer nodes.
>>
>> -Damien
>
> That makes sense and it clarifies one of my questions about this topic. Is
> the goal of partitioned clustering to increase performance for very large
> data sets, or to increase reliability? It would seem from this answer that
> the goal is to increase query performance by distributing the query
> processing, and not to increase reliability.

I see partitioning and clustering as 2 different things. Partitioning is
data partitioning, spreading the data out across nodes, no node having the
complete database. Clustering is nodes having the same, or nearly the same,
data (they might be behind on replicating changes, but otherwise they have
the same data).

Partitioning would primarily increase write performance (updates happening
concurrently on many nodes) and the size of the data set. Partitioning helps
with client read scalability, but only for document reads, not view queries.
Partitioning alone could reduce reliability, depending how tolerant you are
to missing portions of the database.

Clustering would primarily address database reliability (failover) and
client read scalability for docs and views. Clustering doesn't help much
with write performance because even if you spread out the update load, the
replication as the cluster syncs up means every node gets the update anyway.
It might be useful to deal with update spikes, where you get a bunch of
updates at once and can wait for the replication delay to get everyone
synced back up.

For a really big, really reliable database, I'd have clusters of partitions,
where the database is partitioned N ways, and each partition has at least M
identical cluster members. Increase N for larger databases and update load,
M for higher availability and read load.

-Damien


Re: Partitioned Clusters

2009-02-20 Thread Stefan Karpinski
>
> Trees would be overkill except for very large clusters.
>

> With CouchDB map views, you need to combine results from every node in a
> big merge sort. If you combine all results at a single node, the single
> client's ability to simultaneously pull data and sort data from all other
> nodes may become the bottleneck. So to parallelize, you have multiple nodes
> doing a merge sort of sub-nodes, then sending those results to another node
> to be combined further, etc. The same with the reduce views, but
> instead of a merge sort it's just rereducing results. The natural "shape" of
> that computation is a tree, with only the final root node at the top being
> the bottleneck, but now it has to maintain connections and merge the sort
> values from far fewer nodes.
>
> -Damien


 That makes sense and it clarifies one of my questions about this topic. Is
the goal of partitioned clustering to increase performance for very large
data sets, or to increase reliability? It would seem from this answer that
the goal is to increase query performance by distributing the query
processing, and not to increase reliability.


Re: Partitioned Clusters

2009-02-20 Thread Damien Katz


On Feb 20, 2009, at 1:55 PM, Stefan Karpinski wrote:


> Hi, I thought I'd introduce myself since I'm new here on the couchdb
> list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
> Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
> PhD in computer networking at the moment. So I guess I've thought
> about this kind of stuff a bit ;-)
>
> I'm curious what the motivation behind a tree topology is. Not that
> it's not a viable approach, just why that and not a load-balancer in
> front of a bunch of "leaves" with lateral propagation between the
> leaves? Why should the load-balancing/proxying/caching node even be
> running couchdb?
>
> One reason I can see for a tree topology would be the hierarchical
> cache effect. But that would likely only make sense in certain
> circumstances. Being able to configure the topology to meet various
> needs, rather than enforcing one particular topology makes more sense
> to me overall.

Trees would be overkill except for very large clusters.

With CouchDB map views, you need to combine results from every node in a
big merge sort. If you combine all results at a single node, the single
client's ability to simultaneously pull data and sort data from all other
nodes may become the bottleneck. So to parallelize, you have multiple nodes
doing a merge sort of sub-nodes, then sending those results to another node
to be combined further, etc. The same with the reduce views, but instead of
a merge sort it's just rereducing results. The natural "shape" of that
computation is a tree, with only the final root node at the top being the
bottleneck, but now it has to maintain connections and merge the sort
values from far fewer nodes.
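
To make that tree shape concrete, a minimal sketch of the merge step, assuming
each child (leaf or inner proxy) already returns its map rows sorted by key;
the row format is illustrative, not CouchDB's wire format:

    import heapq

    def merge_sorted_rows(*children):
        # k-way merge of already-sorted (key, value) streams; memory use is
        # proportional to the number of children, not the number of rows.
        return heapq.merge(*children, key=lambda row: row[0])

    leaf_a = iter([("apple", 1), ("mango", 2)])
    leaf_b = iter([("banana", 5), ("pear", 1)])
    inner  = merge_sorted_rows(leaf_a, leaf_b)                 # an inner proxy node
    root   = merge_sorted_rows(inner, iter([("cherry", 3)]))   # merged again above
    print(list(root))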


-Damien



Re: Partitioned Clusters

2009-02-20 Thread Chris Anderson
On Fri, Feb 20, 2009 at 10:55 AM, Stefan Karpinski
 wrote:
> Hi, I thought I'd introduce myself since I'm new here on the couchdb
> list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
> Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
> PhD in computer networking at the moment. So I guess I've thought
> about this kind of stuff a bit ;-)

Glad to have you with us. :)

>
> I'm curious what the motivation behind a tree topology is. Not that
> it's not a viable approach, just why that and not a load-balancer in
> front of a bunch of "leaves" with lateral propagation between the
> leaves? Why should the load-balancing/proxying/caching node even be
> running couchdb?

The reason to write the proxies in Erlang is that they can avoid the
JSON and HTTP overhead until the final stage, as well as use Erlang's
inter-node communication and process management mojo.

The tree structure also provides a nice mapping onto the existing
reduce implementation. Inner nodes can store the reduction values for
their leaf nodes and run the reduce function to come up with total
values.
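
A toy version of that re-reduce step, using a sum reduce; the
(keys, values, rereduce) signature mirrors CouchDB's reduce convention, but the
storage of partial values at inner nodes is only sketched here:

    def reduce_fn(keys, values, rereduce):
        # CouchDB-style sum reduce: on rereduce, 'values' are partial results.
        return sum(values)

    # Reduction values already computed and stored for three leaf nodes:
    leaf_values = [42, 17, 8]

    # An inner node combines them without revisiting the raw rows:
    print(reduce_fn(None, leaf_values, rereduce=True))   # 67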

>
> One reason I can see for a tree topology would be the hierarchical
> cache effect. But that would likely only make sense in certain
> circumstances. Being able to configure the topology to meet various
> needs, rather than enforcing one particular topology makes more sense
> to me overall.

I agree - as Ben points out the flat topology is just a special case
of the tree (and would probably be ideal for anything less than
hundreds of nodes).

I'm not an expert on cluster layout, but the tree structure appeals to
me mostly because changes to subtrees don't need to be propagated to
the cluster root.

That said, there's *plenty* that can be done with HTTP proxies (and
probably implemented more quickly) so it's probably the best way to
prototype any of these implementations.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: Partitioned Clusters

2009-02-20 Thread Stefan Karpinski
Hi, I thought I'd introduce myself since I'm new here on the couchdb
list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
PhD in computer networking at the moment. So I guess I've thought
about this kind of stuff a bit ;-)

I'm curious what the motivation behind a tree topology is. Not that
it's not a viable approach, just why that and not a load-balancer in
front of a bunch of "leaves" with lateral propagation between the
leaves? Why should the load-balancing/proxying/caching node even be
running couchdb?

One reason I can see for a tree topology would be the hierarchical
cache effect. But that would likely only make sense in certain
circumstances. Being able to configure the topology to meet various
needs, rather than enforcing one particular topology makes more sense
to me overall.

On 2/20/09, Robert Newson  wrote:
> Any thoughts as to how (or even if) this tree-wise result aggregation
> would work for externals?
>
> I'm thinking specifically about couchdb-lucene, where multi-node
> results aggregation is possible, given a framework like you propose
> here. The results that couchdb-lucene produces can already be
> aggregated, assuming there's a hook for the merge function (actually,
> perhaps it's exactly reduce-shaped)...
>
> B.
>
> On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson  wrote:
>> On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning  wrote:
>>> Overall the model sounds very similar to what I was thinking. I just
>>> have a few comments.
>>>
>>>> In this model documents are saved to a leaf node depending on a hash
>>>> of the docid. This means that lookups are easy, and need only to touch
>>>> the leaf node which holds the doc. Redundancy can be provided by
>>>> maintaining R replicas of every leaf node.
>>>
>>> There are several use-cases where a true hash of the docid won't be the
>>> optimal partitioning key. The simple case is where you want to partition
>>> your data by user and in most non-trivial cases you won't be storing
>>> all of a user's data under one document with the user's id as the docid.
>>> A fairly simple solution would be allowing the developer to specify a
>>> javascript
>>> function somewhere (not sure where this should live...) that takes a
>>> docid and
>>> spits out a partition key. Then I could just prefix all my doc ids for
>>> a specific user
>>> with that user's id and write the appropriate partition function.
>>>

>>>> View queries, on the other hand, must be handled by every node. The
>>>> requests are proxied down the tree to leaf nodes, which respond
>>>> normally. Each proxy node then runs a merge sort algorithm (which can
>>>> sort in constant space proportional to # of input streams) on the view
>>>> results. This can happen recursively if the tree is deep.
>>>
>>> If the developer has control over partition keys as suggested above, it's
>>> entirely possible to have applications where view queries only need data
>>> from one partition. It would be great if we could do something smart here
>>> or
>>> have a way for the developer to indicate to Couch that all the data
>>> should
>>> be on only one partition.
>>>
>>> These are just nice-to-have features and the described cluster setup
>>> could
>>> still be extremely useful without them.
>>
>> I think they are both sensible optimizations. Damien's described the
>> JS partition function before on IRC, so I think it fits into the
>> model. As far as restricting view queries to just those docs within a
>> particular id range, it might make sense to partition by giving each
>> user their own database, rather than logic on the docid. In the case
>> where you need data in a single db, but still have some queries that
>> can be partitioned, it's still a good optimization. Luckily even in the
>> unoptimized case, if a node has no rows to contribute to the final
>> view result then it should have a low impact on total resources needed
>> to generate the result.
>>
>> Chris
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>>
>


Re: Partitioned Clusters

2009-02-20 Thread Robert Newson
Any thoughts as to how (or even if) this tree-wise result aggregation
would work for externals?

I'm thinking specifically about couchdb-lucene, where multi-node
results aggregation is possible, given a framework like you propose
here. The results that couchdb-lucene produces can already be
aggregated, assuming there's a hook for the merge function (actually,
perhaps it's exactly reduce-shaped)...

B.

On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson  wrote:
> On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning  wrote:
>> Overall the model sounds very similar to what I was thinking. I just
>> have a few comments.
>>
>>> In this model documents are saved to a leaf node depending on a hash
>>> of the docid. This means that lookups are easy, and need only to touch
>>> the leaf node which holds the doc. Redundancy can be provided by
>>> maintaining R replicas of every leaf node.
>>
>> There are several use-cases where a true hash of the docid won't be the
>> optimal partitioning key. The simple case is where you want to partition
>> your data by user and in most non-trivial cases you won't be storing
>> all of a user's data under one document with the user's id as the docid.
>> A fairly simple solution would be allowing the developer to specify a 
>> javascript
>> function somewhere (not sure where this should live...) that takes a docid 
>> and
>> spits out a partition key. Then I could just prefix all my doc ids for
>> a specific user
>> with that user's id and write the appropriate partition function.
>>
>>>
>>> View queries, on the other hand, must be handled by every node. The
>>> requests are proxied down the tree to leaf nodes, which respond
>>> normally. Each proxy node then runs a merge sort algorithm (which can
>>> sort in constant space proportional to # of input streams) on the view
>>> results. This can happen recursively if the tree is deep.
>>
>> If the developer has control over partition keys as suggested above, it's
>> entirely possible to have applications where view queries only need data
>> from one partition. It would be great if we could do something smart here or
>> have a way for the developer to indicate to Couch that all the data should
>> be on only one partition.
>>
>> These are just nice-to-have features and the described cluster setup could
>> still be extremely useful without them.
>
> I think they are both sensible optimizations. Damien's described the
> JS partition function before on IRC, so I think it fits into the
> model. As far as restricting view queries to just those docs within a
> particular id range, it might make sense to partition by giving each
> user their own database, rather than logic on the docid. In the case
> where you need data in a single db, but still have some queries that
> can be partitioned, it's still a good optimization. Luckily even in the
> unoptimized case, if a node has no rows to contribute to the final
> view result then it should have a low impact on total resources needed
> to generate the result.
>
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>


Re: Partitioned Clusters

2009-02-19 Thread Chris Anderson
On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning  wrote:
> Overall the model sounds very similar to what I was thinking. I just
> have a few comments.
>
>> In this model documents are saved to a leaf node depending on a hash
>> of the docid. This means that lookups are easy, and need only to touch
>> the leaf node which holds the doc. Redundancy can be provided by
>> maintaining R replicas of every leaf node.
>
> There are several use-cases where a true hash of the docid won't be the
> optimal partitioning key. The simple case is where you want to partition
> your data by user and in most non-trivial cases you won't be storing
> all of a user's data under one document with the user's id as the docid.
> A fairly simple solution would be allowing the developer to specify a 
> javascript
> function somewhere (not sure where this should live...) that takes a docid and
> spits out a partition key. Then I could just prefix all my doc ids for
> a specific user
> with that user's id and write the appropriate partition function.
>
>>
>> View queries, on the other hand, must be handled by every node. The
>> requests are proxied down the tree to leaf nodes, which respond
>> normally. Each proxy node then runs a merge sort algorithm (which can
>> sort in constant space proportional to # of input streams) on the view
>> results. This can happen recursively if the tree is deep.
>
> If the developer has control over partition keys as suggested above, it's
> entirely possible to have applications where view queries only need data
> from one partition. It would be great if we could do something smart here or
> have a way for the developer to indicate to Couch that all the data should
> be on only one partition.
>
> These are just nice-to-have features and the described cluster setup could
> still be extremely useful without them.

I think they are both sensible optimizations. Damien's described the
JS partition function before on IRC, so I think it fits into the
model. As far as restricting view queries to just those docs within a
particular id range, it might make sense to partition by giving each
user their own database, rather than logic on the docid. In the case
where you need data in a single db, but still have some queries that
can be partitioned, it's still a good optimization. Luckily even in the
unoptimized case, if a node has no rows to contribute to the final
view result then it should have a low impact on total resources needed
to generate the result.

Chris

-- 
Chris Anderson
http://jchris.mfdz.com


Re: Partitioned Clusters

2009-02-19 Thread Ben Browning
Overall the model sounds very similar to what I was thinking. I just
have a few comments.

> In this model documents are saved to a leaf node depending on a hash
> of the docid. This means that lookups are easy, and need only to touch
> the leaf node which holds the doc. Redundancy can be provided by
> maintaining R replicas of every leaf node.

There are several use-cases where a true hash of the docid won't be the
optimal partitioning key. The simple case is where you want to partition
your data by user and in most non-trivial cases you won't be storing
all of a user's data under one document with the user's id as the docid.
A fairly simple solution would be allowing the developer to specify a javascript
function somewhere (not sure where this should live...) that takes a docid and
spits out a partition key. Then I could just prefix all my doc ids for a
specific user with that user's id and write the appropriate partition function.
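
The thread imagines this hook as JavaScript; purely to show the shape, a sketch
in Python, where the docid prefix convention and the function names are made up:

    import hashlib

    def partition_key(docid):
        # Hypothetical hook: derive the partition key from the docid.
        # Doc ids are assumed to look like "<userid>:<rest>".
        return docid.split(":", 1)[0]

    def node_for(docid, nodes):
        key = partition_key(docid)
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return nodes[h % len(nodes)]

    nodes = ["node-a", "node-b", "node-c"]
    # All of one user's docs land on the same partition:
    print(node_for("alice:settings", nodes), node_for("alice:post-42", nodes))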

>
> View queries, on the other hand, must be handled by every node. The
> requests are proxied down the tree to leaf nodes, which respond
> normally. Each proxy node then runs a merge sort algorithm (which can
> sort in constant space proportional to # of input streams) on the view
> results. This can happen recursively if the tree is deep.

If the developer has control over partition keys as suggested above, it's
entirely possible to have applications where view queries only need data
from one partition. It would be great if we could do something smart here or
have a way for the developer to indicate to Couch that all the data should
be on only one partition.

These are just nice-to-have features and the described cluster setup could
still be extremely useful without them.

The tree setup sounds interesting but I wonder how it would compare in
latency to a flat setup with the same number of leaf nodes. As long as the
developer can control the tree structure (# of children per parent) then this
concern shouldn't be an issue.
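
For a rough feel of the latency trade-off (fan-out and node counts here are
arbitrary): a flat proxy reaches every leaf in one hop, while a tree adds one
merge hop per level, so hops grow with the logarithm of the leaf count:

    def tree_depth(leaves, fanout):
        # Levels of proxies needed so that fanout**depth covers all leaves.
        depth = 1
        while fanout ** depth < leaves:
            depth += 1
        return depth

    for leaves in (10, 100, 1000):
        print(leaves, "leaves: flat = 1 hop, tree(fanout=10) =",
              tree_depth(leaves, 10), "hops")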

- Ben