Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 7:34 PM, Chris Anderson wrote:
> I think so. I think that there could be proxy overlap / redundancy
> across all levels of the tree, and also in the case of a flat tree.
>
> As long as the proxies agree on how to hash from URLs to nodes it
> should just work.

I've been thinking about how to address the issue of allowing different configurations for different needs. I think if all we do is tell a proxy node who its children are, how to map IDs to those children, and allow a proxy to also be a node, we can handle almost any configuration. Examples:

* All Peers - 2 nodes in the system, A & B. A is configured so odd IDs map to A and even IDs map to B. B is configured with the same ID ranges. You can load-balance across nodes A & B and take advantage of increased write throughput. This is probably the simplest clustering scenario, for people who don't have enough traffic to fully utilize a standalone proxy node.

* 1 or more proxies, multiple nodes - The proxies are all configured identically to map document IDs among nodes A-J. Nodes A-J know nothing about each other or their parents. In this scenario you can very easily add proxy nodes as needed to handle the increased load when aggregating results from more nodes.

* Tree structure - The top-level proxies are configured to map document IDs to nodes. These nodes may in fact be other proxies, which are then configured to map to their own nodes. Except for the multiple levels of proxies, this is the same as the above scenario.

Does it sound reasonable to expect a proxy to be aware of its children but not vice versa? In an actual implementation I see the list of children and their mappings being stored in a document, so that it could be updated while running to add/remove children. Adding a child in this scenario would involve choosing an ID range, replicating the relevant data from the other children, and updating this mapping.
This would depend on partial replication to replicate only the data needed for the new child. I don't see this as something that's too complex - the only issue I see is that you'll probably need to replicate data at least twice: once before the proxy mapping is updated, and once after, to pick up any final data that was written to the other children since the first replication. This also assumes you've chosen a consistent hashing algorithm, so that the data on all nodes doesn't have to change when adding a single new node. Removing a child node would be the opposite process.

I could foresee us coming up with a tool to automate most if not all of this process, possibly only requiring the user to start the new CouchDB server, fill in some values in Futon for ID mappings, and press a button.

Sound reasonable?

- Ben
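The consistent-hashing assumption above can be sketched roughly like this (a minimal ring; node names and the use of MD5 are illustrative, not anything the thread specifies). Each node owns a point on a hash ring and a docid belongs to the first node clockwise from its hash, so adding a child only takes over the keys between it and its predecessor - the other children's data stays put, which is why only the new child needs data replicated to it:

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns one point on the ring (a real system would
        # likely use virtual nodes for smoother balance).
        self.ring = sorted((_h(n), n) for n in nodes)

    def node_for(self, docid):
        # First node clockwise from the docid's hash, wrapping around.
        keys = [k for k, _ in self.ring]
        i = bisect.bisect(keys, _h(docid)) % len(self.ring)
        return self.ring[i][1]

    def add(self, node):
        # New node takes over only the keys between its predecessor's
        # point and its own; everything else keeps its old owner.
        bisect.insort(self.ring, (_h(node), node))
```

After `add("node-d")`, every docid either keeps its previous node or moves to node-d - no data shuffles between the existing children.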
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 4:15 PM, Mike Malone wrote:
> Hi, I don't think I've commented on this list before so let me briefly
> introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer
> (primarily web dev) and have some experience working with large clustered
> databases. I worked for Pownce.com, but moved to Six Apart when they
> acquired Pownce in November 2008.
>
> I like the idea of a tree-structure since it's simple to understand and
> implement, but I think there may be cases where having multiple top-level
> proxies may make sense.

I think so. I think that there could be proxy overlap / redundancy across all levels of the tree, and also in the case of a flat tree.

As long as the proxies agree on how to hash from URLs to nodes it should just work.

--
Chris Anderson
http://jchris.mfdz.com
Re: Partitioned Clusters
Hi, I don't think I've commented on this list before so let me briefly introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer (primarily web dev) and have some experience working with large clustered databases. I worked for Pownce.com, but moved to Six Apart when they acquired Pownce in November 2008. I like the idea of a tree-structure since it's simple to understand and implement, but I think there may be cases where having multiple top-level proxies may make sense. As Damien pointed out, the top-level proxies will need to re-reduce / merge the documents from each partition, which may become a bottleneck. Damien pointed out how a tree structure would help to mitigate this problem by moving some of the work to sub-nodes. But couldn't you also add additional top-level proxies (with clients randomly choosing one to communicate with) to increase capacity without requiring a tree structure. This would also remove the top-level proxy as a single point of failure for the system. Mike On Fri, Feb 20, 2009 at 2:55 PM, Chris Anderson wrote: > On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz wrote: > > > > On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote: > > > >>> > >>> Trees would be overkill except for with very large clusters. > >>> > >> > >>> With CouchDB map views, you need to combine results from every node in > a > >>> big merge sort. If you combine all results at a single node, the single > >>> clients ability to simultaneously pull data and sort data from all > other > >>> nodes may become the bottleneck. So to parallelize, you have multiple > >>> nodes > >>> doing a merge sort of sub nodes , then sending those results to another > >>> node > >>> to be combined further, etc. The same with with the reduce views, but > >>> instead of a merge sort it's just rereducing results. 
The natural > "shape" > >>> of > >>> that computation is a tree, with only the final root node at the top > >>> being > >>> the bottleneck, but now it has to maintain connections and merge the > sort > >>> values from far fewer nodes. > >>> > >>> -Damien > >> > >> > >> That makes sense and it clarifies one of my questions about this topic. > Is > >> the goal of partitioned clustering to increase performance for very > large > >> data sets, or to increase reliability? It would seem from this answere > >> that > >> the goal is to increase query performance by distributing the query > >> processing, and not to increase reliability. > > > > > > I see partitioning and clustering as 2 different things. Partitioning is > > data partitioning, spreading the data out across nodes, no node having > the > > complete database. Clustering is nodes having the same, or nearly the > same > > data (they might be behind on replicating changes, but otherwise they > have > > the same data). > > > > Partitioning would primarily increase write performance (updates > happening > > concurrently on many nodes) and the size of the data set. Partitioning > helps > > with client read scalability, but only for document reads, not views > > queries. Partitioning alone could reduce reliability, depending how > tolerant > > you are to missing portions of the database. > > > > Clustering would primarily address database reliability (failover), > address > > client read scalability for docs and views. Clustering doesn't help much > > with write performance because even if you spread out the update load, > the > > replication as the cluster syncs up means every node gets the update > anyway. > > It might be useful to deal with update spikes, where you get a bunch of > > updates at once and can wait for the replication delay to get everyone > > synced back up. 
> > > > For really big, really reliable database, I'd have clusters of > partitions, > > where the database is partitioned N ways, each each partition have at > least > > M identical cluster members. Increase N for larger databases and update > > load, M for higher availability and read load. > > > > Thanks for the clarification. > > Can you say anything about how you see rebalancing working? > > > > -- > Chris Anderson > http://jchris.mfdz.com >
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz wrote:
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).
>
> [remainder trimmed; see Damien's message below]
>
> For really big, really reliable databases, I'd have clusters of partitions,
> where the database is partitioned N ways, with each partition having at
> least M identical cluster members. Increase N for larger databases and
> update load, M for higher availability and read load.

Thanks for the clarification.

Can you say anything about how you see rebalancing working?

--
Chris Anderson
http://jchris.mfdz.com
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 1:37 PM, Stefan Karpinski wrote:
> That makes sense and it clarifies one of my questions about this topic. Is
> the goal of partitioned clustering to increase performance for very large
> data sets, or to increase reliability? It would seem from this answer that
> the goal is to increase query performance by distributing the query
> processing, and not to increase reliability.

Data redundancy is taken care of orthogonally to partitioning. Each node will be able to handle maintaining N hot-failover backups. Whether the database is hosted on a single large node or partitioned among many small ones, the redundancy story is the same.

Partitioning becomes useful when either the total update rate is greater than the hard-disk throughput of a single node, or the stored capacity is better managed by multiple disks. By spreading write load across nodes you can achieve greater throughput. Since view queries must be sent to every node, having docs partitioned also allows views to be calculated in parallel.

It will be interesting to see if it makes sense to partition small databases across hundreds of nodes in the interest of performance.

Chris

--
Chris Anderson
http://jchris.mfdz.com
Re: Partitioned Clusters
On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
> That makes sense and it clarifies one of my questions about this topic. Is
> the goal of partitioned clustering to increase performance for very large
> data sets, or to increase reliability? It would seem from this answer that
> the goal is to increase query performance by distributing the query
> processing, and not to increase reliability.

I see partitioning and clustering as 2 different things. Partitioning is data partitioning, spreading the data out across nodes, no node having the complete database. Clustering is nodes having the same, or nearly the same, data (they might be behind on replicating changes, but otherwise they have the same data).

Partitioning would primarily increase write performance (updates happening concurrently on many nodes) and the size of the data set. Partitioning helps with client read scalability, but only for document reads, not view queries. Partitioning alone could reduce reliability, depending on how tolerant you are to missing portions of the database.

Clustering would primarily address database reliability (failover) and client read scalability for docs and views. Clustering doesn't help much with write performance because even if you spread out the update load, the replication as the cluster syncs up means every node gets the update anyway. It might be useful for dealing with update spikes, where you get a bunch of updates at once and can wait for the replication delay to get everyone synced back up.

For a really big, really reliable database, I'd have clusters of partitions, where the database is partitioned N ways, with each partition having at least M identical cluster members. Increase N for larger databases and update load, M for higher availability and read load.

-Damien
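The clusters-of-partitions layout described above might route requests roughly like this (a sketch only - node names, the MD5 hash, and random replica selection are all illustrative assumptions, not anything CouchDB prescribes):

```python
import hashlib
import random

# N partitions, each backed by M identical cluster members.
N, M = 4, 3
cluster = [[f"part{p}-replica{r}" for r in range(M)] for p in range(N)]

def partition_of(docid):
    """Hash the docid to one of the N partitions."""
    return int(hashlib.md5(docid.encode()).hexdigest(), 16) % N

def read_node(docid):
    # Any replica in the partition can serve a document read,
    # which is where clustering's read scalability comes from.
    return random.choice(cluster[partition_of(docid)])

def write_nodes(docid):
    # An update lands on one member and replicates to its peers, so
    # eventually every one of the M replicas in the partition has it -
    # which is why clustering doesn't add write throughput.
    return cluster[partition_of(docid)]
```

Increasing N grows capacity for data and update load; increasing M grows availability and read capacity, matching the trade-off in the message above.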
Re: Partitioned Clusters
> Trees would be overkill except for with very large clusters.
>
> With CouchDB map views, you need to combine results from every node in a
> big merge sort. If you combine all results at a single node, the single
> client's ability to simultaneously pull data and sort data from all other
> nodes may become the bottleneck. So to parallelize, you have multiple nodes
> doing a merge sort of sub-nodes, then sending those results to another node
> to be combined further, etc. The same goes for the reduce views, but
> instead of a merge sort it's just rereducing results. The natural "shape"
> of that computation is a tree, with only the final root node at the top
> being the bottleneck, but now it has to maintain connections and merge the
> sort values from far fewer nodes.
>
> -Damien

That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability.
Re: Partitioned Clusters
On Feb 20, 2009, at 1:55 PM, Stefan Karpinski wrote:
> Hi, I thought I'd introduce myself since I'm new here on the couchdb
> list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
> Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
> PhD in computer networking at the moment. So I guess I've thought
> about this kind of stuff a bit ;-)
>
> I'm curious what the motivation behind a tree topology is. Not that
> it's not a viable approach, just why that and not a load-balancer in
> front of a bunch of "leaves" with lateral propagation between the
> leaves? Why should the load-balancing/proxying/caching node even be
> running couchdb?
>
> One reason I can see for a tree topology would be the hierarchical
> cache effect. But that would likely only make sense in certain
> circumstances. Being able to configure the topology to meet various
> needs, rather than enforcing one particular topology, makes more sense
> to me overall.

Trees would be overkill except for with very large clusters.

With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, the single client's ability to simultaneously pull data and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural "shape" of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes.

-Damien
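The merge step Damien describes - each proxy combining already-sorted view rows from its children - can be sketched with a heap-based k-way merge, which keeps only one pending row per input stream in memory (the row shape here is illustrative; it assumes child streams arrive sorted by key, as CouchDB view results are):

```python
import heapq

def merge_view_rows(*child_streams):
    """Merge sorted per-child view result streams into one sorted
    stream, holding O(number of streams) rows at a time."""
    yield from heapq.merge(*child_streams, key=lambda row: row["key"])

# Two hypothetical children, each returning rows sorted by key.
child_a = [{"key": 1, "value": "a"}, {"key": 4, "value": "d"}]
child_b = [{"key": 2, "value": "b"}, {"key": 3, "value": "c"}]
merged = list(merge_view_rows(iter(child_a), iter(child_b)))
```

Applied recursively up the tree, each inner node does this over its own children, so the root only ever merges a handful of streams.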
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 10:55 AM, Stefan Karpinski wrote:
> Hi, I thought I'd introduce myself since I'm new here on the couchdb
> list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
> Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
> PhD in computer networking at the moment. So I guess I've thought
> about this kind of stuff a bit ;-)

Glad to have you with us. :)

> I'm curious what the motivation behind a tree topology is. Not that
> it's not a viable approach, just why that and not a load-balancer in
> front of a bunch of "leaves" with lateral propagation between the
> leaves? Why should the load-balancing/proxying/caching node even be
> running couchdb?

The reason to write the proxies in Erlang is that they can avoid the JSON and HTTP overhead until the final stage, as well as use Erlang's inter-node communication and process management mojo. The tree structure also provides a nice mapping onto the existing reduce implementation: inner nodes can store the reduction values for their leaf nodes and run the reduce function to come up with total values.

> One reason I can see for a tree topology would be the hierarchical
> cache effect. But that would likely only make sense in certain
> circumstances. Being able to configure the topology to meet various
> needs, rather than enforcing one particular topology, makes more sense
> to me overall.

I agree - as Ben points out, the flat topology is just a special case of the tree (and would probably be ideal for anything less than hundreds of nodes). I'm not an expert on cluster layout, but the tree structure appeals to me mostly because changes to subtrees don't need to be propagated to the cluster root.

That said, there's *plenty* that can be done with HTTP proxies (and probably implemented more quickly), so that's probably the best way to prototype any of these implementations.

Chris

--
Chris Anderson
http://jchris.mfdz.com
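The "inner nodes rereduce" idea maps onto the convention CouchDB reduce functions already follow, where the same function is called with rereduce=false over map rows and rereduce=true over previously reduced values. A minimal sketch (the sum reduce and node layout are just examples):

```python
def reduce_fn(keys, values, rereduce):
    """A CouchDB-style reduce function. For a sum, the initial reduce
    and the rereduce happen to be the same operation."""
    return sum(values)

# Each leaf node reduces its own view rows...
leaf_results = [reduce_fn(None, rows, False)
                for rows in ([1, 2, 3], [10, 20], [100])]

# ...and an inner (proxy) node rereduces the partial values it
# receives from its children, without ever seeing the raw rows.
total = reduce_fn(None, leaf_results, True)
```

This is what lets inner nodes in the tree combine results cheaply: they only handle one partial value per child, not the underlying rows.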
Re: Partitioned Clusters
Hi, I thought I'd introduce myself since I'm new here on the couchdb list. I'm Stefan Karpinski. I've worked in the Monitoring Group at Akamai, Operations R&D at Citrix Online, and I'm nearly done with a PhD in computer networking at the moment. So I guess I've thought about this kind of stuff a bit ;-)

I'm curious what the motivation behind a tree topology is. Not that it's not a viable approach, just why that and not a load-balancer in front of a bunch of "leaves" with lateral propagation between the leaves? Why should the load-balancing/proxying/caching node even be running couchdb?

One reason I can see for a tree topology would be the hierarchical cache effect. But that would likely only make sense in certain circumstances. Being able to configure the topology to meet various needs, rather than enforcing one particular topology, makes more sense to me overall.

On 2/20/09, Robert Newson wrote:
> Any thoughts as to how (or even if) this tree-wise result aggregation
> would work for externals?
>
> [remainder of quoted thread trimmed; see Robert's, Chris's, and Ben's
> messages below]
Re: Partitioned Clusters
Any thoughts as to how (or even if) this tree-wise result aggregation would work for externals?

I'm thinking specifically about couchdb-lucene, where multi-node results aggregation is possible, given a framework like you propose here. The results that couchdb-lucene produces can already be aggregated, assuming there's a hook for the merge function (actually, perhaps it's exactly reduce-shaped)...

B.

On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson wrote:
> On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning wrote:
> [quoted exchange trimmed; see Chris's and Ben's messages below]
Re: Partitioned Clusters
On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning wrote:
> There are several use-cases where a true hash of the docid won't be the
> optimal partitioning key. [...] A fairly simple solution would be allowing
> the developer to specify a javascript function somewhere (not sure where
> this should live...) that takes a docid and spits out a partition key.
>
> If the developer has control over partition keys as suggested above, it's
> entirely possible to have applications where view queries only need data
> from one partition. It would be great if we could do something smart here
> or have a way for the developer to indicate to Couch that all the data
> should be on only one partition.

I think they are both sensible optimizations. Damien's described the JS partition function before on IRC, so I think it fits into the model.

As far as restricting view queries to just those docs within a particular id range, it might make sense to partition by giving each user their own database, rather than putting logic on the docid. In the case where you need data in a single db but still have some queries that can be partitioned, it's still a good optimization. Luckily, even in the unoptimized case, if a node has no rows to contribute to the final view result then it should have a low impact on the total resources needed to generate the result.

Chris

--
Chris Anderson
http://jchris.mfdz.com
Re: Partitioned Clusters
Overall the model sounds very similar to what I was thinking. I just have a few comments.

> In this model documents are saved to a leaf node depending on a hash
> of the docid. This means that lookups are easy, and need only to touch
> the leaf node which holds the doc. Redundancy can be provided by
> maintaining R replicas of every leaf node.

There are several use-cases where a true hash of the docid won't be the optimal partitioning key. The simple case is where you want to partition your data by user, and in most non-trivial cases you won't be storing all of a user's data under one document with the user's id as the docid. A fairly simple solution would be allowing the developer to specify a javascript function somewhere (not sure where this should live...) that takes a docid and spits out a partition key. Then I could just prefix all my doc ids for a specific user with that user's id and write the appropriate partition function.

> View queries, on the other hand, must be handled by every node. The
> requests are proxied down the tree to leaf nodes, which respond
> normally. Each proxy node then runs a merge sort algorithm (which can
> sort in constant space proportional to # of input streams) on the view
> results. This can happen recursively if the tree is deep.

If the developer has control over partition keys as suggested above, it's entirely possible to have applications where view queries only need data from one partition. It would be great if we could do something smart here, or have a way for the developer to indicate to Couch that all the data should be on only one partition.

These are just nice-to-have features and the described cluster setup could still be extremely useful without them.

The tree setup sounds interesting, but I wonder how it would compare in latency to a flat setup with the same number of leaf nodes. As long as the developer can control the tree structure (# of children per parent), this concern shouldn't be an issue.

- Ben
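The docid-prefix idea above might look like the following. The proposal is for a developer-supplied JavaScript function; this Python sketch just illustrates the shape, and the "user/docname" id format, separator, and function names are all assumptions:

```python
import hashlib

def partition_key(docid):
    """Developer-supplied partition function: extract the user-id
    prefix from a docid like 'alice/post-1'."""
    return docid.split("/", 1)[0]

def node_for(docid, num_nodes):
    """Hash the partition key (not the whole docid) to pick a node."""
    key = partition_key(docid)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_nodes
```

Because the hash is taken over the partition key rather than the full docid, all of one user's documents land on the same node, so a view query scoped to that user would only need to hit one partition.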