Hi Peter,

It's great to hear that others have come to some of the same conclusions!
I think a CRUSH-like strategy for topologically aware replication/routing/locality is a great idea. I can see three mostly orthogonal sets of functionality that we're concerned with:

a) a virtual node partitioning scheme (to support heterogeneity and management simplicity)
b) topology-aware replication
c) topology-aware routing

First of all, while (c) depends on (b), it does not affect partitioning or replication directly, so I'm going to set it aside for the moment and talk just about the former two.

I'll summarise your design here, mainly to make sure that I understand it, but also to refer back to it:

1. The hash-space is partitioned into a fixed number of partitions.
2. The CRUSH algorithm is run - select(1, disk) - over the topology using each partition as a key, to get an assignment of partition -> physical host (primary).
2a. Adding or removing a node requires re-running CRUSH to recalculate the partition assignment (and move data).
3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology using each primary host id, to get an assignment of primary host -> RF-1 replicas.
3a. Adding or removing a node requires re-running CRUSH to recalculate the replica assignment (which might be a completely different set of hosts to before?).

Here are some thoughts. (Clarification: when I'm talking about buckets, I'm referring to the same concept as in the CRUSH paper.)

One of my concerns about using CRUSH exactly as described in the paper is that it seems to be sub-optimal in the amount of data it moves after the topology is modified. The authors of the paper introduce several "bucket types" (uniform, list, tree, straw) which appear to be various sub-optimal alternatives to consistent hashing, with various trade-offs. Why not use consistent hashing? Given (2a) and (3a), I think we might end up moving far too much data when the set of replicas changes completely for a given host.

Let's suppose we introduce our own bucket type called a "ring bucket".
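To make the two-level assignment in the summary above concrete, here is a minimal sketch. All names are hypothetical and `select` is only a stand-in for CRUSH's pseudo-random, weight-aware selection (implemented here as a simple highest-random-weight hash for brevity, ignoring weights and bucket hierarchy):

```python
# Sketch of steps 1-3 above: fixed partitions, partition -> primary,
# primary -> RF-1 further replicas. Not real CRUSH; "select" here is
# a deterministic rendezvous-style hash used purely for illustration.
import hashlib

def _score(key: str, node: str) -> int:
    return int.from_bytes(hashlib.md5(f"{key}:{node}".encode()).digest(), "big")

def select(key: str, n: int, nodes: list[str]) -> list[str]:
    """Deterministically pick n distinct nodes for the given key."""
    return sorted(nodes, key=lambda node: _score(key, node), reverse=True)[:n]

NUM_PARTITIONS = 8  # step 1: hash-space split into a fixed number of partitions
RF = 3              # replication factor
hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]

# Step 2: each partition is mapped to a primary host.
primaries = {p: select(f"partition-{p}", 1, hosts)[0]
             for p in range(NUM_PARTITIONS)}

# Step 3: each primary picks its RF-1 further replicas (step 2a/3a is then
# "re-run these comprehensions against the new host list and diff").
replicas = {p: [primary] + select(primary, RF - 1,
                                  [h for h in hosts if h != primary])
            for p, primary in primaries.items()}

for p, r in sorted(replicas.items()):
    print(p, r)
```

The point of the sketch is only that both mappings are pure functions of the topology, so a topology change means recomputing them and moving whatever data the diff implies - which is exactly where the data-movement concern below comes in.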
Each item in a ring bucket is assigned an equal, non-contiguous portion of the key hash-space, which determines which keys are assigned to it. When an item is added to the ring bucket, it takes an equal portion of the hash-space from every other item already in the bucket, and vice-versa for removals. It's easy to see that this ring bucket implements consistent hashing with some unspecified virtual node scheme. Additions and removals would be optimal: only Δw/W keys require moving when the topology changes.

Using this ring bucket in the CRUSH topology (with the hash function being the identity function) would give exactly the same distribution properties as the virtual node strategy I suggested previously, but of course with much better topology awareness. This makes it evident that the partitioning scheme and a CRUSH-like replication scheme are orthogonal concerns. In the same way as NTS currently uses the ring to provide distribution at DC and rack level - by conceptually separating the ring into a distinct logical ring for each DC - a CrushReplicationStrategy could use the ring as its bucketing function to distribute partitions in the topology.

This brings me on to (1) and the reasons for our choice of virtual node scheme - choose N random tokens - instead of the Dynamo-like scheme that you suggest, where the partitions are fixed in advance. With the Dynamo scheme, the size of a virtual node partition will only ever grow as more data is inserted. Since the number of partitions is fixed when the cluster is created, the partition size is unbounded. There are certain advantages to having a limit on partition size: streaming failures that cause retries do not have to resend so much data; streaming operations can be staggered in smaller chunks to minimise the impact on the nodes involved; and load balancing can operate on a finer granularity.
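The Δw/W property of the ring bucket is easy to demonstrate empirically. The following is a rough sketch - consistent hashing with virtual nodes, so a new item takes a roughly equal share of the hash-space from each existing item. The class name and vnode count are illustrative, not a proposed implementation:

```python
# Sketch of a "ring bucket": consistent hashing with virtual nodes.
# Adding a 5th item should move only ~1/5 of keys, and every moved
# key should land on the new item - the Delta-w/W property.
import bisect
import hashlib

class RingBucket:
    def __init__(self, vnodes_per_item: int = 64):
        self.vnodes = vnodes_per_item
        self.ring: list[tuple[int, str]] = []  # sorted (token, item) pairs

    def _token(self, s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

    def add(self, item: str) -> None:
        # Each item owns many small, non-contiguous arcs of the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._token(f"{item}#{i}"), item))

    def lookup(self, key: str) -> str:
        # A key belongs to the first vnode clockwise from its token.
        i = bisect.bisect(self.ring, (self._token(key), ""))
        return self.ring[i % len(self.ring)][1]

bucket = RingBucket()
for item in ["a", "b", "c", "d"]:
    bucket.add(item)

keys = [f"key-{i}" for i in range(10000)]
before = {k: bucket.lookup(k) for k in keys}
bucket.add("e")  # Delta-w/W = 1/5 of the hash-space changes hands
after = {k: bucket.lookup(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.1%} of keys moved")  # roughly a fifth
```

Note that every key that moves, moves *to* the new item - nothing is shuffled between existing items, which is exactly the behaviour the list/tree/straw buckets only approximate.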
In the N-tokens-per-node scheme, adding nodes to the cluster decreases the partition size, and so gives some control over how much data is stored in each partition: the average size can be reduced by adding more machines to the cluster.

The other concern you mentioned was:

> The probability of data loss increases linearly with cluster size.

but you also acknowledge that:

> In making this determination, one must take into account that if a
> larger `DF` makes reconstruction/replacement significantly faster,
> that also decreases the time window in which multiple failures can
> occur. Increasing `DF` is thus not *necessarily* increasing the total
> probability of data loss (for small values of `DF`).

Our calculations lead us to believe that the shorter rebuild window does in fact more than compensate for the increased probability of multiple failure, so with DF=N the probability of data loss is minimised. The CRUSH paper also states: "With 2-way mirroring these two factors cancel each other out, while overall data safety with more than two replicas increases with declustering [Xin et al. 2004]" ("declustering" meaning increasing DF towards N).

Regards,

-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu