Hi Peter,

It's great to hear that others have come to some of the same conclusions!

I think a CRUSH-like strategy for topologically aware
replication/routing/locality is a great idea. I can see three
mostly orthogonal sets of functionality that we're concerned with:

a) a virtual node partitioning scheme (to support heterogeneity and
management simplicity)
b) topology aware replication
c) topology aware routing

First of all, I think that while (c) depends on (b), it does not affect
partitioning or replication directly, so I'm going to set that aside
for the moment and talk just about the first two.

I'll summarise your design here, mainly to make sure that I understand
it, but also to refer back to it:

1. The hash-space is partitioned into a fixed number of partitions
2. The CRUSH algorithm is run - select(1, disk) - over the topology
using each partition as a key, to get an assignment of partition ->
physical host (primary)
2a. adding or removing a node requires re-running CRUSH to recalculate
the partition assignment (and move data)
3. The CRUSH algorithm is run - select(RF-1, disk) - over the topology
using each primary host id, to get an assignment of primary host ->
RF-1 replicas
3a. adding or removing a node requires re-running CRUSH to recalculate
replica assignment (which might be a different set of hosts to
before?) - see the sketch just below
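
To make sure we're describing the same mapping, here's a rough sketch
of steps 2 and 3 in Python. It's a hedged illustration only: a flat
list of hosts stands in for a real hierarchical topology, a straw-like
hash stands in for the actual CRUSH bucket algorithms, and the names
(select, hosts, NUM_PARTITIONS, RF) are made up, so please read it as
pseudocode for the shape of the assignment rather than an
implementation.

    import hashlib

    def _hash(*parts):
        return int(hashlib.md5("/".join(str(p) for p in parts).encode()).hexdigest(), 16)

    def select(key, candidates, n, exclude=()):
        # Straw-like selection: each candidate draws a deterministic
        # pseudo-random "straw" for this key and the n longest straws win.
        scored = sorted(((_hash(key, c), c) for c in candidates if c not in exclude),
                        reverse=True)
        return [c for _, c in scored[:n]]

    hosts = ["host%d" % i for i in range(6)]
    RF = 3
    NUM_PARTITIONS = 16

    # Step 2: select(1, disk) per partition -> primary host.
    primary = {p: select(p, hosts, 1)[0] for p in range(NUM_PARTITIONS)}

    # Step 3: select(RF-1, disk) per primary host -> the remaining replicas.
    replicas = {h: select(h, hosts, RF - 1, exclude=(h,))
                for h in set(primary.values())}

    for p in range(NUM_PARTITIONS):
        print(p, [primary[p]] + replicas[primary[p]])

(2a) and (3a) then correspond to re-running both mappings after
changing `hosts` and diffing the results to see what data moves.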

Here are some thoughts:
(clarification: when I'm talking about buckets, I'm referring to the
same concept as in the CRUSH paper!)

One of my concerns about using CRUSH exactly as described in the paper
is that it seems to be sub-optimal in the amount of data that it moves
after modifying the topology. The authors of the paper introduce
several "bucket types" (uniform, list, tree, straw) which appear to be
sub-optimal alternatives to consistent hashing, with various
trade-offs. Why not use consistent hashing? Given (2a) and (3a) I
think we might end up moving way too much data when the set of
replicas changes completely for a given host.

Let's suppose we introduce our own bucket type called a "ring bucket".
Each item in a ring bucket is assigned an equal, non-contiguous
portion of the key hash-space, which determines which keys are
assigned to it. When an item is added to the ring bucket, it takes an
equal portion of the hash-space from every other item already in the
bucket. And vice-versa for removals. It's easy to see that this ring
bucket implements consistent hashing with some unspecified virtual
node scheme. Additions and removals would be optimal (only Δw/W of
the keys require moving when the topology changes).
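
To make the idea concrete, here is a minimal sketch of such a ring
bucket in Python. The vnode count and the class/method names are my
own illustrative assumptions, not a proposed interface:

    import bisect
    import hashlib

    VNODES_PER_ITEM = 64  # assumed granularity of each item's non-contiguous share

    def _hash(*parts):
        return int(hashlib.md5("/".join(str(p) for p in parts).encode()).hexdigest(), 16)

    class RingBucket:
        def __init__(self):
            self._ring = []  # sorted list of (token, item)

        def add(self, item):
            # The new item claims VNODES_PER_ITEM pseudo-random tokens, so it
            # takes a roughly equal share of the hash-space from every item
            # already in the bucket.
            for i in range(VNODES_PER_ITEM):
                bisect.insort(self._ring, (_hash(item, i), item))

        def remove(self, item):
            self._ring = [(t, o) for (t, o) in self._ring if o != item]

        def choose(self, key):
            # Walk clockwise from the key's hash to the next token.
            i = bisect.bisect(self._ring, (_hash(key),))
            return self._ring[i % len(self._ring)][1]

    ring = RingBucket()
    for node in ("n1", "n2", "n3"):
        ring.add(node)
    before = {k: ring.choose(k) for k in range(1000)}
    ring.add("n4")
    moved = sum(1 for k in range(1000) if ring.choose(k) != before[k])
    print(moved)  # roughly a quarter of the keys move to the new item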

Using this ring bucket in the CRUSH topology (with the hash function
being the identity function) would give the exact same distribution
properties as the virtual node strategy that I suggested previously,
but of course with much better topology awareness.

This makes it evident that the partitioning scheme and a CRUSH-like
replication scheme are orthogonal concerns. In the same way as NTS
currently uses the ring to provide distribution at DC and rack level
by conceptually separating the ring into distinct logical rings for
each DC, a CrushReplicationStrategy could use the ring as its
bucketing function to distribute partitions in the topology.
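
As a sketch of what I mean by using the ring as the bucketing
function: every level of a (hypothetical) topology is treated as its
own small consistent-hash ring when descending to place a partition.
Again this is illustrative pseudocode, not NTS or CRUSH internals, and
the topology names are made up:

    import hashlib

    def _hash(*parts):
        return int(hashlib.md5("/".join(str(p) for p in parts).encode()).hexdigest(), 16)

    # Hypothetical topology: DC -> rack -> nodes.
    topology = {
        "dc1": {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]},
        "dc2": {"rack1": ["n5", "n6"], "rack2": ["n7", "n8"]},
    }

    def choose(key, children, vnodes=8):
        # Each child owns `vnodes` positions on this level's ring; the child
        # owning the nearest position clockwise from the key wins.
        k = _hash(key)
        best = None
        for c in children:
            for v in range(vnodes):
                d = (_hash(c, v) - k) % 2**128
                if best is None or d < best[0]:
                    best = (d, c)
        return best[1]

    def place_one_replica(partition):
        dc = choose(partition, topology)              # pick a DC on the ring
        rack = choose(partition, topology[dc])        # then a rack within it
        node = choose(partition, topology[dc][rack])  # then a node
        return dc, rack, node

Topology-aware replication (choosing further racks/nodes for the
remaining replicas) would then sit on top of this, which is exactly
the orthogonality I mean.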

This brings me on to (1) and the reasons for our choice of virtual
node scheme - choose N random tokens - instead of the Dynamo-like
scheme that you suggest where the partitions are fixed in advance.
With the Dynamo scheme, the size of a virtual node partition will only
ever grow as more data is inserted. Since the number of partitions is
fixed when the cluster is created, the partition size is unbounded.

There are certain advantages to having a limit on partition size.
Streaming failures that cause retries do not have to resend so much
data. Streaming operations can be staggered in smaller chunks to
minimise the impact on the nodes involved. Load balancing can operate
on a finer granularity.

In the N tokens per node scheme, adding nodes to the cluster decreases
the partition size and so gives some control over how much data is
stored in each partition. The average size can be reduced by adding
more machines to the cluster.
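
A quick illustrative calculation (the numbers are made up) shows the
contrast:

    data_tb = 100.0                    # total data in the cluster

    # Dynamo-style: the partition count is fixed at cluster creation, so the
    # average partition only ever grows as data grows.
    fixed_partitions = 1024
    print(data_tb / fixed_partitions)  # ~0.1 TB now; ~1 TB once data reaches 1 PB

    # N random tokens per node: the partition count grows with the cluster,
    # so adding machines shrinks the average partition.
    tokens_per_node = 256
    for nodes in (12, 24, 48):
        print(nodes, data_tb / (tokens_per_node * nodes))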

The other concern you mentioned was
> The probability of data loss increases linearly with cluster size.

but you also acknowledge that

> In making this determination, one must take into account that if a
> larger `DF` makes reconstruction/replacement significantly faster,
> that also decreases the time window in which multiple failures can
> occur. Increasing `DF` is thus not *necessarily* increasing the total
> probability of data loss (for small values of `DF`).

Our calculations lead us to believe that in fact the shorter rebuild
window more than compensates for the increased probability of multiple
failures, so with DF=N the probability of data loss is minimised.
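
To make the trade-off concrete, here is a hedged back-of-envelope
sketch of the two competing factors (the failure rate and rebuild time
are illustrative assumptions, not the calculations referred to above):

    def competing_factors(DF, annual_failure_rate=0.05, rebuild_days_one_node=2.0):
        # Spreading a failed node's data over DF peers shrinks the rebuild
        # window roughly as 1/DF ...
        window_days = rebuild_days_one_node / DF
        # ... but there are DF peers whose failure during that window could
        # overlap the lost replicas.
        p_peer_fails_in_window = annual_failure_rate * window_days / 365.0
        return window_days, DF * p_peer_fails_in_window

    for DF in (2, 8, 32, 128):
        print(DF, competing_factors(DF))

The expected number of overlapping failures stays flat as DF grows
(the 2-way "cancellation"), while for RF >= 3 a loss needs several
overlapping failures to land on the same replica set within an
ever-shorter window, which is where the result quoted below comes in.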

The CRUSH paper also states:

"With 2-way mirroring these two factors cancel
each other out, while overall data safety with more than two
replicas increases with declustering [Xin et al. 2004]"

("declustering" meaning increasing DF towards N)


Regards,


--
Sam Overton
Acunu | http://www.acunu.com | @acunu
