On Thursday, November 22, 2012 at 7:38 PM, Chen, Xiaoxi wrote:
> Hi list,
> I am thinking about the possibility of adding some primitives to CRUSH to
> meet the following user stories:
>
> A. "Same host", "Same rack"
> To balance availability against performance, one may want a rule like: 3
> replicas, where replicas 1 and 2 are in the same rack while replica 3
> resides in another rack. This is common because a typical datacenter
> deployment usually has much less uplink bandwidth than backbone bandwidth.
>
> More aggressive users may even want same-host placement, since the most
> common failure is disk failure, and several disks (which also means several
> OSDs) reside in the same physical machine. If we can place replicas 1 and 2
> on the same host and replica 3 somewhere else, it will not only reduce
> replication traffic but also save a lot of time and bandwidth when a disk
> fails and a recovery takes place.

This is a feature we're definitely interested in! The difficulty with this
(as I understand it) is that right now the CRUSH code is very parallel and
even-handed: each instruction in a CRUSH rule is executed in sequence on
every bucket it has in its set. Somebody would need to change it so that you
could say something like:

  step take root
  step choose firstn -1 rack
  step rack0 choose 2 device
  step rackn choose 1 device
  emit

> B. "Local"
> Although we cannot mount RBD volumes on a host where an OSD is running,
> QEMU can be used. This scenario is really common in cloud computing: we
> have a large number of compute nodes, so we just plug in some disks and
> reuse the machines for the Ceph cluster. To reduce network traffic and
> latency, it would be nice to have some placement groups (maybe 3 PGs) per
> compute node, with a rule like: the primary copy of the PG should (if
> possible) reside on localhost, while the second replica goes somewhere
> else.
>
> By doing this, a significant amount of network bandwidth and an RTT can be
> saved. What's more, since reads always go to the primary, it would benefit
> a lot from such a mechanism.
>
> It looks to me like A is simpler while B seems much more complex. Hoping
> for input.

This has existed previously, in the form of local PGs and CRUSH
force-feeding. We ripped it out in the name of simplicity and because we
never really found a justifiable use case: the one we had was Hadoop, and
the rumors we heard out of the big shops were that for that workload, local
writes weren't actually a win… The other issue is that data use across the
cluster's disks can become pretty badly unbalanced, since placement is no
longer pseudo-random, and Ceph still needs a lot of work on full-disk
management before that's something we want to allow.
-Greg
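
For reference, the closest the existing rule language comes to the 2+1 rack
layout in story A is over-selection plus firstn truncation. A minimal
sketch, assuming a standard root -> rack -> host hierarchy (the rule name
and ruleset number here are illustrative, not from the thread):

  rule rack_two_plus_one {
          ruleset 1
          type replicated
          min_size 3
          max_size 3
          # pick two distinct racks
          step take root
          step choose firstn 2 type rack
          # pick up to two OSDs (via distinct hosts) in each chosen rack
          step chooseleaf firstn 2 type host
          step emit
  }

With a pool size of 3, CRUSH selects two racks and two hosts in each, then
keeps only the first three results: two replicas land in the first rack and
the third in the second rack. What this cannot express is addressing rack0
and rackn individually, which is exactly the per-bucket control the
hypothetical rule above asks for.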