On Thursday, November 22, 2012 at 7:38 PM, Chen, Xiaoxi wrote:
> Hi list,
> I am thinking about the possibility of adding some primitives to CRUSH to 
> meet the following user stories:
> A. "Same host", "Same rack"
> To balance availability and performance, one may want a rule like this: 3 
> replicas, where Replica 1 and Replica 2 should be in the same rack while 
> Replica 3 resides in another rack. This is common because a typical 
> datacenter deployment usually has much less uplink bandwidth than backbone 
> bandwidth.
>  
> More aggressive users may even want the same host, since the most common 
> failure is disk failure, and we have several disks (which also means several 
> OSDs) residing in the same physical machine. If we could place Replica 1 & 2 
> on the same host but Replica 3 somewhere else, it would not only reduce 
> replication traffic but also save a lot of time & bandwidth when a disk 
> failure happens and recovery takes place.
This is a feature we're definitely interested in! The difficulty with this (as 
I understand it) is that right now the CRUSH code is very parallel and 
even-handed — each instruction in a CRUSH rule is executed in sequence on every 
bucket it has in its set. Somebody would need to change it so that you could 
say something like:
step take root
step choose firstn -1 rack
step rack0 choose 2 device
step rackn choose 1 device
emit
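For comparison, the closest approximation I can think of with the current rule 
language is a multi-pass rule along the lines of the sketch below (the root 
name "default" and the ruleset number are just placeholders). The catch is 
that the second pass has no way to exclude the rack picked in the first pass, 
so the third replica can still land in the same rack, which is exactly the 
primitive that's missing:

  rule same_rack_pair_sketch {
          ruleset 3
          type replicated
          min_size 3
          max_size 3
          # first pass: pick one rack, then two OSDs on different hosts in it
          step take default
          step choose firstn 1 type rack
          step chooseleaf firstn 2 type host
          step emit
          # second pass: pick one more OSD anywhere under the root
          # (nothing prevents it from falling into the same rack)
          step take default
          step chooseleaf firstn 1 type host
          step emit
  }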
  
> B."local"
> Although we cannot mount RBD volumes to where a OSD running at, but QEMU 
> canbe used. This scenarios is really common in cloud computing. We have a 
> large amount of compute-nodes, just plug in some disks and make the machines 
> reused for Ceph cluster. To reduce network traffic and latency , if it is 
> possible to have some placement-group-maybe 3 PG for a compute-node. Define 
> the rules like: primary copy of the PG should (if possible) reside in 
> localhost, the second replica should go different places
>  
> By doing this, a significant amount of network bandwidth and an RTT can be 
> saved. What's more, since reads always go to the primary, it would benefit a 
> lot from such a mechanism.
>  
> It looks to me that A is simpler but B seems much more complex. Hoping for 
> input.
This has existed previously, in the form of local PGs and CRUSH force-feeding. 
We ripped it out in the name of simplicity, and because we never really found 
a justifiable use case. The one we had was Hadoop, and the rumors we heard out 
of the big shops were that, for that workload, local writes weren't actually a 
win…
The other issue with it is that data usage across the cluster's disks can 
become pretty badly unbalanced since placement is no longer pseudo-random, and 
Ceph still needs a lot of work on full-disk management before that's something 
we want to allow.
-Greg
