Re: [ceph-users] Musings
Thanks, your responses have been helpful.

On Tue, Aug 19, 2014 at 1:48 PM, Gregory Farnum wrote:
> On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc wrote:
> > Greg, thanks for the reply, please see in-line.
> >
> > On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum wrote:
> >> There are many groups running clusters >1PB, but whatever makes you
> >> comfortable. There is a bit more of a learning curve once you reach
> >> a certain scale than there is with smaller installations.
> >
> > What do you find to be the most difficult issues at large scale? It
> > may help ease some of the concerns if we know what we can expect.
>
> Well, I'm a developer, not a maintainer, so I'm probably the wrong
> person to ask about what surprises people. But in general it's stuff
> like:
> 1) Tunable settings matter more
> 2) Behavior that was unfortunate but left the cluster alive in a small
> cluster (e.g., you have a bunch of slow OSDs that keep flapping) could
> turn into a data non-availability event in a large one (because with
> that many more OSDs misbehaving it overwhelms the monitors or
> something)
> 3) Resource consumption limits start popping up (e.g., fd and pid
> limits need to be increased)
>
> Things like that. These are generally a matter of admin education at
> this scale (the code issues are fairly well sorted out by now,
> although there were plenty of those to be found on the first
> multi-petabyte-scale cluster).
>
> >> Yeah, there's no merging of Ceph clusters and I don't think there
> >> ever will be. Setting up the CRUSH maps this way to start, and only
> >> having a single entry for most of the levels, would work just fine
> >> though.
> >
> > Thanks for confirming my suspicions. If we start with a
> > well-designed CRUSH map, we can probably migrate the data outside of
> > Ceph and just grow one system, and as the others empty, reformat
> > them and bring them in.
> >
> >> Yeah, there is very little real-world Ceph experience with cache
> >> pools, and there's a lot working with an SSD journal + hard drive
> >> backing store; I'd start with that.
> >
> > Other thoughts are using something like bcache or dm-cache on each
> > OSD. bcache is tempting because a single SSD device can serve
> > multiple disks, where dm-cache has to have a separate SSD
> > device/partition for each disk (plus metadata). I plan on testing
> > this unless someone says that it is absolutely not worth the time.
> >
> >> Yeah, no async replication at all for generic workloads. You can do
> >> the "two in my rack and one in a different rack" thing just fine,
> >> although it's a little tricky to set up. (There are email threads
> >> about this that hopefully you can find; I've been part of one of
> >> them.) The min_size is all about preserving a minimum resiliency of
> >> *every* write (if a PG's replication is degraded but not yet
> >> repaired); if you had a 2+1 setup then min_size of 2 would just
> >> make sure there are at least two copies somewhere (but not that
> >> they're in different racks or whatever).
> >
> > The current discussion in the office is: if the cluster (2+1) is
> > HEALTHY, does the write return after 2 of the OSDs (itself and one
> > replica) complete the write, or only after all three have completed
> > the write? We are planning to do some testing on this as well if a
> > clear answer can't be found.
>
> It's only after all three have completed the write. Every write to
> Ceph is replicated synchronously to every OSD which is actively
> hosting the PG that the object resides in.
> -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
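For anyone following along, the "two in one rack, one in a different rack" placement Greg describes is expressed as a CRUSH rule. A rough sketch in decompiled CRUSH map syntax might look like the following; the bucket names (rack1, default) and ruleset number are illustrative, not from the thread:

```
# Sketch: 2 replicas on distinct hosts in one rack, remainder elsewhere.
# Bucket names are placeholders; adapt to your own hierarchy.
rule two_in_rack_one_out {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    # Pick 2 OSDs on distinct hosts inside rack1.
    step take rack1
    step chooseleaf firstn 2 type host
    step emit
    # Pick the remaining replica(s) anywhere under the root, separated
    # by rack ("firstn -2" means pool size minus 2, i.e. 1 for size=3).
    step take default
    step chooseleaf firstn -2 type rack
    step emit
}
```

One caveat with this pattern: the second `step take` starts over from the root, so CRUSH does not automatically exclude rack1 from the third pick; this is part of why the setup is "a little tricky", and the mailing-list threads Greg mentions cover the variants.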
Re: [ceph-users] Musings
On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc wrote:
> Greg, thanks for the reply, please see in-line.
>
> On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum wrote:
>> There are many groups running clusters >1PB, but whatever makes you
>> comfortable. There is a bit more of a learning curve once you reach a
>> certain scale than there is with smaller installations.
>
> What do you find to be the most difficult issues at large scale? It
> may help ease some of the concerns if we know what we can expect.

Well, I'm a developer, not a maintainer, so I'm probably the wrong
person to ask about what surprises people. But in general it's stuff
like:
1) Tunable settings matter more
2) Behavior that was unfortunate but left the cluster alive in a small
cluster (e.g., you have a bunch of slow OSDs that keep flapping) could
turn into a data non-availability event in a large one (because with
that many more OSDs misbehaving it overwhelms the monitors or
something)
3) Resource consumption limits start popping up (e.g., fd and pid
limits need to be increased)

Things like that. These are generally a matter of admin education at
this scale (the code issues are fairly well sorted out by now, although
there were plenty of those to be found on the first
multi-petabyte-scale cluster).

>> Yeah, there's no merging of Ceph clusters and I don't think there
>> ever will be. Setting up the CRUSH maps this way to start, and only
>> having a single entry for most of the levels, would work just fine
>> though.
>
> Thanks for confirming my suspicions. If we start with a well-designed
> CRUSH map, we can probably migrate the data outside of Ceph and just
> grow one system, and as the others empty, reformat them and bring
> them in.
>
>> Yeah, there is very little real-world Ceph experience with cache
>> pools, and there's a lot working with an SSD journal + hard drive
>> backing store; I'd start with that.
>
> Other thoughts are using something like bcache or dm-cache on each
> OSD. bcache is tempting because a single SSD device can serve multiple
> disks, where dm-cache has to have a separate SSD device/partition for
> each disk (plus metadata). I plan on testing this unless someone says
> that it is absolutely not worth the time.
>
>> Yeah, no async replication at all for generic workloads. You can do
>> the "two in my rack and one in a different rack" thing just fine,
>> although it's a little tricky to set up. (There are email threads
>> about this that hopefully you can find; I've been part of one of
>> them.) The min_size is all about preserving a minimum resiliency of
>> *every* write (if a PG's replication is degraded but not yet
>> repaired); if you had a 2+1 setup then min_size of 2 would just make
>> sure there are at least two copies somewhere (but not that they're in
>> different racks or whatever).
>
> The current discussion in the office is: if the cluster (2+1) is
> HEALTHY, does the write return after 2 of the OSDs (itself and one
> replica) complete the write, or only after all three have completed
> the write? We are planning to do some testing on this as well if a
> clear answer can't be found.

It's only after all three have completed the write. Every write to Ceph
is replicated synchronously to every OSD which is actively hosting the
PG that the object resides in.
-Greg
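On the resource-limit point above, the knobs involved on a typical Linux OSD host are the per-user open-file limit and the system-wide fd/pid headroom. The values below are placeholders, not recommendations from the thread; size them to your OSD count per host and client load:

```
# /etc/security/limits.conf -- raise the open-file limit for the user
# running the OSD daemons (values are illustrative placeholders)
ceph  soft  nofile  65536
ceph  hard  nofile  131072

# /etc/sysctl.d/90-ceph.conf -- system-wide file-descriptor and
# pid/thread headroom (again, placeholder values)
fs.file-max = 524288
kernel.pid_max = 4194304
```

Each OSD holds sockets to its peers plus file handles for its backing store, so the fd count scales with cluster size, which is why this tends to surface only at larger scale.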
Re: [ceph-users] Musings
Greg, thanks for the reply, please see in-line.

On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum wrote:
> There are many groups running clusters >1PB, but whatever makes you
> comfortable. There is a bit more of a learning curve once you reach a
> certain scale than there is with smaller installations.

What do you find to be the most difficult issues at large scale? It may
help ease some of the concerns if we know what we can expect.

> Yeah, there's no merging of Ceph clusters and I don't think there ever
> will be. Setting up the CRUSH maps this way to start, and only having
> a single entry for most of the levels, would work just fine though.

Thanks for confirming my suspicions. If we start with a well-designed
CRUSH map, we can probably migrate the data outside of Ceph and just
grow one system, and as the others empty, reformat them and bring them
in.

> Yeah, there is very little real-world Ceph experience with cache
> pools, and there's a lot working with an SSD journal + hard drive
> backing store; I'd start with that.

Other thoughts are using something like bcache or dm-cache on each OSD.
bcache is tempting because a single SSD device can serve multiple
disks, where dm-cache has to have a separate SSD device/partition for
each disk (plus metadata). I plan on testing this unless someone says
that it is absolutely not worth the time.

> Yeah, no async replication at all for generic workloads. You can do
> the "two in my rack and one in a different rack" thing just fine,
> although it's a little tricky to set up. (There are email threads
> about this that hopefully you can find; I've been part of one of
> them.) The min_size is all about preserving a minimum resiliency of
> *every* write (if a PG's replication is degraded but not yet
> repaired); if you had a 2+1 setup then min_size of 2 would just make
> sure there are at least two copies somewhere (but not that they're in
> different racks or whatever).

The current discussion in the office is: if the cluster (2+1) is
HEALTHY, does the write return after 2 of the OSDs (itself and one
replica) complete the write, or only after all three have completed the
write? We are planning to do some testing on this as well if a clear
answer can't be found.

Thank you,
Robert LeBlanc
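Greg answers this in the follow-up: the write returns only after every acting replica has it. The acknowledgment rule can be illustrated with a toy model; this is not Ceph code, just a sketch of "ack the client only when all replicas ack":

```python
# Toy model of Ceph-style synchronous replication: a write is
# acknowledged to the client only once *all* acting replicas have
# stored it. Illustration only, not actual Ceph code.

def replicated_write(obj, replicas):
    """Apply a (key, value) write to every replica; ack when all succeed."""
    key, value = obj
    acks = 0
    for r in replicas:
        r[key] = value      # each replica is modeled as a dict
        acks += 1
    # With size=3 the client sees success only after 3 acks, even
    # though min_size=2 would keep a degraded PG writeable.
    return acks == len(replicas)

replicas = [{}, {}, {}]      # three acting OSDs for one PG
ok = replicated_write(("obj1", b"data"), replicas)
print(ok)                    # True: all three replicas acknowledged
print(all("obj1" in r for r in replicas))
```

The point of the model is the ordering: there is no code path that returns to the client after only two of three acks while the PG is healthy; min_size only changes what happens when replicas are missing.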
Re: [ceph-users] Musings
On Thu, Aug 14, 2014 at 12:40 PM, Robert LeBlanc wrote:
> We are looking to deploy Ceph in our environment and I have some
> musings that I would like some feedback on. There are concerns about
> scaling a single Ceph instance to the PBs of size we would use, so the
> idea is to start small, like one Ceph cluster per rack or two.

There are many groups running clusters >1PB, but whatever makes you
comfortable. There is a bit more of a learning curve once you reach a
certain scale than there is with smaller installations.

> Then as we feel more comfortable with it, expand/combine clusters into
> larger systems. I'm not sure that it is possible to combine discrete
> Ceph clusters. It also seems to make sense to build a CRUSH map that
> defines regions, data centers, sections, rows, racks, and hosts now so
> that there is less data migration later, but I'm not sure how a merge
> would work.

Yeah, there's no merging of Ceph clusters and I don't think there ever
will be. Setting up the CRUSH maps this way to start, and only having a
single entry for most of the levels, would work just fine though.

> I've also been toying with the idea of an SSD journal per node versus
> an SSD cache-tier pool versus lots of RAM for cache. Based on the
> performance webinar today, it seems that cache misses in the cache
> pool cause a lot of writing to the cache pool and severely degrade
> performance. I certainly like the idea of a heat map, so that a single
> read of an entire VM (backup, rsync) won't kill the cache pool.

Yeah, there is very little real-world Ceph experience with cache pools,
and there's a lot working with an SSD journal + hard drive backing
store; I'd start with that.

> I've also been bouncing around the idea of getting data locality by
> configuring the CRUSH map to keep two of the three replicas within the
> same row and the third replica just somewhere in the data center.
> Based on a conversation on IRC a couple of days ago, it seems that
> this could work very well if min_size is 2. But the documentation and
> the objective of Ceph seem to indicate that min_size only applies in
> degraded situations. During normal operation a write would have to be
> acknowledged by all three replicas before being returned to the
> client; otherwise it would be eventually consistent and not strongly
> consistent (I do like the idea of eventually consistent replication as
> long as we can be strongly consistent in some form at the same time,
> like 2 out of 3).

Yeah, no async replication at all for generic workloads. You can do the
"two in my rack and one in a different rack" thing just fine, although
it's a little tricky to set up. (There are email threads about this
that hopefully you can find; I've been part of one of them.) The
min_size is all about preserving a minimum resiliency of *every* write
(if a PG's replication is degraded but not yet repaired); if you had a
2+1 setup then min_size of 2 would just make sure there are at least
two copies somewhere (but not that they're in different racks or
whatever).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
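For reference, size and min_size are per-pool settings, and the distinction Greg draws maps onto commands like these (the pool name "rbd" is just an example):

```
# Replicate 3 ways: a write is acknowledged to the client only after
# all 3 copies have landed (when the PG is healthy).
ceph osd pool set rbd size 3

# Keep serving I/O while degraded, as long as at least 2 copies are up.
# This does NOT mean healthy-state writes return after only 2 acks.
ceph osd pool set rbd min_size 2
```

In other words, size governs the normal-case acknowledgment count, while min_size is a floor on how degraded a PG may get before it stops accepting I/O.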
[ceph-users] Musings
We are looking to deploy Ceph in our environment and I have some
musings that I would like some feedback on. There are concerns about
scaling a single Ceph instance to the PBs of size we would use, so the
idea is to start small, like one Ceph cluster per rack or two. Then as
we feel more comfortable with it, expand/combine clusters into larger
systems. I'm not sure that it is possible to combine discrete Ceph
clusters. It also seems to make sense to build a CRUSH map that defines
regions, data centers, sections, rows, racks, and hosts now so that
there is less data migration later, but I'm not sure how a merge would
work.

I've also been toying with the idea of an SSD journal per node versus
an SSD cache-tier pool versus lots of RAM for cache. Based on the
performance webinar today, it seems that cache misses in the cache pool
cause a lot of writing to the cache pool and severely degrade
performance. I certainly like the idea of a heat map, so that a single
read of an entire VM (backup, rsync) won't kill the cache pool.

I've also been bouncing around the idea of getting data locality by
configuring the CRUSH map to keep two of the three replicas within the
same row and the third replica just somewhere in the data center. Based
on a conversation on IRC a couple of days ago, it seems that this could
work very well if min_size is 2. But the documentation and the
objective of Ceph seem to indicate that min_size only applies in
degraded situations. During normal operation a write would have to be
acknowledged by all three replicas before being returned to the client;
otherwise it would be eventually consistent and not strongly consistent
(I do like the idea of eventually consistent replication as long as we
can be strongly consistent in some form at the same time, like 2 out
of 3).

I've read through the online manual, so now I'm looking for personal
perspectives that you may have.

Thanks,
Robert LeBlanc
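On the "define the full hierarchy up front" idea discussed in this thread: the CRUSH tree can carry levels that each contain a single entry today, so racks or rows can be added later without restructuring. A sketch with invented names:

```
# Build a deep CRUSH hierarchy with one entry per level for now.
# All bucket names (dc1, row1, rack1, host1) are illustrative.
ceph osd crush add-bucket dc1   datacenter
ceph osd crush add-bucket row1  row
ceph osd crush add-bucket rack1 rack
ceph osd crush move dc1   root=default
ceph osd crush move row1  datacenter=dc1
ceph osd crush move rack1 row=row1
ceph osd crush move host1 rack=rack1
```

Placement rules that reference a bucket type (e.g. "separate replicas by rack") keep working as more buckets of that type appear, which is what limits data migration later.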