Re: [ceph-users] Musings

2014-08-19 Thread Robert LeBlanc
Thanks, your responses have been helpful.


On Tue, Aug 19, 2014 at 1:48 PM, Gregory Farnum  wrote:

> On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc  wrote:
> > Greg, thanks for the reply, please see in-line.
> >
> >
> > On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum  wrote:
> >>
> >>
> >> There are many groups running clusters >1PB, but whatever makes you
> >> comfortable. There is a bit more of a learning curve once you reach a
> >> certain scale than there is with smaller installations.
> >
> >
> > What do you find to be the most difficult issues at large scale? It may
> > help ease some of the concerns if we know what we can expect.
>
> Well, I'm a developer, not a maintainer, so I'm probably the wrong
> person to ask about what surprises people. But in general it's stuff
> like:
> 1) Tunable settings matter more
> 2) Behavior that was unfortunate but left the cluster alive in a small
> cluster (eg, you have a bunch of slow OSDs that keep flapping) could
> turn into a data non-availability event in a large one (because with
> that many more OSDs misbehaving it overwhelms the monitors or
> something)
> 3) Resource consumption limits start popping up (eg, fd and pid limits
> need to be increased)
>
> Things like that. These are generally a matter of admin education at
> this scale (the code issues are fairly well sorted-out by now,
> although there were plenty of those to be found on the first
> multi-petabyte-scale cluster).
>
> >
> >> Yeah, there's no merging of Ceph clusters and I don't think there ever
> >> will be. Setting up the CRUSH maps this way to start, and only having
> >> a single entry for most of the levels, would work just fine though.
> >
> >
> > Thanks for confirming my suspicions. If we start with a well-designed
> > CRUSH map, we can probably migrate the data outside of Ceph, grow one
> > system, and as the others empty, reformat them and bring them in.
> >
> >> Yeah, there is very little real-world Ceph experience with cache pools,
> >> and there's a lot of experience working with an SSD journal + hard drive
> >> backing store; I'd start with that.
> >
> >
> > Other thoughts include using something like bcache or dm-cache on each OSD.
> > bcache is tempting because a single SSD device can serve multiple disks
> > where dm-cache has to have a separate SSD device/partition for each disk
> > (plus metadata). I plan on testing this unless someone says that it is
> > absolutely not worth the time.
> >
> >>
> >> Yeah, no async replication at all for generic workloads. You can do
> >> the "2 in my rack and one in a different rack" thing just fine, although
> >> it's a little tricky to set up. (There are email threads about this
> >> that hopefully you can find; I've been part of one of them.) The
> >> min_size is all about preserving a minimum resiliency of *every* write
> >> (if a PG's replication is degraded but not yet repaired); if you had a
> >> 2+1 setup then min_size of 2 would just make sure there are at least
> >> two copies somewhere (but not that they're in different racks or
> >> whatever).
> >
> >
> > The current discussion in the office is: if the cluster (2+1) is HEALTHY,
> > does the write return after 2 of the OSDs (the primary and one replica)
> > complete the write, or only after all three have completed it? We are
> > planning to do some testing on this as well if a clear answer can't be
> > found.
>
> It's only after all three have completed the write. Every write to
> Ceph is replicated synchronously to every OSD which is actively
> hosting the PG that the object resides in.
> -Greg
>


Re: [ceph-users] Musings

2014-08-19 Thread Gregory Farnum
On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc  wrote:
> Greg, thanks for the reply, please see in-line.
>
>
> On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum  wrote:
>>
>>
>> There are many groups running clusters >1PB, but whatever makes you
>> comfortable. There is a bit more of a learning curve once you reach a
>> certain scale than there is with smaller installations.
>
>
> What do you find to be the most difficult issues at large scale? It may help
> ease some of the concerns if we know what we can expect.

Well, I'm a developer, not a maintainer, so I'm probably the wrong
person to ask about what surprises people. But in general it's stuff
like:
1) Tunable settings matter more
2) Behavior that was unfortunate but left the cluster alive in a small
cluster (eg, you have a bunch of slow OSDs that keep flapping) could
turn into a data non-availability event in a large one (because with
that many more OSDs misbehaving it overwhelms the monitors or
something)
3) Resource consumption limits start popping up (eg, fd and pid limits
need to be increased)

Things like that. These are generally a matter of admin education at
this scale (the code issues are fairly well sorted-out by now,
although there were plenty of those to be found on the first
multi-petabyte-scale cluster).
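
To make the third point concrete, the knobs in question are roughly these
(an untested sketch; values are illustrative, not recommendations):

    # /etc/security/limits.conf on OSD hosts (match whichever user runs the daemons)
    *   soft   nofile   131072
    *   hard   nofile   131072

    # ceph.conf -- the fd limit the daemons request for themselves at startup
    [global]
        max open files = 131072

    # kernel-wide pid/thread ceiling; each OSD spawns a lot of threads
    sysctl -w kernel.pid_max=4194303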

>
>> Yeah, there's no merging of Ceph clusters and I don't think there ever
>> will be. Setting up the CRUSH maps this way to start, and only having
>> a single entry for most of the levels, would work just fine though.
>
>
> Thanks for confirming my suspicions. If we start with a well-designed
> CRUSH map, we can probably migrate the data outside of Ceph, grow one
> system, and as the others empty, reformat them and bring them in.
>
>> Yeah, there is very little real-world Ceph experience with cache pools,
>> and there's a lot of experience working with an SSD journal + hard drive
>> backing store; I'd start with that.
>
>
> Other thoughts include using something like bcache or dm-cache on each OSD.
> bcache is tempting because a single SSD device can serve multiple disks
> where dm-cache has to have a separate SSD device/partition for each disk
> (plus metadata). I plan on testing this unless someone says that it is
> absolutely not worth the time.
>
>>
>> Yeah, no async replication at all for generic workloads. You can do
>> the "2 in my rack and one in a different rack" thing just fine, although
>> it's a little tricky to set up. (There are email threads about this
>> that hopefully you can find; I've been part of one of them.) The
>> min_size is all about preserving a minimum resiliency of *every* write
>> (if a PG's replication is degraded but not yet repaired); if you had a
>> 2+1 setup then min_size of 2 would just make sure there are at least
>> two copies somewhere (but not that they're in different racks or
>> whatever).
>
>
> The current discussion in the office is: if the cluster (2+1) is HEALTHY,
> does the write return after 2 of the OSDs (the primary and one replica)
> complete the write, or only after all three have completed it? We are
> planning to do some testing on this as well if a clear answer can't be found.

It's only after all three have completed the write. Every write to
Ceph is replicated synchronously to every OSD which is actively
hosting the PG that the object resides in.
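
In command terms (pool name hypothetical): size is the synchronous replica
count every write waits for, while min_size is only the floor for accepting
I/O while a PG is degraded:

    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2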
-Greg


Re: [ceph-users] Musings

2014-08-19 Thread Robert LeBlanc
Greg, thanks for the reply, please see in-line.


On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum  wrote:

>
> There are many groups running clusters >1PB, but whatever makes you
> comfortable. There is a bit more of a learning curve once you reach a
> certain scale than there is with smaller installations.
>

What do you find to be the most difficult issues at large scale? It may
help ease some of the concerns if we know what we can expect.

> Yeah, there's no merging of Ceph clusters and I don't think there ever
> will be. Setting up the CRUSH maps this way to start, and only having
> a single entry for most of the levels, would work just fine though.
>

Thanks for confirming my suspicions. If we start with a well-designed
CRUSH map, we can probably migrate the data outside of Ceph, grow one
system, and as the others empty, reformat them and bring them in.

> Yeah, there is very little real-world Ceph experience with cache pools,
> and there's a lot of experience working with an SSD journal + hard drive
> backing store; I'd start with that.
>

Other thoughts include using something like bcache or dm-cache on each OSD.
bcache is tempting because a single SSD device can serve multiple disks
where dm-cache has to have a separate SSD device/partition for each disk
(plus metadata). I plan on testing this unless someone says that it is
absolutely not worth the time.
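
If we do test bcache, my rough understanding of the shape is one cache set
on the SSD shared by all the backing disks, something like this (untested,
device names hypothetical):

    make-bcache -C /dev/sdk              # one cache set on the shared SSD
    make-bcache -B /dev/sdb /dev/sdc     # format each OSD disk as a backing device
    echo <cset-uuid> > /sys/block/sdb/bcache/attach
    echo <cset-uuid> > /sys/block/sdc/bcache/attach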


> Yeah, no async replication at all for generic workloads. You can do
> the "2 in my rack and one in a different rack" thing just fine, although
> it's a little tricky to set up. (There are email threads about this
> that hopefully you can find; I've been part of one of them.) The
> min_size is all about preserving a minimum resiliency of *every* write
> (if a PG's replication is degraded but not yet repaired); if you had a
> 2+1 setup then min_size of 2 would just make sure there are at least
> two copies somewhere (but not that they're in different racks or
> whatever).
>

The current discussion in the office is: if the cluster (2+1) is HEALTHY,
does the write return after 2 of the OSDs (the primary and one replica)
complete the write, or only after all three have completed it? We are
planning to do some testing on this as well if a clear answer can't be found.

Thank you,
Robert LeBlanc


Re: [ceph-users] Musings

2014-08-19 Thread Gregory Farnum
On Thu, Aug 14, 2014 at 12:40 PM, Robert LeBlanc  wrote:
> We are looking to deploy Ceph in our environment and I have some musings
> that I would like some feedback on. There are concerns about scaling a
> single Ceph instance to the PBs of size we would use, so the idea is to
> start small, like one Ceph cluster per rack or two.

There are many groups running clusters >1PB, but whatever makes you
comfortable. There is a bit more of a learning curve once you reach a
certain scale than there is with smaller installations.

> Then, as we feel more
> comfortable with it, expand/combine clusters into larger systems. I'm
> not sure that it is possible to combine discrete Ceph clusters. It also
> seems to make sense to build a CRUSH map that defines regions, data centers,
> sections, rows, racks, and hosts now so that there is less data migration
> later, but I'm not sure how a merge would work.

Yeah, there's no merging of Ceph clusters and I don't think there ever
will be. Setting up the CRUSH maps this way to start, and only having
a single entry for most of the levels, would work just fine though.
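
Roughly speaking (bucket names hypothetical, using the stock bucket types),
that just means declaring the hierarchy up front and parking everything under
a single entry per level for now:

    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket row1 row
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move dc1 root=default
    ceph osd crush move row1 datacenter=dc1
    ceph osd crush move rack1 row=row1
    ceph osd crush move node01 rack=rack1

Additional racks, rows, and data centers can then be slotted in later without
disturbing the levels above them.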

>
> I've also been toying with the idea of an SSD journal per node versus an SSD
> cache tier pool versus lots of RAM for cache. Based on the performance webinar
> today, it seems that cache misses in the cache pool cause a lot of writing
> to the cache pool and severely degrade performance. I certainly like the
> idea of a heat map, so that a single read of an entire VM (backup, rsync)
> won't kill the cache pool.

Yeah, there is very little real-world Ceph experience with cache pools,
and there's a lot of experience working with an SSD journal + hard drive
backing store; I'd start with that.
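
The usual shape of that, as a sketch (device names hypothetical), is one small
journal partition per OSD carved out of a shared SSD:

    # ceph.conf
    [osd]
        osd journal size = 10240    ; journal size in MB, i.e. 10 GB

    # one spinning disk per OSD, each putting its journal on the SSD;
    # ceph-disk creates a new journal partition on /dev/sdk for each call
    ceph-disk prepare /dev/sdb /dev/sdk
    ceph-disk prepare /dev/sdc /dev/sdk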

> I've also been bouncing around the idea of having data locality by configuring
> the CRUSH map to keep two of the three replicas within the same row and the
> third replica just somewhere in the data center. Based on a conversation on
> IRC a couple of days ago, it seems that this could work very well if
> min_size is 2. But the documentation and the objective of Ceph seem to
> indicate that min_size only applies in degraded situations. During normal
> operation a write would have to be acknowledged by all three replicas before
> being returned to the client, otherwise it would be eventually consistent
> and not strongly consistent (I do like the idea of eventual consistency for
> replication as long as we can be strongly consistent in some form at the
> same time, like 2 out of 3).

Yeah, no async replication at all for generic workloads. You can do
the "2 in my rack and one in a different rack" thing just fine, although
it's a little tricky to set up. (There are email threads about this
that hopefully you can find; I've been part of one of them.) The
min_size is all about preserving a minimum resiliency of *every* write
(if a PG's replication is degraded but not yet repaired); if you had a
2+1 setup then min_size of 2 would just make sure there are at least
two copies somewhere (but not that they're in different racks or
whatever).
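
The rule for that layout tends to look something like this (untested sketch,
adjust to your own map):

    rule two_in_rack_one_out {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

With pool size 3 that gives two copies on separate hosts in the first rack
CRUSH picks and the third copy in a second rack.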
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


[ceph-users] Musings

2014-08-14 Thread Robert LeBlanc
We are looking to deploy Ceph in our environment and I have some musings
that I would like some feedback on. There are concerns about scaling a
single Ceph instance to the PBs of size we would use, so the idea is to
start small, like one Ceph cluster per rack or two. Then, as we feel more
comfortable with it, expand/combine clusters into larger systems. I'm
not sure that it is possible to combine discrete Ceph clusters. It also
seems to make sense to build a CRUSH map that defines regions, data
centers, sections, rows, racks, and hosts now so that there is less data
migration later, but I'm not sure how a merge would work.

I've also been toying with the idea of an SSD journal per node versus an SSD
cache tier pool versus lots of RAM for cache. Based on the performance
webinar today, it seems that cache misses in the cache pool cause a lot of
writing to the cache pool and severely degrade performance. I certainly
like the idea of a heat map, so that a single read of an entire VM (backup,
rsync) won't kill the cache pool.
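
For what it's worth, my understanding of the moving parts for a cache tier
(pool names hypothetical; untested on my side) is roughly:

    ceph osd tier add cold-pool hot-pool
    ceph osd tier cache-mode hot-pool writeback
    ceph osd tier set-overlay cold-pool hot-pool
    ceph osd pool set hot-pool hit_set_type bloom
    ceph osd pool set hot-pool hit_set_count 8
    ceph osd pool set hot-pool hit_set_period 3600

with the hit sets being what the promotion ("heat") decisions are built from.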

I've also been bouncing around the idea of having data locality by
configuring the CRUSH map to keep two of the three replicas within the same
row and the third replica just somewhere in the data center. Based on a
conversation on IRC a couple of days ago, it seems that this could work
very well if min_size is 2. But the documentation and the objective of Ceph
seem to indicate that min_size only applies in degraded situations. During
normal operation a write would have to be acknowledged by all three
replicas before being returned to the client, otherwise it would be
eventually consistent and not strongly consistent (I do like the idea of
eventual consistency for replication as long as we can be strongly
consistent in some form at the same time, like 2 out of 3).

I've read through the online manual, so now I'm looking for personal
perspectives that you may have.

Thanks,
Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com