On 2017-09-05 02:41 PM, Gregory Farnum wrote:
> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas <flor...@hastexo.com> wrote:
>> Hi everyone,
>>
>> with the Luminous release out the door and the Labor Day weekend
>> over, I hope I can kick off a discussion on another issue that has
>> irked me a bit for quite a while. There doesn't seem to be a good
>> documented answer to this: what are Ceph's real limits when it
>> comes to RBD snapshots?
>>
>> For most people, any RBD image will have perhaps a single-digit
>> number of snapshots. For example, in an OpenStack environment we
>> typically have one snapshot per Glance image, a few snapshots per
>> Cinder volume, and perhaps a few snapshots per ephemeral Nova disk
>> (unless clones are configured to flatten immediately). Ceph
>> generally performs well under those circumstances.
>>
>> However, things sometimes start getting problematic when RBD
>> snapshots are generated frequently, and in an automated fashion.
>> I've seen Ceph operators configure snapshots on a daily or even
>> hourly basis, typically when using snapshots as a backup strategy
>> (where they promise to allow for very short RTO and RPO). In
>> combination with thousands or maybe tens of thousands of RBDs,
>> that's a lot of snapshots. And in such scenarios (and only in
>> those), users have been bitten by a few nasty bugs in the past —
>> here's an example where the OSD snap trim queue went berserk in the
>> event of lots of snapshots being deleted:
>>
>> http://tracker.ceph.com/issues/9487
>> https://www.spinics.net/lists/ceph-devel/msg20470.html
>>
>> It seems to me that there still isn't a good recommendation along
>> the lines of "try not to have more than X snapshots per RBD image"
>> or "try not to have more than Y snapshots in the cluster overall".
>> Or is the "correct" recommendation actually "create as many
>> snapshots as you might possibly want, none of that is allowed to
>> create any instability nor performance degradation and if it does,
>> that's a bug"?
>
> I think we're closer to "as many snapshots as you want", but there
> are some known shortages there.
>
> First of all, if you haven't seen my talk from the last OpenStack
> summit on snapshots and you want a bunch of details, go watch that. :p
> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1

> There are a few dimensions there can be failures with snapshots:

> 1) right now the way we mark snapshots as deleted is suboptimal —
> when deleted they go into an interval_set in the OSDMap. So if you
> have a bunch of holes in your deleted snapshots, it is possible to
> inflate the osdmap to a size which causes trouble. But I'm not sure
> if we've actually seen this be an issue yet — it requires both a
> large cluster, and a large map, and probably some other failure
> causing osdmaps to be generated very rapidly.
In our use case, we are severely hampered by the size of removed_snaps
(50k+ intervals) in the OSDMap, to the point where ~80% of all CPU time
is spent in PGPool::update and its interval calculation code. We have a
cluster of around 100k RBDs, with each RBD having up to 25 snapshots and
only a small portion of our RBDs mapped at a time (~500-1000). For size
and performance reasons we try to keep the number of snapshots low (<25),
which means we need to prune snapshots. Since in our use case RBDs 'age'
at different rates, snapshot pruning creates holes, to the point where
the removed_snaps interval set in the osdmap contains 50k-100k intervals
in many of our Ceph clusters. In general, around 2 snapshot removal
operations currently happen per minute, simply because of the volume of
snapshots and users we have.
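To make the fragmentation concrete, here is a toy model (Python, purely
illustrative; the real structure is Ceph's C++ interval_set<snapid_t> in the
OSDMap) of how removed_snaps collapses snapids into intervals, and why pruning
every other snapshot produces one interval per deletion:

```python
def to_intervals(snapids):
    """Collapse a set of removed snapids into sorted half-open [start, end)
    intervals, roughly how removed_snaps is represented in the OSDMap."""
    ivs = []
    for s in sorted(snapids):
        if ivs and s == ivs[-1][1]:
            ivs[-1][1] = s + 1        # contiguous: extend the last interval
        else:
            ivs.append([s, s + 1])    # non-contiguous: opens a new hole
    return ivs

removed = set(range(0, 20, 2))         # prune every other snapshot
print(len(to_intervals(removed)))      # 10 -- one interval per deleted snapid
removed |= set(range(1, 20, 2))        # also delete the snapids in between
print(len(to_intervals(removed)))      # 1 -- all holes coalesced
```

The point is that the interval set's size is driven by the pattern of holes,
not by the number of deleted snapshots: deleting *more* snapshots can shrink
the set, as long as the deletions close gaps.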

We found PGPool::update and the interval calculation code to be quite
inefficient. Some small changes made it a lot faster, giving us more
breathing room. We shared these, and most have already been applied:
https://github.com/ceph/ceph/pull/17088
https://github.com/ceph/ceph/pull/17121
https://github.com/ceph/ceph/pull/17239
https://github.com/ceph/ceph/pull/17265
https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
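For intuition about where the cost goes, here is a sketch (Python, hypothetical;
not the actual C++ from PGPool::update or the PRs above) of an interval-set
difference. Computing only `new_removed − old_removed` per map epoch touches just
the changed intervals, whereas re-walking a 50k-100k-entry set on every osdmap
update is the kind of work that shows up as pegged CPUs:

```python
def subtract(a, b):
    """Difference of two sorted, non-overlapping lists of half-open
    [start, end) intervals: the parts of `a` not covered by `b`."""
    out = []
    bi = 0
    for s, e in a:
        cur = s
        # skip b-intervals that end before this a-interval starts
        while bi < len(b) and b[bi][1] <= cur:
            bi += 1
        j = bi
        while j < len(b) and b[j][0] < e:
            bs, be = b[j]
            if cur < bs:
                out.append((cur, bs))  # uncovered gap before this b-interval
            cur = max(cur, be)
            j += 1
        if cur < e:
            out.append((cur, e))       # uncovered tail of the a-interval
    return out

# newly removed snapids in this epoch: intervals in new but not old
print(subtract([(0, 10)], [(0, 3), (5, 6)]))  # [(3, 5), (6, 10)]
```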

These patches helped our use case, but overall CPU usage in this area is
still high (>70% or so), making the Ceph cluster slow and causing blocked
requests, and many operations (e.g. rbd map) take a long time.

We are trying to work around these issues by changing our snapshot
strategy. In the short term we are manually defragmenting the interval
set: we scan for holes and delete the snapids sitting between them so
that the holes coalesce. This is not a nice thing to have to do. In some
cases we employ strategies to 'recreate' old snapshots (as we need to
keep them) at higher snapids. For our use case, a 'snapid rename'
feature would have been quite helpful.
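The hole-scanning step can be sketched like this (Python, illustrative only;
our actual tooling works against the live osdmap): find the single-snapid gaps
between adjacent removed intervals, since deleting exactly those snapids merges
their neighbours and shrinks removed_snaps:

```python
def coalescing_snapids(intervals):
    """Given sorted half-open [start, end) removed intervals, return the
    live snapids that sit alone between two intervals; trimming each one
    merges its neighbours, shrinking the interval set by one entry."""
    candidates = []
    for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
        if s2 - e1 == 1:              # exactly one live snapid in the gap
            candidates.append(e1)
        # wider gaps need the whole run e1 .. s2-1 removed to coalesce
    return candidates

print(coalescing_snapids([(0, 3), (4, 6), (9, 12)]))
# [3] -- removing snapid 3 merges (0, 3) and (4, 6) into (0, 6)
```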

I hope this sheds some light on practical Ceph clusters in which
performance is bottlenecked not by I/O but by snapshot removal.

> 2) There may be issues with how rbd records what snapshots it is
> associated with? No idea about this; haven't heard of any.
>
> 3) Trimming snapshots requires IO. This is where most (all?) of the
> issues I've seen have come from; either in it being unscheduled IO
> that the rest of the system doesn't account for or throttle (as in
> the links you highlighted) or in admins overwhelming the IO capacity
> of their clusters. At this point I think we've got everything being
> properly scheduled so it shouldn't break your cluster, but you can
> build up large queues of deferred work.
As mentioned above, we have been seeing that trimming is much more
CPU-bound than IO-bound. Our disks are mostly sitting idle while the OSD
daemons completely peg all of the CPUs in the cluster. We are not in any
way IO-bound at this point, and we are certainly not overwhelming the IO
capacity of our clusters.

> -Greg
>
>> Looking forward to your thoughts. Thanks in advance!
>>
>> Cheers,
>> Florian

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
