Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Oct 16, 2014 at 2:04 AM, Florian Haas wrote: > Hi Greg, > > sorry, this somehow got stuck in my drafts folder. > > On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum wrote: >> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas wrote: >>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas wrote: On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil wrote: > On Sun, 21 Sep 2014, Florian Haas wrote: >> So yes, I think your patch absolutely still has merit, as would any >> means of reducing the number of snapshots an OSD will trim in one go. >> As it is, the situation looks really really bad, specifically >> considering that RBD and RADOS are meant to be super rock solid, as >> opposed to say CephFS which is in an experimental state. And contrary >> to CephFS snapshots, I can't recall any documentation saying that RBD >> snapshots will break your system. > > Yeah, it sounds like a separate issue, and no, the limit is not > documented because it's definitely not the intended behavior. :) > > ...and I see you already have a log attached to #9503. Will take a look. I've already updated that issue in Redmine, but for the list archives I should also add this here: Dan's patch for #9503, together with Sage's for #9487, makes the problem go away in an instant. I've already pointed out that I owe Dan dinner, and Sage, well I already owe Sage pretty much lifelong full board. :) >>> >>> Looks like I was bit too eager: while the cluster is behaving nicely >>> with these patches while nothing happens to any OSDs, it does flag PGs >>> as incomplete when an OSD goes down. Once the mon osd down out >>> interval expires things seem to recover/backfill normally, but it's >>> still disturbing to see this in the interim. >>> >>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from >>> one of the affected PGs, within the mon osd down out interval, while >>> it was marked incomplete. >>> >>> Dan or Sage, any ideas as to what might be causing this? 
>> >> That *looks* like it's just because the pool has both size and >> min_size set to 2? > > Correct. But the documentation did not reflect that this is a > perfectly expected side effect of having min_size > 1. > > pg-states.rst says: > > *Incomplete* > Ceph detects that a placement group is missing a necessary period of history > from its log. If you see this state, report a bug, and try to start any > failed OSDs that may contain the needed information. > > So if min_size > 1 and replicas < min_size, then the incomplete state > is not a bug but a perfectly expected occurrence, correct? > > It's still a bit weird in that the PG seems to behave differently > depending on min_size. If min_size == 1 (default), then a PG with no > remaining replicas is stale, unless a replica failed first and the > primary was written to, after which it also failed, and the replica > then comes up and can't go primary because it now has outdated data, > in which case the PG goes "down". It never goes "incomplete". > > So is the documentation wrong, or is there something fishy with the > reported state of the PGs? I guess the documentation is wrong, although I thought we'd fixed that particular one. :/ Giant actually distinguishes between these conditions by adding an "undersized" state to the PG, so it'll be easier to diagnose. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Greg, sorry, this somehow got stuck in my drafts folder. On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum wrote: > On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas wrote: >> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas wrote: >>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil wrote: On Sun, 21 Sep 2014, Florian Haas wrote: > So yes, I think your patch absolutely still has merit, as would any > means of reducing the number of snapshots an OSD will trim in one go. > As it is, the situation looks really really bad, specifically > considering that RBD and RADOS are meant to be super rock solid, as > opposed to say CephFS which is in an experimental state. And contrary > to CephFS snapshots, I can't recall any documentation saying that RBD > snapshots will break your system. Yeah, it sounds like a separate issue, and no, the limit is not documented because it's definitely not the intended behavior. :) ...and I see you already have a log attached to #9503. Will take a look. >>> >>> I've already updated that issue in Redmine, but for the list archives >>> I should also add this here: Dan's patch for #9503, together with >>> Sage's for #9487, makes the problem go away in an instant. I've >>> already pointed out that I owe Dan dinner, and Sage, well I already >>> owe Sage pretty much lifelong full board. :) >> >> Looks like I was bit too eager: while the cluster is behaving nicely >> with these patches while nothing happens to any OSDs, it does flag PGs >> as incomplete when an OSD goes down. Once the mon osd down out >> interval expires things seem to recover/backfill normally, but it's >> still disturbing to see this in the interim. >> >> I've updated http://tracker.ceph.com/issues/9503 with a pg query from >> one of the affected PGs, within the mon osd down out interval, while >> it was marked incomplete. >> >> Dan or Sage, any ideas as to what might be causing this? > > That *looks* like it's just because the pool has both size and > min_size set to 2? Correct. 
But the documentation did not reflect that this is a perfectly expected side effect of having min_size > 1. pg-states.rst says:

*Incomplete*
  Ceph detects that a placement group is missing a necessary period of history
  from its log. If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

So if min_size > 1 and replicas < min_size, then the incomplete state is not a bug but a perfectly expected occurrence, correct?

It's still a bit weird in that the PG seems to behave differently depending on min_size. If min_size == 1 (default), then a PG with no remaining replicas is stale, unless a replica failed first and the primary was written to, after which it also failed, and the replica then comes up and can't go primary because it now has outdated data, in which case the PG goes "down". It never goes "incomplete".

So is the documentation wrong, or is there something fishy with the reported state of the PGs?

Cheers,
Florian
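[Editor's illustration] The state behavior Florian describes can be summarized as a small decision table. The sketch below is a toy model of that table, not Ceph's actual peering code; the function name and the simplified inputs are invented for illustration only.

```python
# Toy model of the PG-state discussion above: how min_size changes the
# state a replicated PG reports when replicas fail. This is NOT Ceph's
# peering logic -- just the behavior described in the thread.

def pg_state(size, min_size, alive_replicas, survivor_has_latest=True):
    """Return a rough PG state for a replicated pool (sketch only)."""
    if alive_replicas == 0:
        return "stale"        # no OSD left to report the PG at all
    if not survivor_has_latest:
        return "down"         # survivor missed writes; can't go primary
    if alive_replicas < min_size:
        return "incomplete"   # pre-Giant; Giant reports "undersized" instead
    return "active"

# size=2, min_size=2: losing a single replica drops below min_size.
assert pg_state(2, 2, 1) == "incomplete"
# size=2, min_size=1 (default): one up-to-date survivor keeps serving I/O.
assert pg_state(2, 1, 1) == "active"
```

Per Greg's reply, Giant splits this case out into a distinct "undersized" state, so the "incomplete" branch above matches pre-Giant behavior only.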
Re: snap_trimming + backfilling is inefficient with many purged_snaps
It's in giant, the firefly backport will happen once we are happy with the fallout from the 80.7 thing. -Sam On Wed, Oct 15, 2014 at 7:47 AM, Dan Van Der Ster wrote: > Hi Sage, > >> On 19 Sep 2014, at 17:37, Dan Van Der Ster wrote: >> >> September 19 2014 5:19 PM, "Sage Weil" wrote: >>> On Fri, 19 Sep 2014, Dan van der Ster wrote: >>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster wrote: >> On 19 Sep 2014, at 08:12, Florian Haas wrote: >> >> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: >>> On Fri, 19 Sep 2014, Florian Haas wrote: Hi Sage, was the off-list reply intentional? >>> >>> Whoops! Nope :) >>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: >> So, disaster is a pretty good description. Would anyone from the core >> team like to suggest another course of action or workaround, or are >> Dan and I generally on the right track to make the best out of a >> pretty bad situation? > > The short term fix would probably be to just prevent backfill for the > time > being until the bug is fixed. As in, osd max backfills = 0? >>> >>> Yeah :) >>> >>> Just managed to reproduce the problem... >>> >>> sage >> >> Saw the wip branch. Color me freakishly impressed on the turnaround. :) >> Thanks! > > Indeed :) Thanks Sage! > wip-9487-dumpling fixes the problem on my test cluster. Trying in prod > now? Final update, after 4 hours in prod and after draining 8 OSDs -- zero slow requests :) >>> >>> That's great news! >>> >>> But, please be careful. This code hasn't been reiewed yet or been through >>> any testing! I would hold off on further backfills until it's merged. > > > Any news on those merges? It would be good to get this fixed on the dumpling > and firefly branches. 
> We're kind of stuck at the moment :(
>
> Cheers,
> Dan
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Sage, > On 19 Sep 2014, at 17:37, Dan Van Der Ster wrote: > > September 19 2014 5:19 PM, "Sage Weil" wrote: >> On Fri, 19 Sep 2014, Dan van der Ster wrote: >> >>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster >>> wrote: > On 19 Sep 2014, at 08:12, Florian Haas wrote: > > On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: >> On Fri, 19 Sep 2014, Florian Haas wrote: >>> Hi Sage, >>> >>> was the off-list reply intentional? >> >> Whoops! Nope :) >> >>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: > So, disaster is a pretty good description. Would anyone from the core > team like to suggest another course of action or workaround, or are > Dan and I generally on the right track to make the best out of a > pretty bad situation? The short term fix would probably be to just prevent backfill for the time being until the bug is fixed. >>> >>> As in, osd max backfills = 0? >> >> Yeah :) >> >> Just managed to reproduce the problem... >> >> sage > > Saw the wip branch. Color me freakishly impressed on the turnaround. :) > Thanks! Indeed :) Thanks Sage! wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now? >>> >>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero >>> slow requests :) >> >> That's great news! >> >> But, please be careful. This code hasn't been reiewed yet or been through >> any testing! I would hold off on further backfills until it's merged. Any news on those merges? It would be good to get this fixed on the dumpling and firefly branches. We're kind of stuck at the moment :( Cheers, Dan -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil wrote:
> Sam and I discussed this on IRC and we think we have two simpler patches
> that solve the problem more directly. See wip-9487.

So I understand this makes Dan's patch (and the config parameter that it introduces) unnecessary, but is it correct to assume that, just like Dan's patch, yours too will not be effective unless osd snap trim sleep > 0?

> Queued for testing now.
> Once that passes we can backport and test for firefly and dumpling too.
>
> Note that this won't make the next dumpling or firefly point releases
> (which are imminent). Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of snapshots with <=0.67.11, what's the plan to get them going again until 0.67.12 comes out? Install the autobuild package from the wip branch?

Cheers,
Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Sam and I discussed this on IRC and we think we have two simpler patches that solve the problem more directly. See wip-9487.

Queued for testing now. Once that passes we can backport and test for firefly and dumpling too.

Note that this won't make the next dumpling or firefly point releases (which are imminent). Should be in the next ones, though.

Upside is it looks like Sam found #9113 (snaptrimmer memory leak) at the same time, yay!

sage
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas wrote: > On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas wrote: >> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil wrote: >>> On Sun, 21 Sep 2014, Florian Haas wrote: So yes, I think your patch absolutely still has merit, as would any means of reducing the number of snapshots an OSD will trim in one go. As it is, the situation looks really really bad, specifically considering that RBD and RADOS are meant to be super rock solid, as opposed to say CephFS which is in an experimental state. And contrary to CephFS snapshots, I can't recall any documentation saying that RBD snapshots will break your system. >>> >>> Yeah, it sounds like a separate issue, and no, the limit is not >>> documented because it's definitely not the intended behavior. :) >>> >>> ...and I see you already have a log attached to #9503. Will take a look. >> >> I've already updated that issue in Redmine, but for the list archives >> I should also add this here: Dan's patch for #9503, together with >> Sage's for #9487, makes the problem go away in an instant. I've >> already pointed out that I owe Dan dinner, and Sage, well I already >> owe Sage pretty much lifelong full board. :) > > Looks like I was bit too eager: while the cluster is behaving nicely > with these patches while nothing happens to any OSDs, it does flag PGs > as incomplete when an OSD goes down. Once the mon osd down out > interval expires things seem to recover/backfill normally, but it's > still disturbing to see this in the interim. > > I've updated http://tracker.ceph.com/issues/9503 with a pg query from > one of the affected PGs, within the mon osd down out interval, while > it was marked incomplete. > > Dan or Sage, any ideas as to what might be causing this? That *looks* like it's just because the pool has both size and min_size set to 2? 
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas wrote: > On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil wrote: >> On Sun, 21 Sep 2014, Florian Haas wrote: >>> So yes, I think your patch absolutely still has merit, as would any >>> means of reducing the number of snapshots an OSD will trim in one go. >>> As it is, the situation looks really really bad, specifically >>> considering that RBD and RADOS are meant to be super rock solid, as >>> opposed to say CephFS which is in an experimental state. And contrary >>> to CephFS snapshots, I can't recall any documentation saying that RBD >>> snapshots will break your system. >> >> Yeah, it sounds like a separate issue, and no, the limit is not >> documented because it's definitely not the intended behavior. :) >> >> ...and I see you already have a log attached to #9503. Will take a look. > > I've already updated that issue in Redmine, but for the list archives > I should also add this here: Dan's patch for #9503, together with > Sage's for #9487, makes the problem go away in an instant. I've > already pointed out that I owe Dan dinner, and Sage, well I already > owe Sage pretty much lifelong full board. :) Looks like I was bit too eager: while the cluster is behaving nicely with these patches while nothing happens to any OSDs, it does flag PGs as incomplete when an OSD goes down. Once the mon osd down out interval expires things seem to recover/backfill normally, but it's still disturbing to see this in the interim. I've updated http://tracker.ceph.com/issues/9503 with a pg query from one of the affected PGs, within the mon osd down out interval, while it was marked incomplete. Dan or Sage, any ideas as to what might be causing this? Cheers, Florian -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil wrote:
> On Sun, 21 Sep 2014, Florian Haas wrote:
>> So yes, I think your patch absolutely still has merit, as would any
>> means of reducing the number of snapshots an OSD will trim in one go.
>> As it is, the situation looks really really bad, specifically
>> considering that RBD and RADOS are meant to be super rock solid, as
>> opposed to say CephFS which is in an experimental state. And contrary
>> to CephFS snapshots, I can't recall any documentation saying that RBD
>> snapshots will break your system.
>
> Yeah, it sounds like a separate issue, and no, the limit is not
> documented because it's definitely not the intended behavior. :)
>
> ...and I see you already have a log attached to #9503. Will take a look.

I've already updated that issue in Redmine, but for the list archives I should also add this here: Dan's patch for #9503, together with Sage's for #9487, makes the problem go away in an instant. I've already pointed out that I owe Dan dinner, and Sage, well, I already owe Sage pretty much lifelong full board. :)

Everyone with a ton of snapshots in their clusters (not sure where the threshold is, but it gets nasty somewhere between 1,000 and 10,000, I imagine) should probably update to 0.67.11 and 0.80.6 as soon as they come out, otherwise Terrible Things Will Happen™ if you're ever forced to delete a large number of snaps at once.

Thanks again to Dan and Sage,
Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sun, 21 Sep 2014, Florian Haas wrote:
> So yes, I think your patch absolutely still has merit, as would any
> means of reducing the number of snapshots an OSD will trim in one go.
> As it is, the situation looks really really bad, specifically
> considering that RBD and RADOS are meant to be super rock solid, as
> opposed to say CephFS which is in an experimental state. And contrary
> to CephFS snapshots, I can't recall any documentation saying that RBD
> snapshots will break your system.

Yeah, it sounds like a separate issue, and no, the limit is not documented because it's definitely not the intended behavior. :)

...and I see you already have a log attached to #9503. Will take a look.

Thanks!
sage
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sat, 20 Sep 2014, Alphe Salas wrote:
> Real field testings and proof workout are better than any unit testing ... I
> would follow Dan s notice of resolution because it based on real problem and
> not fony style test ground.

It's been reviewed and looks right, but the rados torture tests are pretty ... torturous, and this code is delicate. I would still wait.

> Sage apart that problem is there a solution to the ever expending replicas
> problem ?

Discard for the kernel RBD client should go upstream this cycle. As for RADOS consuming more data when RBD blocks are overwritten, I still have yet to see any actual evidence of this, and have a hard time seeing how it could happen. A sequence of steps to reproduce would be the next step.

sage
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster wrote: > Hi Florian, > > September 21 2014 3:33 PM, "Florian Haas" wrote: >> That said, I'm not sure that wip-9487-dumpling is the final fix to the >> issue. On the system where I am seeing the issue, even with the fix >> deployed, osd's still not only go crazy snap trimming (which by itself >> would be understandable, as the system has indeed recently had >> thousands of snapshots removed), but they also still produce the >> previously seen ENOENT messages indicating they're trying to trim >> snaps that aren't there. >> > > You should be able to tell exactly how many snaps need to be trimmed. Check > the current purged_snaps with > > ceph pg x.y query > > and also check the snap_trimq from debug_osd=10. The problem fixed in > wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in > your cluster purged_snaps is "correct" (which it should be after the fix from > Sage), and it still has lots of snaps to trim, then I believe the only thing > to do is let those snaps all get trimmed. (my other patch linked sometime > earlier in this thread might help by breaking up all that trimming work into > smaller pieces, but that was never tested). Yes, it does indeed look like the system does have thousands of snapshots left to trim. That said, since the PGs are locked during this time, this creates a situation where the cluster is becoming unusable with no way for the user to recover. > Entering the realm of speculation, I wonder if your OSDs are getting > interrupted, marked down, out, or crashing before they have the opportunity > to persist purged_snaps? purged_snaps is updated in > ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to > actually send that transaction to its peers, so then eventually it or the new > primary needs to start again, and no progress is ever made. 
If this is what > is happening on your cluster, then again, perhaps my osd_snap_trim_max patch > could be a solution. Since the snap trimmer immediately jacks the affected OSDs up to 100% CPU utilization, and they stop even responding to heartbeats, yes they do get marked down and that makes the issue much worse. Even when setting nodown, though, then that doesn't change the fact that the affected OSDs just spin practically indefinitely. So, even with the patch for 9487, which fixes *your* issue of the cluster trying to trim tons of snaps when in fact it should be trimming only a handful, the user is still in a world of pain when they do indeed have tons of snaps to trim. And obviously, neither of osd max backfills nor osd recovery max active help here, because even a single backfill/recovery makes the OSD go nuts. There is the silly option of setting osd_snap_trim_sleep to say 61 minutes, and restarting the ceph-osd daemons before the snap trim can kick in, i.e. hourly, via a cron job. Of course, while this prevents the OSD from going into a death spin, it only perpetuates the problem until a patch for this issue is available, because snap trimming never even runs, let alone completes. This is particularly bad because a user can get themselves a non-functional cluster simply by trying to delete thousands of snapshots at once. If you consider a tiny virtualization cluster of just 100 persistent VMs, out of which you take one snapshot an hour, then deleting the snapshots taken in one month puts you well above that limit. So we're not talking about outrageous numbers here. I don't think anyone can fault any user for attempting this. What makes the situation even worse is that there is no cluster-wide limit to the number of snapshots, or even say snapshots per RBD volume, or snapshots per PG, nor any limit on the number of snapshots deleted concurrently. 
So yes, I think your patch absolutely still has merit, as would any means of reducing the number of snapshots an OSD will trim in one go. As it is, the situation looks really, really bad, specifically considering that RBD and RADOS are meant to be super rock solid, as opposed to, say, CephFS, which is in an experimental state. And contrary to CephFS snapshots, I can't recall any documentation saying that RBD snapshots will break your system.

Cheers,
Florian
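[Editor's illustration] The stopgap Florian describes, a snap-trim sleep longer than the OSD restart interval so trimming never actually runs, would look roughly like the fragment below. This is a sketch of the workaround from the thread, not a recommendation; the 3660-second value is simply the "61 minutes" from the text, and the cron line's service command varies by distro and init system.

```ini
; ceph.conf sketch: defer snap trimming by sleeping longer than the
; hourly restart below. osd_snap_trim_sleep is in seconds (default 0).
[osd]
osd snap trim sleep = 3660
```

Paired with an hourly restart, e.g. a crontab entry like `0 * * * * service ceph restart osd` (hypothetical invocation; adjust for your init scripts). As Florian notes, this only postpones the problem: trimming never completes, so the backlog keeps growing until a real fix lands.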
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Florian, September 21 2014 3:33 PM, "Florian Haas" wrote: > That said, I'm not sure that wip-9487-dumpling is the final fix to the > issue. On the system where I am seeing the issue, even with the fix > deployed, osd's still not only go crazy snap trimming (which by itself > would be understandable, as the system has indeed recently had > thousands of snapshots removed), but they also still produce the > previously seen ENOENT messages indicating they're trying to trim > snaps that aren't there. > You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with ceph pg x.y query and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested). Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps? purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, so then eventually it or the new primary needs to start again, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution. Cheers, Dan > That system, however, has PGs marked as recovering, not backfilling as > in Dan's system. Not sure if wip-9487 falls short of fixing the issue > at its root. Sage, whenever you have time, would you mind commenting? 
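[Editor's illustration] Dan's suggestion above, comparing purged_snaps from `ceph pg x.y query` against the snaps the cluster has deleted, can be sketched as a small script. The `[start~length,...]` interval-set notation and decimal snap ids are assumptions about how the OSD prints purged_snaps; check them against your own pg query output, and note that `snaps_left_to_trim` is an invented helper name, not a Ceph API.

```python
# Sketch: estimate the snap-trim backlog for one PG by subtracting the
# purged_snaps interval set (as reported by "ceph pg x.y query") from
# the set of snaps that have been deleted.

def parse_interval_set(s):
    """Parse an interval-set string like '[1~3,8~2]' (start~length pairs,
    decimal ids assumed) into a set of snap ids, e.g. {1,2,3,8,9}."""
    snaps = set()
    for part in s.strip("[]").split(","):
        if not part:
            continue
        start, length = (int(x) for x in part.split("~"))
        snaps.update(range(start, start + length))
    return snaps

def snaps_left_to_trim(deleted_snaps, purged_str):
    """Deleted-but-not-yet-purged snaps, i.e. the remaining trim queue."""
    return sorted(set(deleted_snaps) - parse_interval_set(purged_str))

# Example: snaps 1-10 were deleted, but this PG has only purged 1-3 and 8-9.
assert snaps_left_to_trim(range(1, 11), "[1~3,8~2]") == [4, 5, 6, 7, 10]
```

If the backlog computed this way stays constant across OSD restarts, that would be consistent with Dan's speculation that purged_snaps is never being persisted.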
> Cheers,
> Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sat, Sep 20, 2014 at 9:08 PM, Alphe Salas wrote:
> Real field testings and proof workout are better than any unit testing ... I
> would follow Dan s notice of resolution because it based on real problem and
> not fony style test ground.

That statement is almost an insult to the authors and maintainers of the testing framework around Ceph. Therefore, I'm taking the liberty to register my objection.

That said, I'm not sure that wip-9487-dumpling is the final fix to the issue. On the system where I am seeing the issue, even with the fix deployed, OSDs still not only go crazy snap trimming (which by itself would be understandable, as the system has indeed recently had thousands of snapshots removed), but they also still produce the previously seen ENOENT messages indicating they're trying to trim snaps that aren't there.

That system, however, has PGs marked as recovering, not backfilling as in Dan's system. Not sure if wip-9487 falls short of fixing the issue at its root. Sage, whenever you have time, would you mind commenting?

Cheers,
Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
September 19 2014 5:19 PM, "Sage Weil" wrote: > On Fri, 19 Sep 2014, Dan van der Ster wrote: > >> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster >> wrote: On 19 Sep 2014, at 08:12, Florian Haas wrote: On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: > On Fri, 19 Sep 2014, Florian Haas wrote: >> Hi Sage, >> >> was the off-list reply intentional? > > Whoops! Nope :) > >> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: So, disaster is a pretty good description. Would anyone from the core team like to suggest another course of action or workaround, or are Dan and I generally on the right track to make the best out of a pretty bad situation? >>> >>> The short term fix would probably be to just prevent backfill for the >>> time >>> being until the bug is fixed. >> >> As in, osd max backfills = 0? > > Yeah :) > > Just managed to reproduce the problem... > > sage Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks! >>> >>> Indeed :) Thanks Sage! >>> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now? >> >> Final update, after 4 hours in prod and after draining 8 OSDs -- zero >> slow requests :) > > That's great news! > > But, please be careful. This code hasn't been reiewed yet or been through > any testing! I would hold off on further backfills until it's merged. Roger; I've been watching it very closely and so far it seems to work very well. Looking forward to that merge :) Cheers, Dan > > Thanks! > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, 19 Sep 2014, Dan van der Ster wrote:
> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster wrote:
> >> On 19 Sep 2014, at 08:12, Florian Haas wrote:
> >>
> >> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote:
> >>> On Fri, 19 Sep 2014, Florian Haas wrote:
> >>>> Hi Sage,
> >>>>
> >>>> was the off-list reply intentional?
> >>>
> >>> Whoops! Nope :)
> >>>
> >>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote:
> >>>>>> So, disaster is a pretty good description. Would anyone from the core
> >>>>>> team like to suggest another course of action or workaround, or are
> >>>>>> Dan and I generally on the right track to make the best out of a
> >>>>>> pretty bad situation?
> >>>>>
> >>>>> The short term fix would probably be to just prevent backfill for the
> >>>>> time being until the bug is fixed.
> >>>>
> >>>> As in, osd max backfills = 0?
> >>>
> >>> Yeah :)
> >>>
> >>> Just managed to reproduce the problem...
> >>>
> >>> sage
> >>
> >> Saw the wip branch. Color me freakishly impressed on the turnaround. :)
> >> Thanks!
> >
> > Indeed :) Thanks Sage!
> > wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now?
>
> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
> slow requests :)

That's great news!

But, please be careful. This code hasn't been reviewed yet or been through any testing! I would hold off on further backfills until it's merged.

Thanks!
sage
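[Editor's illustration] The short-term workaround agreed on in the exchange above ("osd max backfills = 0") corresponds to a fragment like this; a sketch only, not part of the wip-9487 fix itself:

```ini
; ceph.conf sketch: prevent any new backfills until the fix is merged.
[osd]
osd max backfills = 0
```

At runtime it can also be injected without restarting OSDs, along the lines of `ceph tell osd.* injectargs '--osd-max-backfills 0'` (verify the exact command form against your release). Remember to raise it back to its previous value once the patched packages are deployed, or recovery will never make progress.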
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster wrote: >> On 19 Sep 2014, at 08:12, Florian Haas wrote: >> >> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: >>> On Fri, 19 Sep 2014, Florian Haas wrote: Hi Sage, was the off-list reply intentional? >>> >>> Whoops! Nope :) >>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: >> So, disaster is a pretty good description. Would anyone from the core >> team like to suggest another course of action or workaround, or are >> Dan and I generally on the right track to make the best out of a >> pretty bad situation? > > The short term fix would probably be to just prevent backfill for the time > being until the bug is fixed. As in, osd max backfills = 0? >>> >>> Yeah :) >>> >>> Just managed to reproduce the problem... >>> >>> sage >> >> Saw the wip branch. Color me freakishly impressed on the turnaround. :) >> Thanks! > > Indeed :) Thanks Sage! > wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now… Final update, after 4 hours in prod and after draining 8 OSDs -- zero slow requests :) Thanks again! Dan
Re: snap_trimming + backfilling is inefficient with many purged_snaps
> On 19 Sep 2014, at 08:12, Florian Haas wrote: > > On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: >> On Fri, 19 Sep 2014, Florian Haas wrote: >>> Hi Sage, >>> >>> was the off-list reply intentional? >> >> Whoops! Nope :) >> >>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: > So, disaster is a pretty good description. Would anyone from the core > team like to suggest another course of action or workaround, or are > Dan and I generally on the right track to make the best out of a > pretty bad situation? The short term fix would probably be to just prevent backfill for the time being until the bug is fixed. >>> >>> As in, osd max backfills = 0? >> >> Yeah :) >> >> Just managed to reproduce the problem... >> >> sage > > Saw the wip branch. Color me freakishly impressed on the turnaround. :) > Thanks! Indeed :) Thanks Sage! wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now… Cheers, Dan
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil wrote: > On Fri, 19 Sep 2014, Florian Haas wrote: >> Hi Sage, >> >> was the off-list reply intentional? > > Whoops! Nope :) > >> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: >> >> So, disaster is a pretty good description. Would anyone from the core >> >> team like to suggest another course of action or workaround, or are >> >> Dan and I generally on the right track to make the best out of a >> >> pretty bad situation? >> > >> > The short term fix would probably be to just prevent backfill for the time >> > being until the bug is fixed. >> >> As in, osd max backfills = 0? > > Yeah :) > > Just managed to reproduce the problem... > > sage Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks! Cheers, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, 19 Sep 2014, Florian Haas wrote: > Hi Sage, > > was the off-list reply intentional? Whoops! Nope :) > On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil wrote: > >> So, disaster is a pretty good description. Would anyone from the core > >> team like to suggest another course of action or workaround, or are > >> Dan and I generally on the right track to make the best out of a > >> pretty bad situation? > > > > The short term fix would probably be to just prevent backfill for the time > > being until the bug is fixed. > > As in, osd max backfills = 0? Yeah :) Just managed to reproduce the problem... sage > > The root of the problem seems to be that it is trying to trim snaps that > > aren't there. I'm trying to reproduce the issue now! Hopefully the fix > > is simple... > > > > http://tracker.ceph.com/issues/9487 > > > > Thanks! > > sage > > Thanks. :) > > Cheers, > Florian >
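For anyone hitting this in production, the short-term mitigation above (osd max backfills = 0) can be applied at runtime without restarting daemons. A rough sketch, not a tested procedure; double-check the injectargs spelling against your release, and note the restore value at the end is a placeholder for whatever your cluster ran with before:

```shell
# Pause all backfill cluster-wide until the fix is merged:
ceph tell osd.* injectargs '--osd-max-backfills 0'

# To make the change survive OSD restarts, also add it to ceph.conf
# under the [osd] section:
#   osd max backfills = 0

# After deploying the fixed build, restore your previous setting, e.g.:
ceph tell osd.* injectargs '--osd-max-backfills 10'
```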
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster wrote: > Hi, > > September 18 2014 9:03 PM, "Florian Haas" wrote: >> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster >> wrote: >> >>> Hi Florian, >>> >>> On Sep 18, 2014 7:03 PM, Florian Haas wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster wrote: >>> Do I understand your issue report correctly in that you have found >>> setting osd_snap_trim_sleep to be ineffective, because it's being >>> applied when iterating from PG to PG, rather than from snap to snap? >>> If so, then I'm guessing that that can hardly be intentional… > > > I’m beginning to agree with you on that guess. AFAICT, the normal > behavior of the snap trimmer >>> is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs >>> at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. >>> >>> True, as am I. The current sleep is useful in this case, but since we'd >>> normally only expect up >> to >>> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs >>> would finish rather >>> quickly anyway. Latency would surely be increased momentarily, but I >>> wouldn't expect 90s slow >>> requests like I have with the 3 snap_trimq single PG. >>> >>> Possibly the sleep is useful in both places. >>> > We could move the snap trim sleep into the SnapTrimmer state machine, for > example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to >> the sleep. 
> > To solve that we could limit the number of trims per instance of the > SnapTrimmer, like I’ve >> done in this pull req: https://github.com/ceph/ceph/pull/2516 > Breaking out of the trimmer like that should allow IOs to the trimming PG > to get through. > > The second aspect of this issue is why are the purged_snaps being lost to > begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. >>> of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one signature >>> of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure if I confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. >>> >>> With such a long snap_trimq there in your log, I suspect you're seeing the >>> exact same behavior as >> I >>> am. In my case the first snap trimmed is snap 1, of course because that is >>> the first rm'd snap, >> and >>> the contents of your pool are surely different. I also see the ENOENT >>> messages... again >> confirming >>> those snaps were already trimmed. Anyway, what I've observed is that a >>> large snap_trimq like that >>> will block the OSD until they are all re-trimmed. >> >> That's... a mess. >> >> So what is your workaround for recovery? 
My hunch would be to >> >> - stop all access to the cluster; >> - set nodown and noout so that other OSDs don't mark spinning OSDs >> down (which would cause all sorts of primary and PG reassignments, >> useless backfill/recovery when mon osd down out interval expires, >> etc.); >> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 >> so that at least *between* PGs, the OSD has a chance to respond to >> heartbeats and do whatever else it needs to do; >> - let the snap trim play itself out over several hours (days?). >> > > What I've been doing is I just continue draining my OSDs, two at a time. Each > time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour > it takes to drain) while a single PG re-trims, leading to ~100 slow requests. > The OSD must still be responding to the peer pings, since other OSDs do not > mark it down. Luckily this doesn't happen with every single movement of our > pool 5 PGs, otherwise it would be a disaster like you said. So just to clarify, what you're doing is out of the
Re: snap_trimming + backfilling is inefficient with many purged_snaps
-- Dan van der Ster || Data & Storage Services || CERN IT Department -- September 18 2014 9:12 PM, "Dan van der Ster" wrote: > Hi, > > September 18 2014 9:03 PM, "Florian Haas" wrote: > >> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster >> wrote: >> >>> Hi Florian, >>> >>> On Sep 18, 2014 7:03 PM, Florian Haas wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster wrote: >>> Do I understand your issue report correctly in that you have found >>> setting osd_snap_trim_sleep to be ineffective, because it's being >>> applied when iterating from PG to PG, rather than from snap to snap? >>> If so, then I'm guessing that that can hardly be intentional… > > I’m beginning to agree with you on that guess. AFAICT, the normal > behavior of the snap trimmer >>> >>> is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs >>> >>> at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. >>> >>> True, as am I. The current sleep is useful in this case, but since we'd >>> normally only expect up >> >> to >>> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs >>> would finish rather >>> quickly anyway. Latency would surely be increased momentarily, but I >>> wouldn't expect 90s slow >>> requests like I have with the 3 snap_trimq single PG. >>> >>> Possibly the sleep is useful in both places. >>> > We could move the snap trim sleep into the SnapTrimmer state machine, for > example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. 
And it would be locked for even longer now due to >> >> the sleep. > To solve that we could limit the number of trims per instance of the > SnapTrimmer, like I’ve >> >> done in this pull req: https://github.com/ceph/ceph/pull/2516 > Breaking out of the trimmer like that should allow IOs to the trimming PG > to get through. > > The second aspect of this issue is why are the purged_snaps being lost to > begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps > (e.g. >>> >>> of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one > signature >>> >>> of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure if I confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. >>> >>> With such a long snap_trimq there in your log, I suspect you're seeing the >>> exact same behavior > as >> >> I >>> am. In my case the first snap trimmed is snap 1, of course because that is >>> the first rm'd snap, >> >> and >>> the contents of your pool are surely different. I also see the ENOENT >>> messages... again >> >> confirming >>> those snaps were already trimmed. Anyway, what I've observed is that a >>> large snap_trimq like > that >>> will block the OSD until they are all re-trimmed. >> >> That's... a mess. >> >> So what is your workaround for recovery? 
My hunch would be to >> >> - stop all access to the cluster; >> - set nodown and noout so that other OSDs don't mark spinning OSDs >> down (which would cause all sorts of primary and PG reassignments, >> useless backfill/recovery when mon osd down out interval expires, >> etc.); >> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 >> so that at least *between* PGs, the OSD has a chance to respond to >> heartbeats and do whatever else it needs to do; >> - let the snap trim play itself out over several hours (days?). > > What I've been doing is I just continue draining my OSDs, two at a time. Each > time, 1-2 other OSDs > become blocked for a couple minutes (out of the ~1 hour it takes to drain) > while a single PG > re-trims, leading to ~100 slow requests. The OSD must still be responding to > the peer pings, since > other OSDs do not mark it down. Luckily this doesn't
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi, September 18 2014 9:03 PM, "Florian Haas" wrote: > On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster > wrote: > >> Hi Florian, >> >> On Sep 18, 2014 7:03 PM, Florian Haas wrote: >>> Hi Dan, >>> >>> saw the pull request, and can confirm your observations, at least >>> partially. Comments inline. >>> >>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster >>> wrote: >> Do I understand your issue report correctly in that you have found >> setting osd_snap_trim_sleep to be ineffective, because it's being >> applied when iterating from PG to PG, rather than from snap to snap? >> If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer >> is >>> to trim one single snap, the one which is in the snap_trimq but not yet in >>> purged_snaps. So the >>> only time the current sleep implementation could be useful is if we rm’d a >>> snap across many PGs >> at >>> once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem >>> anyway since you’d at >>> most need to trim O(100) PGs. >>> >>> Hmm. I'm actually seeing this in a system where the problematic snaps >>> could *only* have been RBD snaps. >> >> True, as am I. The current sleep is useful in this case, but since we'd >> normally only expect up > to >> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs >> would finish rather >> quickly anyway. Latency would surely be increased momentarily, but I >> wouldn't expect 90s slow >> requests like I have with the 3 snap_trimq single PG. >> >> Possibly the sleep is useful in both places. >> We could move the snap trim sleep into the SnapTrimmer state machine, for example in >>> ReplicatedPG::NotTrimming::react. This should allow other IOs to get >>> through to the OSD, but of >>> course the trimming PG would remain locked. And it would be locked for even >>> longer now due to > the >>> sleep. 
To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve > done >>> in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve >>> managed to reproduce that on my test cluster. All you have to do is create >>> many pool snaps (e.g. >> of >>> a nearly empty pool), then rmsnap all those snapshots. Then use crush >>> reweight to move the PGs >>> around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, >>> which is one signature >> of >>> this lost purged_snaps issue. To reproduce slow requests the number of >>> snaps purged needs to be >>> O(1). >>> >>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to >>> purged_snaps", but only after the snap has been purged. See >>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the >>> fact that the OSD tries to trim a snap only to get an ENOENT is >>> probably indicative of something being fishy with the snaptrimq and/or >>> the purged_snaps list as well. >> >> With such a long snap_trimq there in your log, I suspect you're seeing the >> exact same behavior as > I >> am. In my case the first snap trimmed is snap 1, of course because that is >> the first rm'd snap, > and >> the contents of your pool are surely different. I also see the ENOENT >> messages... again > confirming >> those snaps were already trimmed. Anyway, what I've observed is that a large >> snap_trimq like that >> will block the OSD until they are all re-trimmed. > > That's... a mess. > > So what is your workaround for recovery? 
My hunch would be to > > - stop all access to the cluster; > - set nodown and noout so that other OSDs don't mark spinning OSDs > down (which would cause all sorts of primary and PG reassignments, > useless backfill/recovery when mon osd down out interval expires, > etc.); > - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 > so that at least *between* PGs, the OSD has a chance to respond to > heartbeats and do whatever else it needs to do; > - let the snap trim play itself out over several hours (days?). > What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said. Cheers, Dan > That sounds utterly awful, but if anyone has a better idea (other than > "wait until the patch is merged"), I'd be all ears. > > Cheers > Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour wrote: > Hi Florian, > > On Sep 18, 2014 7:03 PM, Florian Haas wrote: >> >> Hi Dan, >> >> saw the pull request, and can confirm your observations, at least >> partially. Comments inline. >> >> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster >> wrote: >> >>> Do I understand your issue report correctly in that you have found >> >>> setting osd_snap_trim_sleep to be ineffective, because it's being >> >>> applied when iterating from PG to PG, rather than from snap to snap? >> >>> If so, then I'm guessing that that can hardly be intentional… >> > >> > >> > I’m beginning to agree with you on that guess. AFAICT, the normal behavior >> > of the snap trimmer is to trim one single snap, the one which is in the >> > snap_trimq but not yet in purged_snaps. So the only time the current sleep >> > implementation could be useful is if we rm’d a snap across many PGs at >> > once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem >> > anyway since you’d at most need to trim O(100) PGs. >> >> Hmm. I'm actually seeing this in a system where the problematic snaps >> could *only* have been RBD snaps. >> > > True, as am I. The current sleep is useful in this case, but since we'd > normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap > across all of those PGs would finish rather quickly anyway. Latency would > surely be increased momentarily, but I wouldn't expect 90s slow requests like > I have with the 3 snap_trimq single PG. > > Possibly the sleep is useful in both places. > >> > We could move the snap trim sleep into the SnapTrimmer state machine, for >> > example in ReplicatedPG::NotTrimming::react. This should allow other IOs >> > to get through to the OSD, but of course the trimming PG would remain >> > locked. And it would be locked for even longer now due to the sleep. 
>> > >> > To solve that we could limit the number of trims per instance of the >> > SnapTrimmer, like I’ve done in this pull req: >> > https://github.com/ceph/ceph/pull/2516 >> > Breaking out of the trimmer like that should allow IOs to the trimming PG >> > to get through. >> > >> > The second aspect of this issue is why are the purged_snaps being lost to >> > begin with. I’ve managed to reproduce that on my test cluster. All you >> > have to do is create many pool snaps (e.g. of a nearly empty pool), then >> > rmsnap all those snapshots. Then use crush reweight to move the PGs >> > around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, >> > which is one signature of this lost purged_snaps issue. To reproduce slow >> > requests the number of snaps purged needs to be O(1). >> >> Hmmm, I'm not sure if I confirm that. I see "adding snap X to >> purged_snaps", but only after the snap has been purged. See >> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the >> fact that the OSD tries to trim a snap only to get an ENOENT is >> probably indicative of something being fishy with the snaptrimq and/or >> the purged_snaps list as well. >> > > With such a long snap_trimq there in your log, I suspect you're seeing the > exact same behavior as I am. In my case the first snap trimmed is snap 1, of > course because that is the first rm'd snap, and the contents of your pool are > surely different. I also see the ENOENT messages... again confirming those > snaps were already trimmed. Anyway, what I've observed is that a large > snap_trimq like that will block the OSD until they are all re-trimmed. That's... a mess. So what is your workaround for recovery? 
My hunch would be to - stop all access to the cluster; - set nodown and noout so that other OSDs don't mark spinning OSDs down (which would cause all sorts of primary and PG reassignments, useless backfill/recovery when mon osd down out interval expires, etc.); - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that at least *between* PGs, the OSD has a chance to respond to heartbeats and do whatever else it needs to do; - let the snap trim play itself out over several hours (days?). That sounds utterly awful, but if anyone has a better idea (other than "wait until the patch is merged"), I'd be all ears. Cheers Florian
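Spelled out as commands, the hunch above would look roughly like the following. This is an untested sketch, not a recommendation; the sleep value is the same ballpark figure named in the list, and the flags are set and reverted with the standard osd set/unset commands:

```shell
# Keep peers from marking the trimming OSDs down or out:
ceph osd set nodown
ceph osd set noout

# Give each OSD a long pause between PGs (value in seconds):
ceph tell osd.* injectargs '--osd-snap-trim-sleep 10'

# ...once the re-trimming has played itself out, revert everything:
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0'
ceph osd unset nodown
ceph osd unset noout
```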
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster wrote: >>> Do I understand your issue report correctly in that you have found >>> setting osd_snap_trim_sleep to be ineffective, because it's being >>> applied when iterating from PG to PG, rather than from snap to snap? >>> If so, then I'm guessing that that can hardly be intentional… > > > I’m beginning to agree with you on that guess. AFAICT, the normal behavior of > the snap trimmer is to trim one single snap, the one which is in the > snap_trimq but not yet in purged_snaps. So the only time the current sleep > implementation could be useful is if we rm’d a snap across many PGs at once, > e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway > since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. > We could move the snap trim sleep into the SnapTrimmer state machine, for > example in ReplicatedPG::NotTrimming::react. This should allow other IOs to > get through to the OSD, but of course the trimming PG would remain locked. > And it would be locked for even longer now due to the sleep. > > To solve that we could limit the number of trims per instance of the > SnapTrimmer, like I’ve done in this pull req: > https://github.com/ceph/ceph/pull/2516 > Breaking out of the trimmer like that should allow IOs to the trimming PG to > get through. > > The second aspect of this issue is why are the purged_snaps being lost to > begin with. I’ve managed to reproduce that on my test cluster. All you have > to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap > all those snapshots. Then use crush reweight to move the PGs around. With > debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one > signature of this lost purged_snaps issue. 
To reproduce slow requests the > number of snaps purged needs to be O(1). Hmmm, I'm not sure if I confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. > Looking forward to any ideas someone might have. So am I. :) Cheers, Florian
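For reference, the re-trim signature discussed here can be spotted in the OSD log once debugging is raised. A sketch only; osd.2 is a placeholder for whichever OSD is spinning, and the log path assumes the default layout:

```shell
# Raise OSD debugging at runtime; level 10 is enough to see the
# snap_trim messages:
ceph tell osd.2 injectargs '--debug-osd 10'

# Look for "adding snap N to purged_snaps", and for trim attempts that
# come back ENOENT, i.e. snaps that were already trimmed before backfill:
grep -E 'snap_trim|purged_snaps' /var/log/ceph/ceph-osd.2.log | tail -n 20
```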
snap_trimming + backfilling is inefficient with many purged_snaps
(moving this discussion to -devel) > Begin forwarded message: > > From: Florian Haas > Date: 17 Sep 2014 18:02:09 CEST > Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU > To: Dan Van Der Ster > Cc: Craig Lewis , "ceph-us...@lists.ceph.com" > > > On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster > wrote: >> From: Florian Haas >> Sent: Sep 17, 2014 5:33 PM >> To: Dan Van Der Ster >> Cc: Craig Lewis ;ceph-us...@lists.ceph.com >> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU >> >> On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster >> wrote: >>> Hi Florian, >>> On 17 Sep 2014, at 17:09, Florian Haas wrote: Hi Craig, just dug this up in the list archives. On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis wrote: > In the interest of removing variables, I removed all snapshots on all > pools, > then restarted all ceph daemons at the same time. This brought up osd.8 > as > well. So just to summarize this: your 100% CPU problem at the time went away after you removed all snapshots, and the actual cause of the issue was never found? I am seeing a similar issue now, and have filed http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost again. Can you take a look at that issue and let me know if anything in the description sounds familiar? >>> >>> >>> Could your ticket be related to the snap trimming issue I’ve finally >>> narrowed down in the past couple days? >>> >>> http://tracker.ceph.com/issues/9487 >>> >>> Bump up debug_osd to 20 then check the log during one of your incidents. >>> If it is busy logging the snap_trimmer messages, then it’s the same issue. >>> (The issue is that rbd pools have many purged_snaps, but sometimes after >>> backfilling a PG the purged_snaps list is lost and thus the snap trimmer >>> becomes very busy whilst re-trimming thousands of snaps. During that time (a >>> few minutes on my cluster) the OSD is blocked.) >> >> That sounds promising, thank you! 
debug_osd=10 should actually be >> sufficient as those snap_trim messages get logged at that level. :) >> >> Do I understand your issue report correctly in that you have found >> setting osd_snap_trim_sleep to be ineffective, because it's being >> applied when iterating from PG to PG, rather than from snap to snap? >> If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Looking forward to any ideas someone might have. Cheers, Dan
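For anyone who wants to try the reproduction on a throwaway cluster, the recipe above comes down to something like this. An illustrative sketch only: pool name, snap count, and crush weight are made up, and debug_osd >= 10 must already be set on the OSDs to see the purged_snaps messages:

```shell
# Create, then remove, many pool snapshots on a nearly empty test pool:
for i in $(seq 1 500); do rados -p testpool mksnap "snap$i"; done
for i in $(seq 1 500); do rados -p testpool rmsnap "snap$i"; done

# Move PGs around via crush reweight so they backfill:
ceph osd crush reweight osd.0 0.5

# The lost-purged_snaps signature then shows up in the OSD log as:
#   "adding snap 1 to purged_snaps"
grep 'adding snap' /var/log/ceph/ceph-osd.0.log
```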