Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-10-16 Thread Gregory Farnum
On Thu, Oct 16, 2014 at 2:04 AM, Florian Haas  wrote:
> Hi Greg,
>
> sorry, this somehow got stuck in my drafts folder.
>
> On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum  wrote:
>> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas  wrote:
>>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas  wrote:
 On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil  wrote:
> On Sun, 21 Sep 2014, Florian Haas wrote:
>> So yes, I think your patch absolutely still has merit, as would any
>> means of reducing the number of snapshots an OSD will trim in one go.
>> As it is, the situation looks really really bad, specifically
>> considering that RBD and RADOS are meant to be super rock solid, as
>> opposed to say CephFS which is in an experimental state. And contrary
>> to CephFS snapshots, I can't recall any documentation saying that RBD
>> snapshots will break your system.
>
> Yeah, it sounds like a separate issue, and no, the limit is not
> documented because it's definitely not the intended behavior. :)
>
> ...and I see you already have a log attached to #9503.  Will take a look.

 I've already updated that issue in Redmine, but for the list archives
 I should also add this here: Dan's patch for #9503, together with
 Sage's for #9487, makes the problem go away in an instant. I've
 already pointed out that I owe Dan dinner, and Sage, well I already
 owe Sage pretty much lifelong full board. :)
>>>
>>> Looks like I was a bit too eager: while the cluster is behaving nicely
>>> with these patches while nothing happens to any OSDs, it does flag PGs
>>> as incomplete when an OSD goes down. Once the mon osd down out
>>> interval expires things seem to recover/backfill normally, but it's
>>> still disturbing to see this in the interim.
>>>
>>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>>> one of the affected PGs, within the mon osd down out interval, while
>>> it was marked incomplete.
>>>
>>> Dan or Sage, any ideas as to what might be causing this?
>>
>> That *looks* like it's just because the pool has both size and
>> min_size set to 2?
>
> Correct. But the documentation did not reflect that this is a
> perfectly expected side effect of having min_size > 1.
>
> pg-states.rst says:
>
> *Incomplete*
>   Ceph detects that a placement group is missing a necessary period of history
>   from its log.  If you see this state, report a bug, and try to start any
>   failed OSDs that may contain the needed information.
>
> So if min_size > 1 and replicas < min_size, then the incomplete state
> is not a bug but a perfectly expected occurrence, correct?
>
> It's still a bit weird in that the PG seems to behave differently
> depending on min_size. If min_size == 1 (default), then a PG with no
> remaining replicas is stale, unless a replica failed first and the
> primary was written to, after which it also failed, and the replica
> then comes up and can't go primary because it now has outdated data,
> in which case the PG goes "down". It never goes "incomplete".
>
> So is the documentation wrong, or is there something fishy with the
> reported state of the PGs?

I guess the documentation is wrong, although I thought we'd fixed that
particular one. :/ Giant actually distinguishes between these
conditions by adding an "undersized" state to the PG, so it'll be
easier to diagnose.
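
A rough way to check for that on Giant (the grep is just one way to filter the PG dump):

    ceph pg dump pgs_brief | grep undersized
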
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-10-16 Thread Florian Haas
Hi Greg,

sorry, this somehow got stuck in my drafts folder.

On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum  wrote:
> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas  wrote:
>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas  wrote:
>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil  wrote:
 On Sun, 21 Sep 2014, Florian Haas wrote:
> So yes, I think your patch absolutely still has merit, as would any
> means of reducing the number of snapshots an OSD will trim in one go.
> As it is, the situation looks really really bad, specifically
> considering that RBD and RADOS are meant to be super rock solid, as
> opposed to say CephFS which is in an experimental state. And contrary
> to CephFS snapshots, I can't recall any documentation saying that RBD
> snapshots will break your system.

 Yeah, it sounds like a separate issue, and no, the limit is not
 documented because it's definitely not the intended behavior. :)

 ...and I see you already have a log attached to #9503.  Will take a look.
>>>
>>> I've already updated that issue in Redmine, but for the list archives
>>> I should also add this here: Dan's patch for #9503, together with
>>> Sage's for #9487, makes the problem go away in an instant. I've
>>> already pointed out that I owe Dan dinner, and Sage, well I already
>>> owe Sage pretty much lifelong full board. :)
>>
>> Looks like I was a bit too eager: while the cluster is behaving nicely
>> with these patches while nothing happens to any OSDs, it does flag PGs
>> as incomplete when an OSD goes down. Once the mon osd down out
>> interval expires things seem to recover/backfill normally, but it's
>> still disturbing to see this in the interim.
>>
>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>> one of the affected PGs, within the mon osd down out interval, while
>> it was marked incomplete.
>>
>> Dan or Sage, any ideas as to what might be causing this?
>
> That *looks* like it's just because the pool has both size and
> min_size set to 2?

Correct. But the documentation did not reflect that this is a
perfectly expected side effect of having min_size > 1.

pg-states.rst says:

*Incomplete*
  Ceph detects that a placement group is missing a necessary period of history
  from its log.  If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

So if min_size > 1 and replicas < min_size, then the incomplete state
is not a bug but a perfectly expected occurrence, correct?
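
For reference, the settings and states involved can be checked with something like
the following ("rbd" here is just a placeholder pool name):

    ceph osd pool get rbd size        # replica count
    ceph osd pool get rbd min_size    # I/O blocks once fewer replicas than this are up
    ceph pg dump_stuck inactive       # incomplete PGs should eventually show up here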

It's still a bit weird in that the PG seems to behave differently
depending on min_size. If min_size == 1 (default), then a PG with no
remaining replicas is stale, unless a replica failed first and the
primary was written to, after which it also failed, and the replica
then comes up and can't go primary because it now has outdated data,
in which case the PG goes "down". It never goes "incomplete".

So is the documentation wrong, or is there something fishy with the
reported state of the PGs?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-10-15 Thread Samuel Just
It's in giant; the firefly backport will happen once we are happy with
the fallout from the 0.80.7 thing.
-Sam

On Wed, Oct 15, 2014 at 7:47 AM, Dan Van Der Ster
 wrote:
> Hi Sage,
>
>> On 19 Sep 2014, at 17:37, Dan Van Der Ster  wrote:
>>
>> September 19 2014 5:19 PM, "Sage Weil"  wrote:
>>> On Fri, 19 Sep 2014, Dan van der Ster wrote:
>>>
 On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
  wrote:
>> On 19 Sep 2014, at 08:12, Florian Haas  wrote:
>>
>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
>>> On Fri, 19 Sep 2014, Florian Haas wrote:
 Hi Sage,

 was the off-list reply intentional?
>>>
>>> Whoops! Nope :)
>>>
 On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
>> So, disaster is a pretty good description. Would anyone from the core
>> team like to suggest another course of action or workaround, or are
>> Dan and I generally on the right track to make the best out of a
>> pretty bad situation?
>
> The short term fix would probably be to just prevent backfill for the 
> time
> being until the bug is fixed.

 As in, osd max backfills = 0?
>>>
>>> Yeah :)
>>>
>>> Just managed to reproduce the problem...
>>>
>>> sage
>>
>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
>> Thanks!
>
> Indeed :) Thanks Sage!
> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod 
> now…

 Final update, after 4 hours in prod and after draining 8 OSDs -- zero
 slow requests :)
>>>
>>> That's great news!
>>>
>>> But, please be careful. This code hasn't been reviewed yet or been through
>>> any testing! I would hold off on further backfills until it's merged.
>
>
> Any news on those merges? It would be good to get this fixed on the dumpling 
> and firefly branches. We're kind of stuck at the moment :(
>
> Cheers, Dan
>
>


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-10-15 Thread Dan Van Der Ster
Hi Sage,

> On 19 Sep 2014, at 17:37, Dan Van Der Ster  wrote:
> 
> September 19 2014 5:19 PM, "Sage Weil"  wrote: 
>> On Fri, 19 Sep 2014, Dan van der Ster wrote:
>> 
>>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>>>  wrote:
> On 19 Sep 2014, at 08:12, Florian Haas  wrote:
> 
> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>> Hi Sage,
>>> 
>>> was the off-list reply intentional?
>> 
>> Whoops! Nope :)
>> 
>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
> So, disaster is a pretty good description. Would anyone from the core
> team like to suggest another course of action or workaround, or are
> Dan and I generally on the right track to make the best out of a
> pretty bad situation?
 
 The short term fix would probably be to just prevent backfill for the 
 time
 being until the bug is fixed.
>>> 
>>> As in, osd max backfills = 0?
>> 
>> Yeah :)
>> 
>> Just managed to reproduce the problem...
>> 
>> sage
> 
> Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
> Thanks!
 
 Indeed :) Thanks Sage!
 wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
>>> 
>>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
>>> slow requests :)
>> 
>> That's great news!
>> 
>> But, please be careful. This code hasn't been reviewed yet or been through
>> any testing! I would hold off on further backfills until it's merged.


Any news on those merges? It would be good to get this fixed on the dumpling 
and firefly branches. We're kind of stuck at the moment :(

Cheers, Dan




Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-24 Thread Florian Haas
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil  wrote:
> Sam and I discussed this on IRC and we think we have two simpler patches that
> solve the problem more directly.  See wip-9487.

So I understand this makes Dan's patch (and the config parameter that
it introduces) unnecessary, but is it correct to assume that just like
Dan's patch, yours too will not be effective unless osd snap trim sleep > 0?

> Queued for testing now.
> Once that passes we can backport and test for firefly and dumpling too.
>
> Note that this won't make the next dumpling or firefly point releases
> (which are imminent).  Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of
snapshots with <=0.67.11, what's the plan to get them going again
until 0.67.12 comes out? Install the autobuild package from the wip
branch?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-23 Thread Sage Weil
Sam and I discussed this on IRC and we think we have two simpler patches that 
solve the problem more directly.  See wip-9487.  Queued for testing now.  
Once that passes we can backport and test for firefly and dumpling too.

Note that this won't make the next dumpling or firefly point releases 
(which are imminent).  Should be in the next ones, though.

Upside is it looks like Sam found #9113 (snaptrimmer memory leak) at the 
same time, yay!

sage



Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-23 Thread Gregory Farnum
On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas  wrote:
> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas  wrote:
>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil  wrote:
>>> On Sun, 21 Sep 2014, Florian Haas wrote:
 So yes, I think your patch absolutely still has merit, as would any
 means of reducing the number of snapshots an OSD will trim in one go.
 As it is, the situation looks really really bad, specifically
 considering that RBD and RADOS are meant to be super rock solid, as
 opposed to say CephFS which is in an experimental state. And contrary
 to CephFS snapshots, I can't recall any documentation saying that RBD
 snapshots will break your system.
>>>
>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>> documented because it's definitely not the intended behavior. :)
>>>
>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>
>> I've already updated that issue in Redmine, but for the list archives
>> I should also add this here: Dan's patch for #9503, together with
>> Sage's for #9487, makes the problem go away in an instant. I've
>> already pointed out that I owe Dan dinner, and Sage, well I already
>> owe Sage pretty much lifelong full board. :)
>
> Looks like I was a bit too eager: while the cluster is behaving nicely
> with these patches while nothing happens to any OSDs, it does flag PGs
> as incomplete when an OSD goes down. Once the mon osd down out
> interval expires things seem to recover/backfill normally, but it's
> still disturbing to see this in the interim.
>
> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
> one of the affected PGs, within the mon osd down out interval, while
> it was marked incomplete.
>
> Dan or Sage, any ideas as to what might be causing this?

That *looks* like it's just because the pool has both size and
min_size set to 2?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-23 Thread Florian Haas
On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas  wrote:
> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil  wrote:
>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>> So yes, I think your patch absolutely still has merit, as would any
>>> means of reducing the number of snapshots an OSD will trim in one go.
>>> As it is, the situation looks really really bad, specifically
>>> considering that RBD and RADOS are meant to be super rock solid, as
>>> opposed to say CephFS which is in an experimental state. And contrary
>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>> snapshots will break your system.
>>
>> Yeah, it sounds like a separate issue, and no, the limit is not
>> documented because it's definitely not the intended behavior. :)
>>
>> ...and I see you already have a log attached to #9503.  Will take a look.
>
> I've already updated that issue in Redmine, but for the list archives
> I should also add this here: Dan's patch for #9503, together with
> Sage's for #9487, makes the problem go away in an instant. I've
> already pointed out that I owe Dan dinner, and Sage, well I already
> owe Sage pretty much lifelong full board. :)

Looks like I was a bit too eager: while the cluster is behaving nicely
with these patches while nothing happens to any OSDs, it does flag PGs
as incomplete when an OSD goes down. Once the mon osd down out
interval expires things seem to recover/backfill normally, but it's
still disturbing to see this in the interim.

I've updated http://tracker.ceph.com/issues/9503 with a pg query from
one of the affected PGs, within the mon osd down out interval, while
it was marked incomplete.

Dan or Sage, any ideas as to what might be causing this?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-22 Thread Florian Haas
On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil  wrote:
> On Sun, 21 Sep 2014, Florian Haas wrote:
>> So yes, I think your patch absolutely still has merit, as would any
>> means of reducing the number of snapshots an OSD will trim in one go.
>> As it is, the situation looks really really bad, specifically
>> considering that RBD and RADOS are meant to be super rock solid, as
>> opposed to say CephFS which is in an experimental state. And contrary
>> to CephFS snapshots, I can't recall any documentation saying that RBD
>> snapshots will break your system.
>
> Yeah, it sounds like a separate issue, and no, the limit is not
> documented because it's definitely not the intended behavior. :)
>
> ...and I see you already have a log attached to #9503.  Will take a look.

I've already updated that issue in Redmine, but for the list archives
I should also add this here: Dan's patch for #9503, together with
Sage's for #9487, makes the problem go away in an instant. I've
already pointed out that I owe Dan dinner, and Sage, well I already
owe Sage pretty much lifelong full board. :)

Everyone with a ton of snapshots in their clusters (not sure where the
threshold is, but it gets nasty somewhere between 1,000 and 10,000 I
imagine) should probably update to 0.67.11 and 0.80.6 as soon as they
come out, otherwise Terrible Things Will Happen™ if you're ever forced
to delete a large number of snaps at once.

Thanks again to Dan and Sage,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Sage Weil
On Sun, 21 Sep 2014, Florian Haas wrote:
> So yes, I think your patch absolutely still has merit, as would any
> means of reducing the number of snapshots an OSD will trim in one go.
> As it is, the situation looks really really bad, specifically
> considering that RBD and RADOS are meant to be super rock solid, as
> opposed to say CephFS which is in an experimental state. And contrary
> to CephFS snapshots, I can't recall any documentation saying that RBD
> snapshots will break your system.

Yeah, it sounds like a separate issue, and no, the limit is not 
documented because it's definitely not the intended behavior. :)

...and I see you already have a log attached to #9503.  Will take a look.

Thanks!
sage



Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Sage Weil
On Sat, 20 Sep 2014, Alphe Salas wrote:
> Real field testing and real-world proof are better than any unit testing ... I
> would follow Dan's notice of resolution because it's based on a real problem and
> not a phony-style test ground.

It's been reviewed and looks right, but the rados torture tests are 
pretty ... torturous, and this code is delicate.  I would still wait.

> Sage, apart from that problem, is there a solution to the ever-expanding replicas
> problem?

Discard for the kernel RBD client should go upstream this cycle.

As for RADOS consuming more data when RBD blocks are overwritten, I still 
have yet to see any actual evidence of this, and have a hard time seeing 
how it could happen.  A sequence of steps to reproduce would be the next 
step.

sage


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Florian Haas
On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster
 wrote:
> Hi Florian,
>
> September 21 2014 3:33 PM, "Florian Haas"  wrote:
>> That said, I'm not sure that wip-9487-dumpling is the final fix to the
>> issue. On the system where I am seeing the issue, even with the fix
>> deployed, osd's still not only go crazy snap trimming (which by itself
>> would be understandable, as the system has indeed recently had
>> thousands of snapshots removed), but they also still produce the
>> previously seen ENOENT messages indicating they're trying to trim
>> snaps that aren't there.
>>
>
> You should be able to tell exactly how many snaps need to be trimmed. Check 
> the current purged_snaps with
>
> ceph pg x.y query
>
> and also check the snap_trimq from debug_osd=10. The problem fixed in 
> wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in 
> your cluster purged_snaps is "correct" (which it should be after the fix from 
> Sage), and it still has lots of snaps to trim, then I believe the only thing 
> to do is let those snaps all get trimmed. (my other patch linked sometime 
> earlier in this thread might help by breaking up all that trimming work into 
> smaller pieces, but that was never tested).

Yes, it does indeed look like the system does have thousands of
snapshots left to trim. That said, since the PGs are locked during
this time, this creates a situation where the cluster is becoming
unusable with no way for the user to recover.

> Entering the realm of speculation, I wonder if your OSDs are getting 
> interrupted, marked down, out, or crashing before they have the opportunity 
> to persist purged_snaps? purged_snaps is updated in 
> ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to 
> actually send that transaction to its peers, then eventually it or the new 
> primary needs to start again, and no progress is ever made. If this is what 
> is happening on your cluster, then again, perhaps my osd_snap_trim_max patch 
> could be a solution.

Since the snap trimmer immediately jacks the affected OSDs up to 100%
CPU utilization, and they stop even responding to heartbeats, yes they
do get marked down and that makes the issue much worse. Even when
setting nodown, though, then that doesn't change the fact that the
affected OSDs just spin practically indefinitely.

So, even with the patch for 9487, which fixes *your* issue of the
cluster trying to trim tons of snaps when in fact it should be
trimming only a handful, the user is still in a world of pain when
they do indeed have tons of snaps to trim. And obviously, neither
osd max backfills nor osd recovery max active helps here, because even
a single backfill/recovery makes the OSD go nuts.

There is the silly option of setting osd_snap_trim_sleep to say 61
minutes, and restarting the ceph-osd daemons before the snap trim can
kick in, i.e. hourly, via a cron job. Of course, while this prevents
the OSD from going into a death spin, it only perpetuates the problem
until a patch for this issue is available, because snap trimming never
even runs, let alone completes.

This is particularly bad because a user can get themselves a
non-functional cluster simply by trying to delete thousands of
snapshots at once. If you consider a tiny virtualization cluster of
just 100 persistent VMs, out of which you take one snapshot an hour,
then deleting the snapshots taken in one month puts you well above
that limit. So we're not talking about outrageous numbers here. I
don't think anyone can fault any user for attempting this.

What makes the situation even worse is that there is no cluster-wide
limit to the number of snapshots, or even say snapshots per RBD
volume, or snapshots per PG, nor any limit on the number of snapshots
deleted concurrently.

So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Dan van der Ster
Hi Florian,

September 21 2014 3:33 PM, "Florian Haas"  wrote: 
> That said, I'm not sure that wip-9487-dumpling is the final fix to the
> issue. On the system where I am seeing the issue, even with the fix
> deployed, osd's still not only go crazy snap trimming (which by itself
> would be understandable, as the system has indeed recently had
> thousands of snapshots removed), but they also still produce the
> previously seen ENOENT messages indicating they're trying to trim
> snaps that aren't there.
> 

You should be able to tell exactly how many snaps need to be trimmed. Check the 
current purged_snaps with

ceph pg x.y query

and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 
is the (mis)communication of purged_snaps to a new OSD. But if in your cluster 
purged_snaps is "correct" (which it should be after the fix from Sage), and it 
still has lots of snaps to trim, then I believe the only thing to do is let 
those snaps all get trimmed. (my other patch linked sometime earlier in this 
thread might help by breaking up all that trimming work into smaller pieces, 
but that was never tested).
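
Concretely, that check looks roughly like this (the PG and OSD ids are placeholders):

    ceph pg 5.3f query | grep -A1 purged_snaps      # current purged_snaps intervals
    ceph tell osd.12 injectargs '--debug-osd 10'    # then look for snap_trimq lines
    grep snap_trimq /var/log/ceph/ceph-osd.12.log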

Entering the realm of speculation, I wonder if your OSDs are getting 
interrupted, marked down, out, or crashing before they have the opportunity to 
persist purged_snaps? purged_snaps is updated in 
ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to 
actually send that transaction to its peers, then eventually it or the new 
primary needs to start again, and no progress is ever made. If this is what is 
happening on your cluster, then again, perhaps my osd_snap_trim_max patch could 
be a solution.

Cheers, Dan

> That system, however, has PGs marked as recovering, not backfilling as
> in Dan's system. Not sure if wip-9487 falls short of fixing the issue
> at its root. Sage, whenever you have time, would you mind commenting?
> 
> Cheers,
> Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Florian Haas
On Sat, Sep 20, 2014 at 9:08 PM, Alphe Salas  wrote:
> Real field testing and real-world proof are better than any unit testing ... I
> would follow Dan's notice of resolution because it's based on a real problem and
> not a phony-style test ground.

That statement is almost an insult to the authors and maintainers of
the testing framework around Ceph. Therefore, I'm taking the liberty
to register my objection.

That said, I'm not sure that wip-9487-dumpling is the final fix to the
issue. On the system where I am seeing the issue, even with the fix
deployed, osd's still not only go crazy snap trimming (which by itself
would be understandable, as the system has indeed recently had
thousands of snapshots removed), but they also still produce the
previously seen ENOENT messages indicating they're trying to trim
snaps that aren't there.

That system, however, has PGs marked as recovering, not backfilling as
in Dan's system. Not sure if wip-9487 falls short of fixing the issue
at its root. Sage, whenever you have time, would you mind commenting?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-19 Thread Dan van der Ster
September 19 2014 5:19 PM, "Sage Weil"  wrote: 
> On Fri, 19 Sep 2014, Dan van der Ster wrote:
> 
>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>>  wrote:
 On 19 Sep 2014, at 08:12, Florian Haas  wrote:
 
 On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
> On Fri, 19 Sep 2014, Florian Haas wrote:
>> Hi Sage,
>> 
>> was the off-list reply intentional?
> 
> Whoops! Nope :)
> 
>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
 So, disaster is a pretty good description. Would anyone from the core
 team like to suggest another course of action or workaround, or are
 Dan and I generally on the right track to make the best out of a
 pretty bad situation?
>>> 
>>> The short term fix would probably be to just prevent backfill for the 
>>> time
>>> being until the bug is fixed.
>> 
>> As in, osd max backfills = 0?
> 
> Yeah :)
> 
> Just managed to reproduce the problem...
> 
> sage
 
 Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
 Thanks!
>>> 
>>> Indeed :) Thanks Sage!
>>> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
>> 
>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
>> slow requests :)
> 
> That's great news!
> 
> But, please be careful. This code hasn't been reviewed yet or been through
> any testing! I would hold off on further backfills until it's merged.

Roger; I've been watching it very closely and so far it seems to work very 
well. Looking forward to that merge :)

Cheers, Dan


> 
> Thanks!
> sage


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-19 Thread Sage Weil
On Fri, 19 Sep 2014, Dan van der Ster wrote:
> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>  wrote:
> >> On 19 Sep 2014, at 08:12, Florian Haas  wrote:
> >>
> >> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
> >>> On Fri, 19 Sep 2014, Florian Haas wrote:
>  Hi Sage,
> 
>  was the off-list reply intentional?
> >>>
> >>> Whoops!  Nope :)
> >>>
>  On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
> >> So, disaster is a pretty good description. Would anyone from the core
> >> team like to suggest another course of action or workaround, or are
> >> Dan and I generally on the right track to make the best out of a
> >> pretty bad situation?
> >
> > The short term fix would probably be to just prevent backfill for the 
> > time
> > being until the bug is fixed.
> 
>  As in, osd max backfills = 0?
> >>>
> >>> Yeah :)
> >>>
> >>> Just managed to reproduce the problem...
> >>>
> >>> sage
> >>
> >> Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
> >> Thanks!
> >
> > Indeed :) Thanks Sage!
> > wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
> 
> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
> slow requests :)

That's great news!

But, please be careful.  This code hasn't been reviewed yet or been through 
any testing!  I would hold off on further backfills until it's merged.

Thanks!
sage


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-19 Thread Dan van der Ster
On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
 wrote:
>> On 19 Sep 2014, at 08:12, Florian Haas  wrote:
>>
>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
>>> On Fri, 19 Sep 2014, Florian Haas wrote:
 Hi Sage,

 was the off-list reply intentional?
>>>
>>> Whoops!  Nope :)
>>>
 On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
>> So, disaster is a pretty good description. Would anyone from the core
>> team like to suggest another course of action or workaround, or are
>> Dan and I generally on the right track to make the best out of a
>> pretty bad situation?
>
> The short term fix would probably be to just prevent backfill for the time
> being until the bug is fixed.

 As in, osd max backfills = 0?
>>>
>>> Yeah :)
>>>
>>> Just managed to reproduce the problem...
>>>
>>> sage
>>
>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
>> Thanks!
>
> Indeed :) Thanks Sage!
> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…

Final update, after 4 hours in prod and after draining 8 OSDs -- zero
slow requests :)

Thanks again!

Dan


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-19 Thread Dan Van Der Ster
> On 19 Sep 2014, at 08:12, Florian Haas  wrote:
> 
> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>> Hi Sage,
>>> 
>>> was the off-list reply intentional?
>> 
>> Whoops!  Nope :)
>> 
>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
> So, disaster is a pretty good description. Would anyone from the core
> team like to suggest another course of action or workaround, or are
> Dan and I generally on the right track to make the best out of a
> pretty bad situation?
 
 The short term fix would probably be to just prevent backfill for the time
 being until the bug is fixed.
>>> 
>>> As in, osd max backfills = 0?
>> 
>> Yeah :)
>> 
>> Just managed to reproduce the problem...
>> 
>> sage
> 
> Saw the wip branch. Color me freakishly impressed on the turnaround. :) 
> Thanks!

Indeed :) Thanks Sage!
wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
Cheers, 
Dan

Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil  wrote:
> On Fri, 19 Sep 2014, Florian Haas wrote:
>> Hi Sage,
>>
>> was the off-list reply intentional?
>
> Whoops!  Nope :)
>
>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
>> >> So, disaster is a pretty good description. Would anyone from the core
>> >> team like to suggest another course of action or workaround, or are
>> >> Dan and I generally on the right track to make the best out of a
>> >> pretty bad situation?
>> >
>> > The short term fix would probably be to just prevent backfill for the time
>> > being until the bug is fixed.
>>
>> As in, osd max backfills = 0?
>
> Yeah :)
>
> Just managed to reproduce the problem...
>
> sage

Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Sage Weil
On Fri, 19 Sep 2014, Florian Haas wrote:
> Hi Sage,
> 
> was the off-list reply intentional?

Whoops!  Nope :)

> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil  wrote:
> >> So, disaster is a pretty good description. Would anyone from the core
> >> team like to suggest another course of action or workaround, or are
> >> Dan and I generally on the right track to make the best out of a
> >> pretty bad situation?
> >
> > The short term fix would probably be to just prevent backfill for the time
> > being until the bug is fixed.
> 
> As in, osd max backfills = 0?

Yeah :)

Just managed to reproduce the problem...

sage
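
For reference, a runtime way to apply that cluster-wide without restarting daemons
(persist it with "osd max backfills = 0" under [osd] in ceph.conf):

    ceph tell osd.* injectargs '--osd-max-backfills 0'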

> > The root of the problem seems to be that it is trying to trim snaps that
> > aren't there.  I'm trying to reproduce the issue now!  Hopefully the fix
> > is simple...
> >
> > http://tracker.ceph.com/issues/9487
> >
> > Thanks!
> > sage
> 
> Thanks. :)
> 
> Cheers,
> Florian
> 
> 


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster
 wrote:
> Hi,
>
> September 18 2014 9:03 PM, "Florian Haas"  wrote:
>> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster 
>>  wrote:
>>
>>> Hi Florian,
>>>
>>> On Sep 18, 2014 7:03 PM, Florian Haas  wrote:
 Hi Dan,

 saw the pull request, and can confirm your observations, at least
 partially. Comments inline.

 On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
  wrote:
>>> Do I understand your issue report correctly in that you have found
>>> setting osd_snap_trim_sleep to be ineffective, because it's being
>>> applied when iterating from PG to PG, rather than from snap to snap?
>>> If so, then I'm guessing that that can hardly be intentional…
>
>
> I’m beginning to agree with you on that guess. AFAICT, the normal 
> behavior of the snap trimmer
>>> is
 to trim one single snap, the one which is in the snap_trimq but not yet in 
 purged_snaps. So the
 only time the current sleep implementation could be useful is if we rm’d a 
 snap across many PGs
>>> at
 once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem 
 anyway since you’d at
 most need to trim O(100) PGs.

 Hmm. I'm actually seeing this in a system where the problematic snaps
 could *only* have been RBD snaps.
>>>
>>> True, as am I. The current sleep is useful in this case, but since we'd 
>>> normally only expect up
>> to
>>> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs 
>>> would finish rather
>>> quickly anyway. Latency would surely be increased momentarily, but I 
>>> wouldn't expect 90s slow
>>> requests like I have with the 3 snap_trimq single PG.
>>>
>>> Possibly the sleep is useful in both places.
>>>
> We could move the snap trim sleep into the SnapTrimmer state machine, for 
> example in
 ReplicatedPG::NotTrimming::react. This should allow other IOs to get 
 through to the OSD, but of
 course the trimming PG would remain locked. And it would be locked for 
 even longer now due to
>> the
 sleep.
>
> To solve that we could limit the number of trims per instance of the 
> SnapTrimmer, like I’ve
>> done
 in this pull req: https://github.com/ceph/ceph/pull/2516
> Breaking out of the trimmer like that should allow IOs to the trimming PG 
> to get through.
>
> The second aspect of this issue is why are the purged_snaps being lost to 
> begin with. I’ve
 managed to reproduce that on my test cluster. All you have to do is create 
 many pool snaps (e.g.
>>> of
 a nearly empty pool), then rmsnap all those snapshots. Then use crush 
 reweight to move the PGs
 around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, 
 which is one signature
>>> of
 this lost purged_snaps issue. To reproduce slow requests the number of 
 snaps purged needs to be
 O(1).

 Hmmm, I'm not sure if I confirm that. I see "adding snap X to
 purged_snaps", but only after the snap has been purged. See
 https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
 fact that the OSD tries to trim a snap only to get an ENOENT is
 probably indicative of something being fishy with the snaptrimq and/or
 the purged_snaps list as well.
>>>
>>> With such a long snap_trimq there in your log, I suspect you're seeing the 
>>> exact same behavior as
>> I
>>> am. In my case the first snap trimmed is snap 1, of course because that is 
>>> the first rm'd snap,
>> and
>>> the contents of your pool are surely different. I also see the ENOENT 
>>> messages... again
>> confirming
>>> those snaps were already trimmed. Anyway, what I've observed is that a 
>>> large snap_trimq like that
>>> will block the OSD until they are all re-trimmed.
>>
>> That's... a mess.
>>
>> So what is your workaround for recovery? My hunch would be to
>>
>> - stop all access to the cluster;
>> - set nodown and noout so that other OSDs don't mark spinning OSDs
>> down (which would cause all sorts of primary and PG reassignments,
>> useless backfill/recovery when mon osd down out interval expires,
>> etc.);
>> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
>> so that at least *between* PGs, the OSD has a chance to respond to
>> heartbeats and do whatever else it needs to do;
>> - let the snap trim play itself out over several hours (days?).
>>
>
> What I've been doing is I just continue draining my OSDs, two at a time. Each 
> time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour 
> it takes to drain) while a single PG re-trims, leading to ~100 slow requests. 
> The OSD must still be responding to the peer pings, since other OSDs do not 
> mark it down. Luckily this doesn't happen with every single movement of our 
> pool 5 PGs, otherwise it would be a disaster like you said.

So just to clarify, what you're doing is out of the 

Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Dan van der Ster
-- Dan van der Ster || Data & Storage Services || CERN IT Department --

September 18 2014 9:12 PM, "Dan van der Ster"  
wrote: 
> Hi,
> 
> September 18 2014 9:03 PM, "Florian Haas"  wrote:
> 
>> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster 
>>  wrote:
>> 
>>> Hi Florian,
>>> 
>>> On Sep 18, 2014 7:03 PM, Florian Haas  wrote: 
 Hi Dan,
 
 saw the pull request, and can confirm your observations, at least
 partially. Comments inline.
 
 On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
  wrote: 
>>> Do I understand your issue report correctly in that you have found
>>> setting osd_snap_trim_sleep to be ineffective, because it's being
>>> applied when iterating from PG to PG, rather than from snap to snap?
>>> If so, then I'm guessing that that can hardly be intentional…
> 
> I’m beginning to agree with you on that guess. AFAICT, the normal 
> behavior of the snap trimmer
>>> 
>>> is 
 to trim one single snap, the one which is in the snap_trimq but not yet in 
 purged_snaps. So the
 only time the current sleep implementation could be useful is if we rm’d a 
 snap across many PGs
>>> 
>>> at 
 once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem 
 anyway since you’d at
 most need to trim O(100) PGs.
 
 Hmm. I'm actually seeing this in a system where the problematic snaps
 could *only* have been RBD snaps.
>>> 
>>> True, as am I. The current sleep is useful in this case, but since we'd 
>>> normally only expect up
>> 
>> to 
>>> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs 
>>> would finish rather
>>> quickly anyway. Latency would surely be increased momentarily, but I 
>>> wouldn't expect 90s slow
>>> requests like I have with the 3 snap_trimq single PG.
>>> 
>>> Possibly the sleep is useful in both places.
>>> 
> We could move the snap trim sleep into the SnapTrimmer state machine, for 
> example in
 
 ReplicatedPG::NotTrimming::react. This should allow other IOs to get 
 through to the OSD, but of
 course the trimming PG would remain locked. And it would be locked for 
 even longer now due to
>> 
>> the 
 sleep. 
> To solve that we could limit the number of trims per instance of the 
> SnapTrimmer, like I’ve
>> 
>> done 
 in this pull req: https://github.com/ceph/ceph/pull/2516 
> Breaking out of the trimmer like that should allow IOs to the trimming PG 
> to get through.
> 
> The second aspect of this issue is why are the purged_snaps being lost to 
> begin with. I’ve
 
 managed to reproduce that on my test cluster. All you have to do is create 
 many pool snaps
> (e.g.
>>> 
>>> of 
 a nearly empty pool), then rmsnap all those snapshots. Then use crush 
 reweight to move the PGs
 around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, 
 which is one
> signature
>>> 
>>> of 
 this lost purged_snaps issue. To reproduce slow requests the number of 
 snaps purged needs to be
 O(1).
 
 Hmmm, I'm not sure if I confirm that. I see "adding snap X to
 purged_snaps", but only after the snap has been purged. See
 https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
 fact that the OSD tries to trim a snap only to get an ENOENT is
 probably indicative of something being fishy with the snaptrimq and/or
 the purged_snaps list as well.
>>> 
>>> With such a long snap_trimq there in your log, I suspect you're seeing the 
>>> exact same behavior
> as
>> 
>> I 
>>> am. In my case the first snap trimmed is snap 1, of course because that is 
>>> the first rm'd snap,
>> 
>> and 
>>> the contents of your pool are surely different. I also see the ENOENT 
>>> messages... again
>> 
>> confirming 
>>> those snaps were already trimmed. Anyway, what I've observed is that a 
>>> large snap_trimq like
> that
>>> will block the OSD until they are all re-trimmed.
>> 
>> That's... a mess.
>> 
>> So what is your workaround for recovery? My hunch would be to
>> 
>> - stop all access to the cluster;
>> - set nodown and noout so that other OSDs don't mark spinning OSDs
>> down (which would cause all sorts of primary and PG reassignments,
>> useless backfill/recovery when mon osd down out interval expires,
>> etc.);
>> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
>> so that at least *between* PGs, the OSD has a chance to respond to
>> heartbeats and do whatever else it needs to do;
>> - let the snap trim play itself out over several hours (days?).
> 
> What I've been doing is I just continue draining my OSDs, two at a time. Each 
> time, 1-2 other OSDs
> become blocked for a couple minutes (out of the ~1 hour it takes to drain) 
> while a single PG
> re-trims, leading to ~100 slow requests. The OSD must still be responding to 
> the peer pings, since
> other OSDs do not mark it down. Luckily this doesn't 

Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Dan van der Ster
Hi,

September 18 2014 9:03 PM, "Florian Haas"  wrote: 
> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster  
> wrote:
> 
>> Hi Florian,
>> 
>> On Sep 18, 2014 7:03 PM, Florian Haas  wrote: 
>>> Hi Dan,
>>> 
>>> saw the pull request, and can confirm your observations, at least
>>> partially. Comments inline.
>>> 
>>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>>>  wrote:
>> Do I understand your issue report correctly in that you have found
>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> applied when iterating from PG to PG, rather than from snap to snap?
>> If so, then I'm guessing that that can hardly be intentional…
 
 
 I’m beginning to agree with you on that guess. AFAICT, the normal behavior 
 of the snap trimmer
>> is
>>> to trim one single snap, the one which is in the snap_trimq but not yet in 
>>> purged_snaps. So the
>>> only time the current sleep implementation could be useful is if we rm’d a 
>>> snap across many PGs
>> at
>>> once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem 
>>> anyway since you’d at
>>> most need to trim O(100) PGs.
>>> 
>>> Hmm. I'm actually seeing this in a system where the problematic snaps
>>> could *only* have been RBD snaps.
>> 
>> True, as am I. The current sleep is useful in this case, but since we'd 
>> normally only expect up
> to
>> ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs 
>> would finish rather
>> quickly anyway. Latency would surely be increased momentarily, but I 
>> wouldn't expect 90s slow
>> requests like I have with the 3 snap_trimq single PG.
>> 
>> Possibly the sleep is useful in both places.
>> 
 We could move the snap trim sleep into the SnapTrimmer state machine, for 
 example in
>>> ReplicatedPG::NotTrimming::react. This should allow other IOs to get 
>>> through to the OSD, but of
>>> course the trimming PG would remain locked. And it would be locked for even 
>>> longer now due to
> the
>>> sleep.
 
 To solve that we could limit the number of trims per instance of the 
 SnapTrimmer, like I’ve
> done
>>> in this pull req: https://github.com/ceph/ceph/pull/2516
 Breaking out of the trimmer like that should allow IOs to the trimming PG 
 to get through.
 
 The second aspect of this issue is why are the purged_snaps being lost to 
 begin with. I’ve
>>> managed to reproduce that on my test cluster. All you have to do is create 
>>> many pool snaps (e.g.
>> of
>>> a nearly empty pool), then rmsnap all those snapshots. Then use crush 
>>> reweight to move the PGs
>>> around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, 
>>> which is one signature
>> of
>>> this lost purged_snaps issue. To reproduce slow requests the number of 
>>> snaps purged needs to be
>>> O(1).
>>> 
>>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to
>>> purged_snaps", but only after the snap has been purged. See
>>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>>> fact that the OSD tries to trim a snap only to get an ENOENT is
>>> probably indicative of something being fishy with the snaptrimq and/or
>>> the purged_snaps list as well.
>> 
>> With such a long snap_trimq there in your log, I suspect you're seeing the 
>> exact same behavior as
> I
>> am. In my case the first snap trimmed is snap 1, of course because that is 
>> the first rm'd snap,
> and
>> the contents of your pool are surely different. I also see the ENOENT 
>> messages... again
> confirming
>> those snaps were already trimmed. Anyway, what I've observed is that a large 
>> snap_trimq like that
>> will block the OSD until they are all re-trimmed.
> 
> That's... a mess.
> 
> So what is your workaround for recovery? My hunch would be to
> 
> - stop all access to the cluster;
> - set nodown and noout so that other OSDs don't mark spinning OSDs
> down (which would cause all sorts of primary and PG reassignments,
> useless backfill/recovery when mon osd down out interval expires,
> etc.);
> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
> so that at least *between* PGs, the OSD has a chance to respond to
> heartbeats and do whatever else it needs to do;
> - let the snap trim play itself out over several hours (days?).
> 

What I've been doing is I just continue draining my OSDs, two at a time. Each 
time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it 
takes to drain) while a single PG re-trims, leading to ~100 slow requests. The 
OSD must still be responding to the peer pings, since other OSDs do not mark it 
down. Luckily this doesn't happen with every single movement of our pool 5 PGs, 
otherwise it would be a disaster like you said.

Cheers, Dan

> That sounds utterly awful, but if anyone has a better idea (other than
> "wait until the patch is merged"), I'd be all ears.
> 
> Cheers
> Florian

Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour
 wrote:
> Hi Florian,
>
> On Sep 18, 2014 7:03 PM, Florian Haas  wrote:
>>
>> Hi Dan,
>>
>> saw the pull request, and can confirm your observations, at least
>> partially. Comments inline.
>>
>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>>  wrote:
>> >>> Do I understand your issue report correctly in that you have found
>> >>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> >>> applied when iterating from PG to PG, rather than from snap to snap?
>> >>> If so, then I'm guessing that that can hardly be intentional…
>> >
>> >
>> > I’m beginning to agree with you on that guess. AFAICT, the normal behavior 
>> > of the snap trimmer is to trim one single snap, the one which is in the 
>> > snap_trimq but not yet in purged_snaps. So the only time the current sleep 
>> > implementation could be useful is if we rm’d a snap across many PGs at 
>> > once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem 
>> > anyway since you’d at most need to trim O(100) PGs.
>>
>> Hmm. I'm actually seeing this in a system where the problematic snaps
>> could *only* have been RBD snaps.
>>
>
> True, as am I. The current sleep is useful in this case, but since we'd 
> normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap 
> across all of those PGs would finish rather quickly anyway. Latency would 
> surely be increased momentarily, but I wouldn't expect 90s slow requests like 
> I have with the 3 snap_trimq single PG.
>
> Possibly the sleep is useful in both places.
>
>> > We could move the snap trim sleep into the SnapTrimmer state machine, for 
>> > example in ReplicatedPG::NotTrimming::react. This should allow other IOs 
>> > to get through to the OSD, but of course the trimming PG would remain 
>> > locked. And it would be locked for even longer now due to the sleep.
>> >
>> > To solve that we could limit the number of trims per instance of the 
>> > SnapTrimmer, like I’ve done in this pull req: 
>> > https://github.com/ceph/ceph/pull/2516
>> > Breaking out of the trimmer like that should allow IOs to the trimming PG 
>> > to get through.
>> >
>> > The second aspect of this issue is why are the purged_snaps being lost to 
>> > begin with. I’ve managed to reproduce that on my test cluster. All you 
>> > have to do is create many pool snaps (e.g. of a nearly empty pool), then 
>> > rmsnap all those snapshots. Then use crush reweight to move the PGs 
>> > around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, 
>> > which is one signature of this lost purged_snaps issue. To reproduce slow 
>> > requests the number of snaps purged needs to be O(1).
>>
>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to
>> purged_snaps", but only after the snap has been purged. See
>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>> fact that the OSD tries to trim a snap only to get an ENOENT is
>> probably indicative of something being fishy with the snaptrimq and/or
>> the purged_snaps list as well.
>>
>
> With such a long snap_trimq there in your log, I suspect you're seeing the 
> exact same behavior as I am. In my case the first snap trimmed is snap 1, of 
> course because that is the first rm'd snap, and the contents of your pool are 
> surely different. I also see the ENOENT messages... again confirming those 
> snaps were already trimmed. Anyway, what I've observed is that a large 
> snap_trimq like that will block the OSD until they are all re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
Hi Dan,

saw the pull request, and can confirm your observations, at least
partially. Comments inline.

On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
 wrote:
>>> Do I understand your issue report correctly in that you have found
>>> setting osd_snap_trim_sleep to be ineffective, because it's being
>>> applied when iterating from PG to PG, rather than from snap to snap?
>>> If so, then I'm guessing that that can hardly be intentional…
>
>
> I’m beginning to agree with you on that guess. AFAICT, the normal behavior of 
> the snap trimmer is to trim one single snap, the one which is in the 
> snap_trimq but not yet in purged_snaps. So the only time the current sleep 
> implementation could be useful is if we rm’d a snap across many PGs at once, 
> e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway 
> since you’d at most need to trim O(100) PGs.

Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.

> We could move the snap trim sleep into the SnapTrimmer state machine, for 
> example in ReplicatedPG::NotTrimming::react. This should allow other IOs to 
> get through to the OSD, but of course the trimming PG would remain locked. 
> And it would be locked for even longer now due to the sleep.
>
> To solve that we could limit the number of trims per instance of the 
> SnapTrimmer, like I’ve done in this pull req: 
> https://github.com/ceph/ceph/pull/2516
> Breaking out of the trimmer like that should allow IOs to the trimming PG to 
> get through.
>
> The second aspect of this issue is why the purged_snaps are being lost to 
> begin with. I’ve managed to reproduce that on my test cluster. All you have 
> to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap 
> all those snapshots. Then use crush reweight to move the PGs around. With 
> debug_osd>=10, you will see "adding snap 1 to purged_snaps", which is one 
> signature of this lost purged_snaps issue. To reproduce slow requests the 
> number of snaps purged needs to be in the thousands.

Hmmm, I'm not sure I can confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.
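
In case it's useful: I'm just grepping the OSD logs for the trimmer messages,
roughly like this (default log locations assumed):

  grep -E 'snap_trimmer|adding snap .* to purged_snaps' /var/log/ceph/ceph-osd.*.log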

> Looking forward to any ideas someone might have.

So am I. :)

Cheers,
Florian


snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Dan Van Der Ster
(moving this discussion to -devel)

> Begin forwarded message:
> 
> From: Florian Haas 
> Date: 17 Sep 2014 18:02:09 CEST
> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
> To: Dan Van Der Ster 
> Cc: Craig Lewis , "ceph-us...@lists.ceph.com" 
> 
> 
> On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster
>  wrote:
>> From: Florian Haas 
>> Sent: Sep 17, 2014 5:33 PM
>> To: Dan Van Der Ster
>> Cc: Craig Lewis ;ceph-us...@lists.ceph.com
>> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
>> 
>> On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster
>>  wrote:
>>> Hi Florian,
>>> 
 On 17 Sep 2014, at 17:09, Florian Haas  wrote:
 
 Hi Craig,
 
 just dug this up in the list archives.
 
 On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis 
 wrote:
> In the interest of removing variables, I removed all snapshots on all pools,
> then restarted all ceph daemons at the same time.  This brought up osd.8 as
> well.
 
 So just to summarize this: your 100% CPU problem at the time went away
 after you removed all snapshots, and the actual cause of the issue was
 never found?
 
 I am seeing a similar issue now, and have filed
 http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
 again. Can you take a look at that issue and let me know if anything
 in the description sounds familiar?
>>> 
>>> 
>>> Could your ticket be related to the snap trimming issue I’ve finally
>>> narrowed down in the past couple days?
>>> 
>>>  http://tracker.ceph.com/issues/9487
>>> 
>>> Bump up debug_osd to 20 then check the log during one of your incidents.
>>> If it is busy logging the snap_trimmer messages, then it’s the same issue.
>>> (The issue is that rbd pools have many purged_snaps, but sometimes after
>>> backfilling a PG the purged_snaps list is lost and thus the snap trimmer
>>> becomes very busy whilst re-trimming thousands of snaps. During that time (a
>>> few minutes on my cluster) the OSD is blocked.)
>> 
>> That sounds promising, thank you! debug_osd=10 should actually be
>> sufficient as those snap_trim messages get logged at that level. :)
>> 
>> Do I understand your issue report correctly in that you have found
>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> applied when iterating from PG to PG, rather than from snap to snap?
>> If so, then I'm guessing that that can hardly be intentional…


I’m beginning to agree with you on that guess. AFAICT, the normal behavior of 
the snap trimmer is to trim one single snap, the one which is in the snap_trimq 
but not yet in purged_snaps. So the only time the current sleep implementation 
could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool 
snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most 
need to trim O(100) PGs.

We could move the snap trim sleep into the SnapTrimmer state machine, for 
example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get 
through to the OSD, but of course the trimming PG would remain locked. And it 
would be locked for even longer now due to the sleep.

To solve that we could limit the number of trims per instance of the 
SnapTrimmer, like I’ve done in this pull req: 
https://github.com/ceph/ceph/pull/2516
Breaking out of the trimmer like that should allow IOs to the trimming PG to 
get through.

The second aspect of this issue is why the purged_snaps are being lost to begin 
with. I’ve managed to reproduce that on my test cluster. All you have to do is 
create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those 
snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, 
you will see "adding snap 1 to purged_snaps", which is one signature of this 
lost purged_snaps issue. To reproduce slow requests the number of snaps purged 
needs to be in the thousands.
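
In rough command form (the pool name, snapshot count and reweight value below
are placeholders, not the exact commands I ran), the reproducer looks
something like:

  ceph osd pool create sniptest 128
  for i in $(seq 1 1000); do ceph osd pool mksnap sniptest snap$i; done
  for i in $(seq 1 1000); do ceph osd pool rmsnap sniptest snap$i; done
  ceph osd crush reweight osd.0 1.2      # move some PGs off/onto osd.0
  ceph tell osd.0 injectargs '--debug_osd 10'
  grep 'adding snap' /var/log/ceph/ceph-osd.0.log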

Looking forward to any ideas someone might have.

Cheers, Dan


