On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour
<daniel.vanders...@cern.ch> wrote:
> Hi Florian,
>
> On Sep 18, 2014 7:03 PM, Florian Haas <flor...@hastexo.com> wrote:
>>
>> Hi Dan,
>>
>> saw the pull request, and can confirm your observations, at least
>> partially. Comments inline.
>>
>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>> <daniel.vanders...@cern.ch> wrote:
>> >>> Do I understand your issue report correctly in that you have found
>> >>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> >>> applied when iterating from PG to PG, rather than from snap to snap?
>> >>> If so, then I'm guessing that that can hardly be intentional…
>> >
>> >
>> > I’m beginning to agree with you on that guess. AFAICT, the normal behavior
>> > of the snap trimmer is to trim a single snap: the one that is in the
>> > snap_trimq but not yet in purged_snaps. So the only time the current sleep
>> > implementation could be useful is if we rm’d a snap across many PGs at
>> > once, e.g. an rm of a pool snap or an rbd snap. But that isn’t a huge
>> > problem anyway, since you’d at most need to trim O(100) PGs.
>>
>> Hmm. I'm actually seeing this in a system where the problematic snaps
>> could *only* have been RBD snaps.
>>
>
> True, and so am I. The current sleep is useful in that case, but since we'd
> normally only expect up to ~100 of these PGs per OSD, trimming one snap
> across all of those PGs would finish rather quickly anyway. Latency would
> surely increase momentarily, but I wouldn't expect the 90s slow requests
> I'm seeing with a single PG whose snap_trimq holds 30000 entries.
>
> Possibly the sleep is useful in both places.
>
>> > We could move the snap trim sleep into the SnapTrimmer state machine, for 
>> > example in ReplicatedPG::NotTrimming::react. This should allow other IOs 
>> > to get through to the OSD, but of course the trimming PG would remain 
>> > locked. And it would be locked for even longer now due to the sleep.
>> >
>> > To solve that we could limit the number of trims per instance of the 
>> > SnapTrimmer, like I’ve done in this pull req: 
>> > https://github.com/ceph/ceph/pull/2516
>> > Breaking out of the trimmer like that should allow IOs to the trimming PG 
>> > to get through.
>> >
>> > The second aspect of this issue is why the purged_snaps are being lost to
>> > begin with. I’ve managed to reproduce that on my test cluster. All you
>> > have to do is create many pool snaps (e.g. of a nearly empty pool), then
>> > rmsnap all those snapshots. Then use crush reweight to move the PGs
>> > around. With debug_osd >= 10, you will see "adding snap 1 to purged_snaps",
>> > which is one signature of this lost purged_snaps issue. To reproduce slow
>> > requests, the number of snaps purged needs to be O(10000).
>>
>> Hmmm, I'm not sure I can confirm that. I see "adding snap X to
>> purged_snaps", but only after the snap has been purged. See
>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>> fact that the OSD tries to trim a snap only to get an ENOENT is
>> probably indicative of something being fishy with the snap_trimq and/or
>> the purged_snaps list as well.
>>
>
> With such a long snap_trimq in your log, I suspect you're seeing the
> exact same behavior as I am. In my case the first snap trimmed is snap 1,
> of course, because that is the first snap I rm'd; the contents of your
> pool are surely different. I also see the ENOENT messages... again
> confirming that those snaps were already trimmed. Anyway, what I've
> observed is that a large snap_trimq like that will block the OSD until
> all of those snaps are re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to do the
following (rough command sketch below):

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark the spinning OSDs
down (which would cause all sorts of primary and PG reassignments, and
useless backfill/recovery when the mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
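
In concrete commands, I'm thinking of something roughly like the
following (untested, and the sleep value of 30 is only an example;
injectargs changes the setting at runtime, so you'd also want it in
ceph.conf in case an OSD restarts):

  # stop client access first, then keep the cluster from reacting
  # to unresponsive OSDs
  ceph osd set nodown
  ceph osd set noout

  # crank up the sleep on all OSDs at runtime
  ceph tell osd.\* injectargs '--osd_snap_trim_sleep 30'

  # ...let the trimming run its course, watching ceph -w and the
  # OSD logs...

  # then revert to the default of 0 (or whatever you had before)
  # and clear the flags
  ceph tell osd.\* injectargs '--osd_snap_trim_sleep 0'
  ceph osd unset nodown
  ceph osd unset noout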

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers
Florian