On Thu, Sep 18, 2014 at 8:56 PM, Dan Van Der Ster <daniel.vanders...@cern.ch> wrote:
> Hi Florian,
>
> On Sep 18, 2014 7:03 PM, Florian Haas <flor...@hastexo.com> wrote:
>>
>> Hi Dan,
>>
>> saw the pull request, and can confirm your observations, at least
>> partially. Comments inline.
>>
>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>> <daniel.vanders...@cern.ch> wrote:
>> >>> Do I understand your issue report correctly in that you have found
>> >>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> >>> applied when iterating from PG to PG, rather than from snap to snap?
>> >>> If so, then I'm guessing that that can hardly be intentional…
>> >
>> >
>> > I’m beginning to agree with you on that guess. AFAICT, the normal behavior
>> > of the snap trimmer is to trim one single snap, the one which is in the
>> > snap_trimq but not yet in purged_snaps. So the only time the current sleep
>> > implementation could be useful is if we rm’d a snap across many PGs at
>> > once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem
>> > anyway since you’d at most need to trim O(100) PGs.
>>
>> Hmm. I'm actually seeing this in a system where the problematic snaps
>> could *only* have been RBD snaps.
>>
>
> True, as am I. The current sleep is useful in this case, but since we'd
> normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap
> across all of those PGs would finish rather quickly anyway. Latency would
> surely be increased momentarily, but I wouldn't expect 90s slow requests like
> I have with the 30000 snap_trimq single PG.
>
> Possibly the sleep is useful in both places.
>
>> > We could move the snap trim sleep into the SnapTrimmer state machine, for
>> > example in ReplicatedPG::NotTrimming::react. This should allow other IOs
>> > to get through to the OSD, but of course the trimming PG would remain
>> > locked. And it would be locked for even longer now due to the sleep.
>> >
>> > To solve that we could limit the number of trims per instance of the
>> > SnapTrimmer, like I’ve done in this pull req:
>> > https://github.com/ceph/ceph/pull/2516
>> > Breaking out of the trimmer like that should allow IOs to the trimming PG
>> > to get through.
>> >
>> > The second aspect of this issue is why are the purged_snaps being lost to
>> > begin with. I’ve managed to reproduce that on my test cluster. All you
>> > have to do is create many pool snaps (e.g. of a nearly empty pool), then
>> > rmsnap all those snapshots. Then use crush reweight to move the PGs
>> > around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps",
>> > which is one signature of this lost purged_snaps issue. To reproduce slow
>> > requests the number of snaps purged needs to be O(10000).
>>
>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to
>> purged_snaps", but only after the snap has been purged. See
>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>> fact that the OSD tries to trim a snap only to get an ENOENT is
>> probably indicative of something being fishy with the snaptrimq and/or
>> the purged_snaps list as well.
>>
>
> With such a long snap_trimq there in your log, I suspect you're seeing the
> exact same behavior as I am. In my case the first snap trimmed is snap 1, of
> course because that is the first rm'd snap, and the contents of your pool are
> surely different. I also see the ENOENT messages... again confirming those
> snaps were already trimmed.
> Anyway, what I've observed is that a large
> snap_trimq like that will block the OSD until they are all re-trimmed.
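Just so I try to reproduce the same thing: I read your recipe above as
roughly the following (a sketch on my side; the pool name, snap count and
reweight value are placeholders I made up, and the injectargs spelling is
from memory):

    # create a few thousand pool snapshots on a nearly empty test pool,
    # then remove them all again
    for i in $(seq 1 10000); do rados -p testpool mksnap snap-$i; done
    for i in $(seq 1 10000); do rados -p testpool rmsnap snap-$i; done

    # bump OSD debugging so "adding snap 1 to purged_snaps" shows in the logs
    ceph tell osd.* injectargs '--debug_osd 10'

    # move PGs around via a crush reweight, which is apparently where
    # purged_snaps gets lost
    ceph osd crush reweight osd.0 0.8

Is that about right?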
That's... a mess. So what is your workaround for recovery? My hunch would be
to (rough command sketch at the bottom of this mail):

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs down
  (which would cause all sorts of primary and PG reassignments, useless
  backfill/recovery when mon osd down out interval expires, etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that
  at least *between* PGs, the OSD has a chance to respond to heartbeats and
  do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers,
Florian
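P.S.: for the record, the nodown/noout and snap trim sleep steps above would
look something like the following on the command line (a sketch only; the
sleep value is just an illustration and the injectargs spelling is from
memory, so double-check before pasting this into a production cluster):

    # keep other OSDs from marking the busy OSDs down/out while they trim
    ceph osd set nodown
    ceph osd set noout

    # raise the inter-PG snap trim sleep at runtime on all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 30'

    # ... once the trimming has played itself out, revert everything
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0'
    ceph osd unset nodown
    ceph osd unset noout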