On Fri, Oct 3, 2025 at 11:21 AM Dan van der Ster <[email protected]> wrote:
>
> Hi Trey,
>
> Greg's probably correct that a snap trim bug made these snapshots untrimmable.
> (Possibly you set osd_pg_max_concurrent_snap_trims = 0? See
> https://tracker.ceph.com/issues/54396)
>
> There is one other very relevant MDS issue in this area that's been on my
> todo list for a while -- so I'll do a mini brain dump here for the record.
> (Unless someone else already fixed this.)
>
> tl;dr: the MDS purges large files very, very slowly, and there's a trivial fix:
>
>     ceph config set mds filer_max_purge_ops 40
>
> The MDS has a pretty tight throttle on how quickly it purges RADOS objects
> after files have been deleted by a client. This is particularly noticeable
> when users delete very large files.
>
> You can see the state of the purge queue by looking at the perf dump output
> of the relevant active MDS. For example, from a previous case like this that
> I worked on:
>
>     {
>         "pq_executing_ops": 480372,
>         "pq_executing_ops_high_water": 480436,
>         "pq_executing": 1,
>         "pq_executing_high_water": 56,
>         "pq_executed": 30003,
>         "pq_item_in_journal": 66734
>     }
>
> In that example, the purge queue contains 66734 items (i.e. files) that the
> MDS needs to delete. Each file is striped across several 4 MB objects, and
> files are purged one at a time (that's the meaning of pq_executing = 1 there).
> In this case, the number of RADOS objects to delete for that one file is
> pq_executing_ops = 480372, so that one file is 480372 x 4 MB parts, i.e.
> roughly 1.92 TB.
>
> The MDS purges those underlying RADOS objects by sending up to
> `filer_max_purge_ops` delete operations in parallel to the OSDs. The default
> filer_max_purge_ops is only 10 -- meaning the MDS is, at most, only ever
> asking 10 OSDs to delete objects at any one time. I've found that increasing
> filer_max_purge_ops to 40 is a good fix, even for very large, active clusters.
>
> Greg: Maybe we should change the default to 40 -- or, even better, the MDS
> should use something like pg_num/2 or pg_num/4 for the data pool, so that the
> effective filer purge ops auto-scale with cluster size.
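(For anyone who wants to check this on their own cluster, here's a rough sketch
of the steps Dan describes. The exact invocations are mine, not his: <name> is
a placeholder for the active MDS daemon id, and the perf dump has to be taken
via the admin socket on the host running that daemon.)

    # Find the active MDS for the filesystem
    ceph fs status

    # On that MDS's host, look at the purge queue counters:
    #   pq_item_in_journal - files still waiting to be purged
    #   pq_executing_ops   - RADOS delete ops outstanding for the file(s) being purged now
    ceph daemon mds.<name> perf dump | grep '"pq_'

    # Check the current Filer throttle (default 10), then raise it per Dan's suggestion
    ceph config get mds filer_max_purge_ops
    ceph config set mds filer_max_purge_ops 40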
I think some kind of change here would be very sensible, but I'm not sure how
to formulate it. Do we just want to deprecate filer_max_purge_ops and introduce
a new config like filer_pgs_per_purge_op or something? We can't simply increase
the static default because the size of a Ceph cluster varies so much -- many of
the Kubernetes deployments are only 3 OSDs. :/

On Mon, Oct 6, 2025 at 9:02 AM Trey Palmer <[email protected]> wrote:
> Frédéric,
>
> Thanks so much for looking into this.
>
> The documentation isn't all that clear, but my impression has been that
> pool snapshots are an entirely different thing from CephFS snapshots.
>
> At least the documentation says this, and it sounds from the bug report
> you posted like it's dealing with mon-managed snapshots?
>
>     To avoid snap id collision between mon-managed snapshots and file
>     system snapshots, pools with mon-managed snapshots are not allowed to
>     be attached to a file system. Also, mon-managed snapshots can’t be
>     created in pools already attached to a file system either.
>
> I'd love for my impression to be incorrect and to be able to fix it this
> way, though!
>
> Thanks again,
>
> Trey
>
>
> On Mon, Oct 6, 2025 at 10:26 AM Frédéric Nass <[email protected]> wrote:
>
>> Hi Greg,
>>
>> This one? https://tracker.ceph.com/issues/64646
>>
>> Symptoms:
>> - CLONES are reported by 'rados df' while the pool has no snapshots.
>> - 'rados lssnap -p <pool_name>' shows no snapshots, but some clones are
>>   still listed by 'rados listsnaps -p <pool_name> <object_name>', sometimes
>>   even with no 'head' object.
>>
>> @Trey, if this is the one --- make sure it is before running the command ---
>> running 'ceph osd pool force-remove-snap <pool_name>' should put all the
>> leaked clone objects back in the trim queue, and the OSDs should get rid
>> of them.
>>
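(To recap Frédéric's checks in one sequence -- <pool_name> and <object_name>
are placeholders here, and the last command should only be run once you're sure
the symptoms match the tracker above; force-remove-snap may not exist in older
releases.)

    # Clones reported against the pool even though it has no snapshots?
    rados df
    rados lssnap -p <pool_name>

    # Individual objects may still list clones, sometimes with no 'head' object
    rados listsnaps -p <pool_name> <object_name>

    # If (and only if) the symptoms match https://tracker.ceph.com/issues/64646,
    # this should put the leaked clones back in the trim queue for the OSDs to remove
    ceph osd pool force-remove-snap <pool_name>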
Right, you probably aren't using pool snapshots. There were a couple of issues
like this, and Matan might know the others, though?
-Greg
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]