Thanks Dan. osd_pg_max_concurrent_snap_trims is set to the default of 1. filer_max_purge_ops is 300.
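
(For reference, those can be checked with something like the following --
assuming a recent ceph CLI, with <active-mds> as a placeholder for the
actual daemon name:

    ceph config get osd osd_pg_max_concurrent_snap_trims
    ceph config get mds filer_max_purge_ops
    ceph tell mds.<active-mds> perf dump purge_queue

The last command should show the pq_* counters described below; on the MDS
host, `ceph daemon mds.<active-mds> perf dump` gives the same numbers.)
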
We pretty routinely delete around 500TB of data before later deleting the
snapshots of it, and that hasn't caused this problem before.

On Fri, Oct 3, 2025 at 2:21 PM Dan van der Ster <[email protected]> wrote:

> Hi Trey,
>
> Greg's probably correct that a snap trim bug made these snapshots
> untrimmable. (Possibly you set osd_pg_max_concurrent_snap_trims = 0?
> See https://tracker.ceph.com/issues/54396)
>
> There is one other very relevant MDS issue in this area that's been on
> my todo list for a while -- so I'll do a mini brain dump here for the
> record (unless someone else has already fixed this).
>
> tl;dr: The MDS purges large files very, very slowly, and there's a
> trivial fix:
>
>     ceph config set mds filer_max_purge_ops 40
>
> The MDS has a pretty tight throttle on how quickly it purges rados
> objects after files have been deleted by a client. This is
> particularly noticeable if users delete very large files.
>
> You can see the state of the "purge queue" by looking at the perf dump
> output of the relevant active MDS. For example, from a previous case
> like this that I worked on:
>
>     {
>         "pq_executing_ops": 480372,
>         "pq_executing_ops_high_water": 480436,
>         "pq_executing": 1,
>         "pq_executing_high_water": 56,
>         "pq_executed": 30003,
>         "pq_item_in_journal": 66734
>     }
>
> In that example, the purge queue contains 66734 items (i.e. files)
> that the MDS needs to delete. Each file is striped across several 4MB
> objects, and files are purged one at a time (that's the meaning of
> pq_executing = 1 there). In this case, the number of rados objects to
> delete for that one file is pq_executing_ops = 480372, so that file
> has 480372 x 4MB parts == 1.92TB.
>
> The MDS purges those underlying rados objects by sending at most
> `filer_max_purge_ops` delete operations in parallel to the OSDs. The
> default filer_max_purge_ops is only 10 -- meaning the MDS is only ever
> asking at most 10 OSDs to delete rados objects at a time. I've found
> that increasing filer_max_purge_ops to 40 is a good fix, even for very
> large, active clusters.
>
> Greg: Maybe we should change the default to 40 -- or, even better, the
> MDS could use something like pg_num/2 or pg_num/4 of the data pool, so
> that the effective filer purge ops would auto-scale with cluster size.
>
> Hope that helps,
>
> Dan
>
> --
> Dan van der Ster
> Ceph Executive Council | CTO @ CLYSO
> Try our Ceph Analyzer -- https://analyzer.clyso.com/
> https://clyso.com | [email protected]
>
>
> On Thu, Oct 2, 2025 at 6:32 AM Trey Palmer <[email protected]> wrote:
> >
> > Hi,
> >
> > Some months ago we deleted about 1.4PB net of CephFS data.
> > Approximately 600TB net of the space we expected to reclaim did not
> > get reclaimed.
> >
> > The cluster in question is now on Reef, but it was on Pacific when
> > this happened. The cluster has 660 15TB NVMe OSDs on 55 nodes, plus
> > 5 separate mon nodes. We almost exclusively use it for CephFS,
> > though there is a small RBD pool for VM images. At the time we had
> > a single MDS for the entire cluster, but we are now breaking that
> > CephFS instance apart into multiple instances to spread the load
> > (we tried multiple MDS ranks, but that regressed performance and
> > caused a near-disaster).
> >
> > Anyway, we often delete ~500TB at a time and we haven't previously
> > run into this orphaned object problem. At least, not at a scale big
> > enough for us to notice -- generally I've gotten back the space I
> > expected to free up.
> >
> > Also, we have a 30PB raw spinning-disk cluster running Quincy that
> > we deleted 2 PB net from at the same time, and it didn't exhibit
> > this problem. So my assumption is that there is a bug in the
> > version we were running that we exposed by deleting so much data at
> > once.
> >
> > Our support vendor did some digging using some special utilities
> > and found that there are a bunch of orphaned objects from 2
> > particular snapshots, where the MDS has no reference to the objects
> > but they still exist on disk.
> >
> > They feel that it would be somewhat difficult and risky to try to
> > find and delete these objects via rados. However, obviously, 600+
> > *net* TB of all-NVMe storage is quite a lot of money to just let go
> > to waste. That's effectively over 1.1 PB, once you figure in EC
> > overhead and the need to keep the cluster below about 70% full.
> >
> > So to reclaim the space, I have created a plan to move everything
> > off of the pool and then delete the pool -- which we largely need
> > to do anyway, due to splitting into multiple CephFS instances.
> >
> > My question is: am I likely to run into problems with this? For
> > example, will I be able to do 'ceph fs rm_data_pool' once there are
> > no longer any objects associated with the CephFS instance on the
> > pool, or will the MDS have ghost object records that cause the
> > command to balk?
> >
> > Thanks a lot for any insight,
> >
> > Trey Palmer
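
Re the orphaned objects discussed in the quoted thread above: the manual
way to inspect a single suspect data-pool object looks roughly like the
following. This is only a sketch (not necessarily what the vendor's
tooling does), and it assumes the default CephFS object naming of
<inode-hex>.<stripe-index>, e.g. 10000000000.00000000:

    # list any snapshot clones still held for a data-pool object
    rados -p <data-pool> listsnaps 10000000000.00000000

    # decode the backtrace stored on the file's first object
    rados -p <data-pool> getxattr 10000000000.00000000 parent > parent.bin
    ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json

    # ask the MDS whether it still knows the inode (decimal inode number)
    printf '%d\n' 0x10000000000
    ceph tell mds.<active-mds> dump inode 1099511627776

If the MDS no longer knows the inode but the object (or its snapshot
clones) still exists in the pool, that's the orphan case described above;
actually removing such objects by hand is the difficult and risky part.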
