Hi Trey,

Greg's probably correct that a snap trim bug made these snapshots
untrimmable.
(Did you perhaps set osd_pg_max_concurrent_snap_trims = 0 at some point? See
https://tracker.ceph.com/issues/54396)
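
A quick, read-only sanity check of what the OSDs currently have
configured (this only reads the setting, it changes nothing):

  ceph config get osd osd_pg_max_concurrent_snap_trims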

There is one other very relevant MDS issue in this area that's been on
my todo list for a while -- so I'll do a mini brain dump here for the
record.
(Unless someone else has already fixed this.)

tl;dr: the MDS purges large files very, very slowly, and there's a trivial fix:

  ceph config set mds filer_max_purge_ops 40

The MDS has a pretty tight throttle on how quickly it purges rados
objects after files have been deleted by a client.
This is particularly noticeable when users delete very large files.

You can see the state of the "purge queue" by looking at perf dump
output of the relevant active MDS.
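One way to grab that output is via the tell interface of the active MDS
(the daemon name below is a placeholder for your cluster; look for the
purge_queue counters in the result):

  ceph tell mds.<name> perf dump
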
For example, from a previous case like this I worked on:

{
  "pq_executing_ops": 480372,
  "pq_executing_ops_high_water": 480436,
  "pq_executing": 1,
  "pq_executing_high_water": 56,
  "pq_executed": 30003,
  "pq_item_in_journal": 66734
}

In that example -- the purge queue contains 66734 items (i.e. files)
that the MDS needs to delete.
Each file is striped across several 4MB objects, and files are purged
one at a time. (That's the meaning of pq_executing = 1 there.)
In this case, the number of rados objects to delete for that 1 file
is pq_executing_ops = 480372.
So that file has 480372 x 4MB parts, i.e. roughly 1.92 TB.

The MDS purges those underlying rados objects by issuing up to
`filer_max_purge_ops` delete operations in parallel to the OSDs.
The default filer_max_purge_ops is only 10 -- meaning the MDS only ever
asks at most 10 OSDs at a time to delete rados objects.
I've found that increasing filer_max_purge_ops to 40 is a good fix,
even for very large, active clusters.
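
To double check what the MDS is actually running with after making the
change, something like this should do it (again, the daemon name is a
placeholder for your active MDS):

  ceph config get mds filer_max_purge_ops
  ceph tell mds.<name> config get filer_max_purge_ops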

Greg: Maybe we should change the default to 40 -- or, even better, the
MDS should use something like pg_num/2 or pg_num/4 of the data pool, so
that the effective filer purge ops auto-scale with cluster size.
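
For a rough sense of what pg_num/4 would mean on a given cluster, the
data pool's pg_num is easy to check (the pool name is a placeholder):

  ceph osd pool get <cephfs_data_pool> pg_num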

Hope that helps,

Dan

--
Dan van der Ster
Ceph Executive Council | CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | [email protected]



On Thu, Oct 2, 2025 at 6:32 AM Trey Palmer <[email protected]> wrote:
>
> Hi,
>
> Some months ago we deleted about 1.4PB net of CephFS data.  Approximately
> 600TB net of the space we expected to reclaim did not get reclaimed.
>
> The cluster in question is now on Reef, but it was on Pacific when this
> happened.  The cluster has 660 15TB NVMe OSDs on 55 nodes, plus 5 separate
> mon nodes.  We almost exclusively use it for CephFS, though there is a
> small RBD pool for VM images.  At the time we had a single MDS for the
> entire cluster, but we are now breaking that CephFS instance apart into
> multiple instances to spread the load (we tried multiple MDS ranks but it
> regressed performance and caused a near-disaster).
>
> Anyway, we often delete ~500TB at a time and we haven't previously run into
> this orphaned object problem.  At least, not at a scale big enough for us
> to notice -- generally I've gotten back the space I expected to free up.
>
> Also we have a 30PB raw spinning disk cluster running Quincy that we
> deleted 2 PB net from at the same time, and it didn't exhibit this
> problem.  So my assumption is that there is a bug in the version we were
> running that we exposed by deleting so much data at once.
>
> Our support vendor did some digging using some special utilities and found
> that there are a bunch of orphaned objects from 2 particular snapshots,
> where the MDS has no reference to the objects, but they still exist on disk.
>
> They feel that it would be somewhat difficult and risky to try to find and
> delete these objects via rados.  However, obviously, 600+ *net* TB of
> all-NVMe storage is quite a lot of money to just let go to waste.  That's
> effectively over 1.1 PB, once you figure EC overhead and the need to not
> fill the cluster over about 70%.
>
> So to reclaim the pool I have created a plan to move everything off of the
> pool, and then delete the pool.  Which we are largely needing to do anyway,
> due to splitting to multiple CephFS instances.
>
> My question is, am I likely to run into problems with this?  For example,
> will I be able to do 'ceph fs rm_data_pool'  once there are no longer any
> objects associated with the CephFS instance on the pool, or will the MDS
> have ghost object records that cause the command to balk?
>
> Thanks a lot for any insight,
>
> Trey Palmer
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
