On Fri, Oct 3, 2025 at 11:21 AM Dan van der Ster <[email protected]> wrote:
>
> Hi Trey,
>
> Greg's probably correct that a snap trim bug made these snapshots untrimmable.
> (Possibly you set osd_pg_max_concurrent_snap_trims = 0? See
> https://tracker.ceph.com/issues/54396)
>
> There is one other very relevant MDS issue in this area that's been on my
> todo list for a while -- so I'll do a mini brain dump here for the record.
> (Unless someone else already fixed this.)
>
> tl;dr: the MDS purges large files very, very slowly, and there's a trivial fix:
>
>     ceph config set mds filer_max_purge_ops 40
>
> The MDS has a pretty tight throttle on how quickly it purges RADOS objects
> after files have been deleted by a client. This is particularly noticeable
> when users delete very large files.
>
> You can see the state of the purge queue by looking at the perf dump output
> of the relevant active MDS. For example, from a previous case like this that
> I worked on:
>
>     {
>         "pq_executing_ops": 480372,
>         "pq_executing_ops_high_water": 480436,
>         "pq_executing": 1,
>         "pq_executing_high_water": 56,
>         "pq_executed": 30003,
>         "pq_item_in_journal": 66734
>     }
>
> In that example, the purge queue contains 66734 items (i.e. files) that the
> MDS needs to delete. Each file is striped across several 4 MB objects, and
> files are purged one at a time (that's the meaning of pq_executing = 1 there).
> In this case, the number of RADOS objects to delete for that one file is
> pq_executing_ops = 480372, so that one file is 480372 x 4 MB parts, i.e.
> roughly 1.92 TB.
>
> The MDS purges those underlying RADOS objects by sending up to
> `filer_max_purge_ops` delete operations in parallel to the OSDs. The default
> filer_max_purge_ops is only 10 -- meaning the MDS is, at most, only ever
> asking 10 OSDs to delete objects at any one time. I've found that increasing
> filer_max_purge_ops to 40 is a good fix, even for very large, active clusters.
>
> Greg: Maybe we should change the default to 40 -- or, even better, the MDS
> should use something like pg_num/2 or pg_num/4 for the data pool, so that the
> effective filer purge ops auto-scale with cluster size.
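(For anyone who wants to check this on their own cluster, here's a rough sketch
of the steps Dan describes. The exact invocations are mine, not his: <name> is
a placeholder for the active MDS daemon id, and the perf dump has to be taken
via the admin socket on the host running that daemon.)

    # Find the active MDS for the filesystem
    ceph fs status

    # On that MDS's host, look at the purge queue counters:
    #   pq_item_in_journal - files still waiting to be purged
    #   pq_executing_ops   - RADOS delete ops outstanding for the file(s) being purged now
    ceph daemon mds.<name> perf dump | grep '"pq_'

    # Check the current Filer throttle (default 10), then raise it per Dan's suggestion
    ceph config get mds filer_max_purge_ops
    ceph config set mds filer_max_purge_ops 40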
I think some kind of change here would be very sensible, but I'm not sure how
to formulate it. Do we just want to deprecate filer_max_purge_ops and introduce
a new config like filer_pgs_per_purge_op or something? We can't simply increase
the static default because the size of a Ceph cluster varies so much -- many of
the Kubernetes deployments are only 3 OSDs. :/

On Mon, Oct 6, 2025 at 9:02 AM Trey Palmer <[email protected]> wrote:
> Frédéric,
>
> Thanks so much for looking into this.
>
> The documentation isn't all that clear, but my impression has been that
> pool snapshots are an entirely different thing from CephFS snapshots.
>
> At least the documentation says this, and it sounds from the bug report
> you posted like it's dealing with mon-managed snapshots?
>
>     To avoid snap id collision between mon-managed snapshots and file
>     system snapshots, pools with mon-managed snapshots are not allowed to
>     be attached to a file system. Also, mon-managed snapshots can’t be
>     created in pools already attached to a file system either.
>
> I'd love for my impression to be incorrect and to be able to fix it this
> way, though!
>
> Thanks again,
>
> Trey
>
>
> On Mon, Oct 6, 2025 at 10:26 AM Frédéric Nass <[email protected]> wrote:
>
>> Hi Greg,
>>
>> This one? https://tracker.ceph.com/issues/64646
>>
>> Symptoms:
>> - CLONES are reported by 'rados df' while the pool has no snapshots.
>> - 'rados lssnap -p <pool_name>' shows no snapshots, but some clones are
>>   still listed by 'rados listsnaps -p <pool_name> <object_name>', sometimes
>>   even with no 'head' object.
>>
>> @Trey, if this is the one --- make sure it is before running the command ---
>> running 'ceph osd pool force-remove-snap <pool_name>' should put all the
>> leaked clone objects back in the trim queue, and the OSDs should get rid
>> of them.
>>
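(To recap Frédéric's checks in one sequence -- <pool_name> and <object_name>
are placeholders here, and the last command should only be run once you're sure
the symptoms match the tracker above; force-remove-snap may not exist in older
releases.)

    # Clones reported against the pool even though it has no snapshots?
    rados df
    rados lssnap -p <pool_name>

    # Individual objects may still list clones, sometimes with no 'head' object
    rados listsnaps -p <pool_name> <object_name>

    # If (and only if) the symptoms match https://tracker.ceph.com/issues/64646,
    # this should put the leaked clones back in the trim queue for the OSDs to remove
    ceph osd pool force-remove-snap <pool_name>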
Right, you probably aren't using pool snapshots. There were a couple of issues
like this, and Matan might know the others, though?
-Greg
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]