Hi,

Some months ago we deleted about 1.4PB net of CephFS data.  Approximately
600TB net of the space we expected to reclaim did not get reclaimed.

The cluster in question is now on Reef, but it was on Pacific when this
happened.  The cluster has 660 15TB NVMe OSDs on 55 nodes, plus 5 separate
mon nodes.  We almost exclusively use it for CephFS, though there is a
small RBD pool for VM images.  At the time we had a single MDS for the
entire cluster, but we are now breaking that CephFS instance apart into
multiple instances to spread the load (we tried multiple MDS ranks but it
regressed performance and caused a near-disaster).

Anyway, we often delete ~500TB at a time and we haven't previously run into
this orphaned object problem.  At least, not at a scale big enough for us
to notice -- generally I've gotten back the space I expected to free up.

Also, we have a 30PB raw spinning-disk cluster running Quincy that we
deleted 2PB net from at the same time, and it didn't exhibit this
problem.  So my assumption is that there is a bug in the version we were
running that we exposed by deleting so much data at once.

Our support vendor did some digging using some special utilities and found
that there are a bunch of orphaned objects from 2 particular snapshots,
where the MDS has no reference to the objects, but they still exist on disk.

They feel that it would be somewhat difficult and risky to try to find and
delete these objects via rados.  However, obviously, 600+ *net* TB of
all-NVMe storage is quite a lot of money to just let go to waste.  That's
effectively over 1.1PB once you figure in EC overhead and the need to keep
the cluster below about 70% full.
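
For what it's worth, my rough understanding of what a rados-level cleanup
would involve is something like the sketch below (pool name and mount point
are placeholders), which is part of why it feels risky:

    # CephFS data objects are named <inode-in-hex>.<block-index>, so collapse
    # the pool's object listing down to unique inode prefixes:
    rados -p cephfs_data ls | cut -d. -f1 | sort -u > /tmp/pool_inodes

    # Collect the inode numbers of everything still visible in the mounted
    # filesystem (GNU find prints decimal inodes, so convert to hex):
    find /mnt/cephfs -printf '%i\n' | awk '{printf "%x\n", $1}' | sort -u > /tmp/fs_inodes

    # Inodes present in the pool but absent from the tree are orphan *candidates*:
    comm -23 /tmp/pool_inodes /tmp/fs_inodes

The big caveat is that a plain find doesn't see files that only exist under
.snap directories, and deleting the wrong objects is unrecoverable, so I can
see why they'd rather not go down that road.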

So to reclaim the space, my plan is to move everything off of the pool and
then delete the pool, which we largely need to do anyway as part of
splitting into multiple CephFS instances.
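
Roughly, the mechanics I'm picturing are the following (filesystem and pool
names are placeholders, and the real work is copying the data over to the
new CephFS instances):

    # If we keep the existing filesystem, attach the replacement data pool
    # and point new files at it:
    ceph fs add_data_pool cephfs cephfs_data_new
    setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs/newdir

    # Directory layouts only affect newly created files, so existing data
    # still has to be copied/rsynced over (which we are doing anyway as part
    # of the split into multiple filesystems).

    # Once nothing is left on the old pool, detach it and delete it:
    ceph fs rm_data_pool cephfs cephfs_data_old
    ceph osd pool rm cephfs_data_old cephfs_data_old --yes-i-really-really-mean-it

(The last step also needs mon_allow_pool_delete set, of course.)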

My question is, am I likely to run into problems with this?  For example,
will I be able to do 'ceph fs rm_data_pool' once there are no longer any
objects associated with the CephFS instance on the pool, or will the MDS
have ghost object records that cause the command to balk?
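
Before attempting that step I was planning to sanity-check that the pool
really reports as empty, e.g. something like:

    ceph df detail | grep cephfs_data_old
    rados -p cephfs_data_old ls | head

but I'm not sure whether an empty listing is enough for rm_data_pool to
succeed, or whether leftover MDS-side state could still get in the way.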

Thanks a lot for any insight,

Trey Palmer