Hi,

Some months ago we deleted about 1.4 PB net of CephFS data. Approximately 600 TB net of the space we expected to reclaim did not get reclaimed.
The cluster in question is now on Reef, but it was on Pacific when this happened. It has 660 15 TB NVMe OSDs across 55 nodes, plus 5 separate mon nodes. We use it almost exclusively for CephFS, though there is a small RBD pool for VM images. At the time we had a single MDS for the entire cluster, but we are now breaking that CephFS instance apart into multiple instances to spread the load (we tried multiple MDS ranks, but that regressed performance and caused a near-disaster).

Anyway, we often delete ~500 TB at a time and we haven't previously run into this orphaned-object problem -- at least, not at a scale big enough for us to notice; generally I've gotten back the space I expected to free up. We also have a 30 PB raw spinning-disk cluster running Quincy that we deleted 2 PB net from at the same time, and it didn't exhibit this problem. So my assumption is that there is a bug in the version we were running that we exposed by deleting so much data at once.

Our support vendor did some digging with some special utilities and found a large number of orphaned objects from 2 particular snapshots: the MDS has no reference to the objects, but they still exist on disk. They feel it would be somewhat difficult and risky to try to find and delete these objects via rados (a rough sketch of what I understand that would involve is below, after my question). However, 600+ TB *net* of all-NVMe storage is obviously quite a lot of money to just let go to waste -- effectively over 1.1 PB once you figure in EC overhead and the need to keep the cluster under about 70% full.

So to reclaim the space I have made a plan to move everything off of the pool and then delete the pool, which we largely need to do anyway due to the split into multiple CephFS instances.

My question is: am I likely to run into problems with this? For example, will I be able to run 'ceph fs rm_data_pool' once there are no longer any objects associated with the CephFS instance on the pool, or will the MDS have ghost object records that cause the command to balk? The overall sequence I'm planning is also sketched below.
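For context, this is roughly the kind of manual rados cleanup I understand would be involved, and why it feels risky. The pool/fs names and the example inode are just placeholders, and the actual deletion step is deliberately left commented out:

    # CephFS data objects are named <inode-hex>.<block-index>, so the distinct
    # inode prefixes in the data pool can be enumerated (slow on a pool this size):
    rados -p cephfs_data ls | awk -F. '{print $1}' | sort -u > /tmp/pool_inodes.txt

    # For a given inode (hex converted to decimal), ask the MDS whether it still
    # knows about it; an error here suggests the objects may be orphaned:
    ceph tell mds.cephfs:0 dump inode $((16#10000000aa1))

    # Actually reclaiming the space would mean removing every object belonging to
    # each confirmed-orphan inode, complicated further by the snapshot clones --
    # this is the part the vendor considers risky, so it is only a comment here:
    # rados -p cephfs_data ls | grep '^10000000aa1\.' | xargs -n 1 rados -p cephfs_data rm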
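And for concreteness, the sequence I have in mind for retiring the pool (again, pool and fs names are placeholders; step 2 is the part I'm unsure about):

    # 1. After everything has been migrated off, check whether any objects remain
    ceph df detail
    rados -p cephfs_data_old ls | head

    # 2. Detach the pool from the filesystem -- this is where I'm worried the MDS
    #    might balk on leftover/ghost references
    ceph fs rm_data_pool cephfs cephfs_data_old

    # 3. Delete the pool itself to reclaim the space
    ceph config set mon mon_allow_pool_delete true
    ceph osd pool rm cephfs_data_old cephfs_data_old --yes-i-really-really-mean-it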
Thanks a lot for any insight,

Trey Palmer
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]