[ceph-users] Re: multiple OSD crash, unfound objects

Michael Thomas Sun, 22 Nov 2020 11:14:37 -0800

Hi Frank,

From my understanding, with my current filesystem layout, I should be able to remove the "broken" pool once the data has been moved off of it. This is because the "broken" pool is not the default data pool. According to the documentation [1]:


  fs rm_data_pool <file system name> <pool name/id>

"This command removes the specified pool from the list of data pools for the file system. If any files have layouts for the removed data pool, the file data will become unavailable. The default data pool (when creating the file system) cannot be removed."

My default data pool (triply replicated on SSD) is still healthy. The "broken" pool is EC on HDD, and while it holds a majority of the filesystem data (~400TB), it is not the root of the filesystem.


My plan would be:

* Create a new data pool matching the "broken" pool

* Create a parallel directory tree matching the directories that are mapped to the "broken" pool, e.g. Broken: /ceph/frames/..., New: /ceph/frames.new/...

* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree to map the content to the new data pool

* Use parallel+rsync to copy data from the broken pool to the new pool.

* After each directory gets filled in the new pool, mv/rename the old and new directories so that users start accessing the data from the new pool.

* Delete data from the renamed old pool directories as they are replaced, to keep the OSDs from filling up

* After all data is moved off of the old pool (verified by checking ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs, as well as rados ls, ceph df), remove the pool from the fs.
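
As a rough illustration of the plan above, the per-directory steps could be wrapped in a small dry-run script. This is only a sketch under assumptions: the file system name "cephfs" and the pool names "fs-data-ec-old" and "fs-data-ec-new" are placeholders, the new pool is assumed to exist already, and a single rsync stands in for the parallel+rsync copy. By default it only prints the commands it would run.

```shell
# Dry-run sketch of the per-directory migration; prints commands by default.
# FS_NAME, OLD_POOL and NEW_POOL are placeholder names, not real pool names.
FS_NAME=${FS_NAME:-cephfs}
OLD_POOL=${OLD_POOL:-fs-data-ec-old}
NEW_POOL=${NEW_POOL:-fs-data-ec-new}

# Print the command; run it for real only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

# One-time step: make the (already created) new pool a data pool of the fs.
attach_new_pool() {
    run ceph fs add_data_pool "$FS_NAME" "$NEW_POOL"
}

# Copy one directory into a sibling mapped to the new pool, then swap names.
migrate_dir() {
    old=$1                 # e.g. /ceph/frames
    new="$old.new"
    run mkdir -p "$new"
    # Files created below $new inherit this layout and land in NEW_POOL.
    run setfattr -n ceph.dir.layout.pool -v "$NEW_POOL" "$new"
    run rsync -a "$old/" "$new/"
    run mv "$old" "$old.old"
    run mv "$new" "$old"
}

# Final step, only after nothing references OLD_POOL any more.
detach_old_pool() {
    run ceph fs rm_data_pool "$FS_NAME" "$OLD_POOL"
}
```

Calling "migrate_dir /ceph/frames" prints the five commands for that directory; setting DRY_RUN=0 would execute them. Deleting the renamed "/ceph/frames.old" tree is deliberately left as a manual step.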

This is effectively the same strategy I used when moving frequently accessed directories from the EC pool to a replicated SSD pool, except that in the previous situation I didn't need to remove any pools at the end. It's time consuming, because every file on the "broken" pool needs to be copied, but it minimizes downtime. Being able to add some temporary new OSDs to the new pool (but not the "broken" pool) would relieve some of the pressure on the OSDs as they fill up. If the old and new pools use the same crush rule, would disabling backfilling+rebalancing keep the OSDs from being used in the old pool until the old pool is deleted (with the exception of the occasional new file)?
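
The verification step at the end of the plan (checking ceph.file.layout.pool on every file) could be scripted roughly as below. This is a sketch, not a tested tool: "fs-data-ec-old" is a placeholder pool name, getfattr from the attr package is assumed to be installed, and file names containing newlines are not handled. An empty result would still need to be cross-checked with "rados -p <pool> ls" and "ceph df" before removing the pool.

```shell
# Placeholder name for the pool that should end up unreferenced.
OLD_POOL=${OLD_POOL:-fs-data-ec-old}

# Print the name of the data pool that a file's layout points at.
file_pool() {
    getfattr --only-values -n ceph.file.layout.pool "$1" 2>/dev/null
}

# List regular files under directory $1 whose layout still names OLD_POOL.
files_on_old_pool() {
    find "$1" -type f | while IFS= read -r f; do
        if [ "$(file_pool "$f")" = "$OLD_POOL" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running "files_on_old_pool /ceph" walks the whole tree, so on a file system of this size it will take a long time and put load on the MDS; running it per top-level directory is probably more practical.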


--Mike
[1] https://docs.ceph.com/en/latest/cephfs/administration/#file-systems



On 11/22/20 12:19 PM, Frank Schilder wrote:

Dear Michael,

I was also wondering whether deleting the broken pool could clean up
everything. The difficulty is that, while migrating a pool to new devices is
easy via a crush rule change, migrating data between pools is not so easy,
in particular if you can't afford downtime.

In case you can afford some downtime, it might be possible to migrate fast by
creating a new pool and using the pool copy command to migrate the data (rados
cppool ...). It's important that the FS is shut down (no MDS active) during this
copy process. After the copy, one could either rename the pools to have the copy
match the fs data pool name, or change the data pool at the top level
directory. You might need to set some pool metadata by hand, notably the fs
tag.

Having said that, I have no idea how a ceph fs reacts if presented with a 
replacement data pool. Although I don't believe that meta data contains the 
pool IDs, I cannot exclude that complication. The copy pool variant should be 
tested with an isolated FS first.

The other option is what you describe: create a new data pool, place the fs root
on this pool and copy every file onto itself. This should also do the
trick. However, with this method you will not be able to get rid of the broken
pool. After the copy, you could, however, reduce the number of PGs to below the
unhealthy one and the broken PG(s) might get deleted cleanly. Then you still
have a surplus pool, but at least all PGs are clean.

I hope one of these will work. Please post your experience here.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 22 November 2020 18:29:16
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/23/20 3:07 AM, Frank Schilder wrote:

Hi Michael.

I still don't see any traffic to the pool, though I'm also unsure how much 
traffic is to be expected.


Probably not much. If ceph df shows that the pool contains some objects, I 
guess that's sorted.

That osdmaptool crashes indicates that your cluster runs with corrupted 
internal data. I tested your crush map and you should get complete PGs for the 
fs data pool. That you don't and that osdmaptool crashes points at a corruption 
of internal data. I'm afraid this is the point where you need support from ceph 
developers and should file a tracker report 
(https://tracker.ceph.com/projects/ceph/issues). A short description of the 
origin of the situation with the osdmaptool output and a reference to this 
thread linked in should be sufficient. Please post a link to the ticket here.


https://tracker.ceph.com/issues/48059

In parallel, you should probably open a new thread focussed on the osd map 
corruption. Maybe there are low-level commands to repair it.


Will do.

You should hold off on cleaning up the unfound objects until this is
resolved. Not sure about adding further storage either. To me, this sounds
quite serious.


Another approach that I'm considering is to create a new pool using the
same set of OSDs, adding it to the set of cephfs data pools, and
migrating the data from the "broken" pool to the new pool.

I have some additional unused storage that I could add to this new pool,
if I can figure out the right crush rules to make sure they don't get
used for the "broken" pool too.
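
One possible way to keep spare OSDs out of the "broken" pool is a custom CRUSH device class, sketched below as a dry run. This rests on assumptions that may not hold here: the class name "hddnew", the OSD ids, the pool name and the k/m/PG values are all placeholders, and the approach only isolates the new OSDs if the crush rule of the "broken" pool is itself restricted to a different device class. Note also that a pool created this way would use only the tagged OSDs, not the shared set of existing OSDs.

```shell
# Dry-run sketch: tag spare OSDs with a custom device class and build an
# EC pool restricted to that class. All names and numbers are placeholders.

# Print the command; run it for real only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

# Move the given OSDs (e.g. osd.120 osd.121) into device class $1.
tag_osds() {
    class=$1; shift
    for osd in "$@"; do
        # An existing class must be removed before a new one can be set.
        run ceph osd crush rm-device-class "$osd"
        run ceph osd crush set-device-class "$class" "$osd"
    done
}

# Create an EC profile and pool that only place data on device class $2.
create_class_pool() {
    pool=$1 class=$2 k=$3 m=$4 pgs=$5
    run ceph osd erasure-code-profile set "$pool-profile" \
        k="$k" m="$m" crush-device-class="$class"
    run ceph osd pool create "$pool" "$pgs" "$pgs" erasure "$pool-profile"
    # CephFS data on an EC pool needs overwrites enabled.
    run ceph osd pool set "$pool" allow_ec_overwrites true
}
```

For example, "tag_osds hddnew osd.120 osd.121" followed by "create_class_pool fs-data-ec-new hddnew 4 2 128" prints the command sequence without touching the cluster.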

--Mike

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
