[ceph-users] Invalid crush class
In 15.2.7, how can I remove an invalid crush class? I'm surprised that I was able to create it in the first place:

[root@ceph1 bin]# ceph osd crush class ls
[
    "ssd",
    "JBOD.hdd",
    "nvme",
    "hdd"
]
[root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd
Invalid command: invalid chars . in JBOD.hdd
osd crush class ls-osd <class> :  list all osds belonging to the specific <class>
Error EINVAL: invalid command

There are no devices mapped to this class:

[root@ceph1 bin]# ceph osd crush tree | grep JBOD | wc -l
0

--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
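The "invalid chars . in JBOD.hdd" error comes from the CLI's character validation on class names, which the class-creation path apparently did not apply. A minimal sketch of that kind of check; the exact allowed character set here is an assumption, not taken from the Ceph source:

```python
import re

# Assumed allowed set: letters, digits, hyphen, underscore. This is a guess
# at the CLI's "goodchars" validation; check the Ceph source for the real rule.
VALID_CLASS = re.compile(r'^[A-Za-z0-9_-]+$')

def is_valid_class_name(name: str) -> bool:
    """Return True if `name` would pass the assumed character check."""
    return bool(VALID_CLASS.match(name))
```

Under this rule, "ssd", "nvme", and "hdd" pass while "JBOD.hdd" is rejected, matching the error above.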
[ceph-users] Re: Rebalance after draining - why?
Try this:

  ceph osd crush reweight osd.XX 0

--Mike

On 5/28/22 15:02, Nico Schottelius wrote:
> Good evening dear fellow Ceph'ers,
>
> When removing OSDs from a cluster, we sometimes use
>
>   ceph osd reweight osd.XX 0
>
> and wait until the OSD's content has been redistributed. However, when we then finally stop and remove it, Ceph rebalances again. I assume this is because a position is removed from the CRUSH map, so the logical placement becomes "wrong". (Am I wrong about that?)
>
> I wonder: is there a way to tell Ceph properly that a particular OSD is planned to leave the cluster, so that it moves the data to the "correct new position" up front instead of doing the rebalance dance twice?
>
> Best regards,
> Nico
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
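The reason `crush reweight` avoids the second rebalance: `ceph osd reweight` only sets a temporary 0..1 override that is multiplied with the CRUSH weight, so the OSD's bucket stays in the CRUSH map and deleting it later changes the map again; `ceph osd crush reweight ... 0` zeroes the CRUSH weight itself, so the eventual removal no longer moves data. A toy sketch of the relationship:

```python
def effective_weight(crush_weight: float, reweight: float) -> float:
    """Placement uses crush_weight * reweight, where reweight is the
    0..1 override set by `ceph osd reweight`."""
    return crush_weight * reweight

# `ceph osd reweight osd.XX 0`: effective weight is 0, but the CRUSH
# weight is intact, so removing the OSD later still changes the map.
drained_via_override = effective_weight(2.67029, 0.0)

# `ceph osd crush reweight osd.XX 0`: the CRUSH weight itself is 0, so
# the later `osd rm` is a no-op for data placement.
drained_via_crush = effective_weight(0.0, 1.0)
```

Both paths drain the OSD; only the second leaves the CRUSH map already in its final shape.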
[ceph-users] Re: managed block storage stopped working
On 1/7/22 16:49, Marc wrote:
>> Where else can I look to find out why the managed block storage isn't accessible anymore?
>
> ceph -s ? I guess it is not showing any errors, and there is probably nothing wrong with ceph. You can do an rbdmap and see if you can just map an image. Then try mapping an image with the user credentials ovirt is using; maybe some auth key has been deleted.

Finally figured out the problem. A routing change on our core switch was preventing the OSDs from talking to the ovirt engine. For some reason I thought the engine delegated all disk creation/attach operations to the ovirt hosts, but it seems that the engine still needs to be able to reach the OSDs. Access to the rbd images from the hosts was working fine, but access to the rbd images (and OSDs) from the engine was failing.

After adding back the missing route, I'm able to create and attach new ceph rbd volumes.

Thanks for the nudge in the right direction,

--Mike
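Since the root cause was routing, a quick sanity check from the engine host is a plain TCP connect to the cluster's ports (mons usually listen on 3300/6789, OSDs typically in the 6800-7300 range). A generic sketch, not ovirt- or ceph-specific; hostnames below are placeholders:

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage from the engine host (placeholder mon names):
# for mon in ("ceph1", "ceph2", "ceph3"):
#     print(mon, reachable(mon, 6789))
```

If the mons answer but OSD ports do not, that points at exactly the kind of partial routing problem described above.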
[ceph-users] managed block storage stopped working
...sorta.

I have an ovirt-4.4.2 system installed a couple of years ago, with managed block storage set up using ceph Octopus[1]. This has been working well since it was originally set up.

In late November we had some network issues on one of our ovirt hosts, as well as a separate network issue that took many ceph OSDs offline. This was eventually recovered, and 2 of the 3 VMs that use managed block storage started working again. The third did not. We eventually discovered that ovirt was not able to access the ceph rbd images, which is odd because two VMs are actively reading and writing to ceph block devices. We are also no longer able to create new ovirt disks using the managed block driver.

/var/log/cinderlib/cinderlib.log on the ovirt-engine is empty. /var/log/ovirt-engine/engine.log shows the attempt to connect to the storage, which eventually errors out with no helpful message:

2022-01-07 11:36:47,398-06 INFO [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default task-1) [6613fac6-dd2f-4d22-993b-d805b2b572cd] Running command: AttachDiskToVmCommand internal: false. Entities affected : ID: 804b259a-c580-436b-a5ba-decdd0a2ccbd Type: VMAction group CONFIGURE_VM_STORAGE with role type USER, ID: 32c537e9-42cf-4648-b33b-2723374416e1 Type: DiskAction group ATTACH_DISK with role type USER
2022-01-07 11:36:47,415-06 INFO [org.ovirt.engine.core.bll.storage.disk.managedblock.ConnectManagedBlockStorageDeviceCommand] (default task-1) [46265b18] Running command: ConnectManagedBlockStorageDeviceCommand internal: true.
2022-01-07 11:39:00,248-06 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Thread pool 'default' is using 0 threads out of 1, 5 threads waiting for tasks.
2022-01-07 11:39:00,248-06 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Thread pool 'engine' is using 0 threads out of 500, 32 threads waiting for tasks and 0 tasks in queue.
2022-01-07 11:39:00,248-06 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Thread pool 'engineScheduledThreadPool' is using 0 threads out of 1, 100 threads waiting for tasks.
2022-01-07 11:39:00,248-06 INFO [org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] (EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) [] Thread pool 'engineThreadMonitoringThreadPool' is using 1 threads out of 1, 0 threads waiting for tasks.
2022-01-07 11:41:19,774-06 INFO [org.ovirt.engine.core.bll.aaa.LoginOnBehalfCommand] (default task-6) [103222ef] Running command: LoginOnBehalfCommand internal: true.
2022-01-07 11:41:19,832-06 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-6) [103222ef] EVENT_ID: USER_LOGIN_ON_BEHALF(1,401), Executed login on behalf - for user admin.
2022-01-07 11:41:19,848-06 INFO [org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-6) [32106489] Running command: LogoutSessionCommand internal: true.
2022-01-07 11:41:19,853-06 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-6) [32106489] EVENT_ID: USER_VDC_LOGOUT(31), User SYSTEM connected from 'UNKNOWN' using session 'pSzmWpAZSakSozpj4HQF2bic6EKUClj5wni+i9GPIlmdLIqfnAG9LYqb2MbO34fOuskBvjmTPbe4WRGFWUfmbQ==' logged out.
2022-01-07 11:41:47,405-06 ERROR [org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand'.

Where else can I look to find out why the managed block storage isn't accessible anymore?

--Mike

[1] https://lists.ovirt.org/archives/list/us...@ovirt.org/thread/KHCLXVOCELHOR3G7SH3GDPGRKITCW7UY/
[ceph-users] Re: [External Email] Re: ceph-objectstore-tool core dump
On 10/4/21 11:57 AM, Dave Hall wrote:
> I also had a delay on the start of the repair scrub when I was dealing with this issue. I ultimately increased the number of simultaneous scrubs, but I think you could also temporarily disable scrubs and then re-issue the 'pg repair'. (But I'm not one of the experts on this.)
>
> My perception is that between EC pools, large HDDs, and the overall OSD count, there might need to be some tuning to assure that scrubs can get scheduled: A large HDD contains pieces of more PGs. Each PG in an EC pool is spread across more disks than in a replicated pool. Thus, especially if the number of OSDs is not large, there is an increased chance that more than one scrub will want to read the same OSD. A scheduling nightmare if the number of simultaneous scrubs is low and client traffic is given priority.
>
> -Dave

That seemed to be the case. After ~24 hours, 1 of the 8 repair tasks had completed. Unfortunately, it found another error that wasn't present before. After checking the SMART logs, it looks like this particular disk is failing. No sense in pursuing this any further; I'll be replacing it with a spare instead.

I'll look into disabling scrubs the next time I need to schedule a repair. Hopefully that will let the repair jobs run a bit sooner.

Regards,

--Mike
[ceph-users] Re: ceph-objectstore-tool core dump
On 10/3/21 12:08, 胡 玮文 wrote: 在 2021年10月4日,00:53,Michael Thomas 写道: I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph cluster. I was able to determine that they are all coming from the same OSD: osd.143. This host recently suffered from an unplanned power loss, so I'm not surprised that there may be some corruption. This PG is part of a EC 8+2 pool. The OSD logs from the PG's primary OSD show this and similar errors from the PG's most recent deep scrub: 2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log [ERR] : 23.1fa shard 143(1) soid 23:5f8c3d4e:::1179969.0168:head : candidate had a read error In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the PG. This accomplished nothing. Next I ran a shallow fsck on the OSD: I expect this ‘ceph pg repair’ command could handle this kind of errors. After issuing this command, the pg should enter a state like “active+clean+scrubbing+deep+inconsistent+repair”, then you wait for the repair to finish (this can take hours), and you should be able to recover from the inconsistent state. What do you mean by “This accomplished nothing”? The PG never entered the 'repair' state, nor did anything appear in the primary OSD logs about a request for repair. After more than 24 hours, the PG remained listed as 'inconsistent'. --Mike ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] ceph-objectstore-tool core dump
I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph cluster. I was able to determine that they are all coming from the same OSD: osd.143. This host recently suffered from an unplanned power loss, so I'm not surprised that there may be some corruption. This PG is part of an EC 8+2 pool.

The OSD logs from the PG's primary OSD show this and similar errors from the PG's most recent deep scrub:

2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log [ERR] : 23.1fa shard 143(1) soid 23:5f8c3d4e:::1179969.0168:head : candidate had a read error

In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the PG. This accomplished nothing. Next I ran a shallow fsck on the OSD:

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-143
fsck success

I estimated that a deep fsck will take ~24 hours to run on this mostly full 16TB HDD. Before doing that, I wanted to see if I could simply remove the offending object and let ceph recover itself. Unfortunately, ceph-objectstore-tool core dumps when I try to remove this object:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 --pgid 23.1fa '{"oid":"1179969.0168","key":"","snapid":-2,"hash":1924936186,"max":0,"pool":23,"namespace":"","shard_id":1,"max":0}' remove
*** Caught signal (Segmentation fault) **
 in thread 7fdc491a88c0 thread_name:ceph-objectstor
 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)
 1: (()+0xf630) [0x7fdc3e62a630]
 2: (__pthread_rwlock_rdlock()+0xb) [0x7fdc3e62614b]
 3: (BlueStore::collection_bits(boost::intrusive_ptr&)+0x148) [0x5583c8fa7878]
 4: (main()+0x4b50) [0x5583c8a85270]
 5: (__libc_start_main()+0xf5) [0x7fdc3cfe7555]
 6: (()+0x39d3a0) [0x5583c8ab03a0]
Segmentation fault (core dumped)

As a last resort, I know that I can map this OID back to the cephfs file and simply remove/restore the offending file to fix the object. But before I do that, I'm running a deep fsck to see if that can fix this and the other inconsistent objects. In the meantime, I wondered if there was anything else I could do to clean up this inconsistent PG?

--Mike
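Mapping the OID back to a cephfs file works because cephfs data-pool object names have the form `<inode-in-hex>.<stripe-index>`: strip the stripe suffix, parse the hex, and hand the decimal inode to `find <mount> -inum`. A small sketch of that conversion:

```python
def oid_to_inode(oid: str) -> int:
    """Convert a cephfs data-pool object name like '1179969.0168'
    to the decimal inode number usable with `find <mount> -inum <n>`."""
    hex_inode = oid.split(".")[0]
    return int(hex_inode, 16)

# e.g. the object from the scrub error above:
#   find /path/to/cephfs -inum $(python3 -c 'print(0x1179969)')
```

This is the same conversion as the shell idiom `printf "%d\n" 0x${oid%%.*}`.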
[ceph-users] cephfs auditing
Is there a way to log or track which cephfs files are being accessed? This would help us plan where to place certain datasets based on popularity, e.g. on an EC HDD pool or a replicated SSD pool.

I know I can run inotify on the ceph clients, but I was hoping that the MDS would have a way to log this information centrally.

--Mike
[ceph-users] Re: HEALTH_WARN - Recovery Stuck?
I recently had a similar issue when reducing the number of PGs on a pool. A few OSDs became backfillfull even though there was enough space overall; the OSDs were just not balanced well.

To fix it, I reweighted the most-full OSDs:

  ceph osd reweight-by-utilization 120

After it finished (~1 hour), I had fewer backfillfull OSDs. I repeated this 2 more times, after which the OSDs were no longer backfillfull and recovery data movement resumed. Once the recovery was complete, I reweighted all OSDs back to 1.0, and all was fine.

--Mike

On 4/12/21 12:30 PM, Ml Ml wrote:
> Hello,
>
> I kind of ran out of disk space, so I added another host with osd.37. But it does not seem to move much data onto it (85MB in 2h). Any idea why the recovery process seems to be stuck? Should I fix the 4 backfillfull OSDs first (by changing the weight)?
>
> root@ceph01:~# ceph -s
>   cluster:
>     id:     5436dd5d-83d4-4dc8-a93b-60ab5db145df
>     health: HEALTH_WARN
>             4 backfillfull osd(s)
>             9 nearfull osd(s)
>             Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
>             4 pool(s) backfillfull
>
>   services:
>     mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
>     mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
>     mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
>     osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs
>
>   task status:
>     scrub status:
>         mds.backup.ceph06.hdjehi: idle
>
>   data:
>     pools:   4 pools, 1185 pgs
>     objects: 24.69M objects, 45 TiB
>     usage:   149 TiB used, 42 TiB / 191 TiB avail
>     pgs:     5388809/74059569 objects misplaced (7.276%)
>              950 active+clean
>              232 active+remapped+backfill_wait
>              2   active+remapped+backfilling
>              1   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     recovery: 0 B/s, 171 keys/s, 16 objects/s
>
>   progress:
>     Rebalancing after osd.37 marked in (2h)
>       [] (remaining: 6d)
>
> root@ceph01:~# ceph health detail
> HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull; 4 pool(s) backfillfull
> [WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
>     osd.28 is backfill full
>     osd.32 is backfill full
>     osd.66 is backfill full
>     osd.68 is backfill full
> [WRN] OSD_NEARFULL: 9 nearfull osd(s)
>     osd.11 is near full
>     osd.24 is near full
>     osd.27 is near full
>     osd.39 is near full
>     osd.40 is near full
>     osd.42 is near full
>     osd.43 is near full
>     osd.45 is near full
>     osd.69 is near full
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
>     pg 23.295 is active+remapped+backfill_wait+backfill_toofull, acting [8,67,32]
> [WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
>     pool 'backurne-rbd' is backfillfull
>     pool 'device_health_metrics' is backfillfull
>     pool 'cephfs.backup.meta' is backfillfull
>     pool 'cephfs.backup.data' is backfillfull
>
> root@ceph01:~# ceph osd df tree
> ID  CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -1         182.59897         -  191 TiB  149 TiB  149 TiB   35 GiB  503 GiB   42 TiB  77.96  1.00    -          root default
> -2          24.62473         -   29 TiB   22 TiB   22 TiB  5.0 GiB   80 GiB  7.1 TiB  75.23  0.96    -          host ceph01
>  0    hdd    2.3        1.0     2.7 TiB  2.2 TiB  2.2 TiB  665 MiB  8.0 GiB  480 GiB  82.43  1.06   53      up  osd.0
>  1    hdd    2.2        1.0     2.7 TiB  2.1 TiB  2.1 TiB  446 MiB  7.5 GiB  590 GiB  78.44  1.01   49      up  osd.1
>  4    hdd    2.67029    0.91066 2.7 TiB  2.2 TiB  2.2 TiB  484 MiB  7.9 GiB  440 GiB  83.90  1.08   53      up  osd.4
>  8    hdd    2.3        1.0     2.7 TiB  2.1 TiB  2.1 TiB  490 MiB  7.9 GiB  533 GiB  80.49  1.03   51      up  osd.8
> 11    hdd    1.71660    1.0     1.7 TiB  1.5 TiB  1.5 TiB  406 MiB  5.5 GiB  200 GiB  88.60  1.14   36      up  osd.11
> 12    hdd    1.2        1.0     2.7 TiB  1.2 TiB  1.2 TiB  366 MiB  4.9 GiB  1.5 TiB  43.89  0.56   28      up  osd.12
> 14    hdd    2.2        1.0     2.7 TiB  2.0 TiB  2.0 TiB  418 MiB  7.1 GiB  693 GiB  74.66  0.96   47      up  osd.14
> 18    hdd    2.2        1.0     2.7 TiB  2.0 TiB  1.9 TiB  434 MiB  7.3 GiB  737 GiB  73.05  0.94   47      up  osd.18
> 22    hdd    1.0        1.0     1.7 TiB  890 GiB  886 GiB  110 MiB  3.6 GiB  868 GiB  50.62  0.65   20      up  osd.22
> 30    hdd    1.5        1.0     1.7 TiB  1.4 TiB  1.3 TiB  361 MiB  4.9 GiB  370 GiB  78.93  1.01   32      up  osd.30
> 33    hdd    1.5        0.97437 1.6 TiB  1.4 TiB  1.4 TiB  397 MiB  5.4 GiB  213 GiB  87.20  1.12   34      up  osd.33
> 64    hdd    3.33789    0.89752 3.3 TiB  2.7 TiB
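`reweight-by-utilization 120` targets OSDs whose utilization exceeds 120% of the cluster average and lowers their override weights. A rough sketch of just the selection logic; the real mgr code also caps the per-step weight change and the number of OSDs touched, so this is only an illustration:

```python
def overloaded_osds(utilization: dict, threshold_pct: float = 120.0) -> list:
    """Return OSD ids whose %use exceeds threshold_pct percent of the
    mean utilization, most-overloaded first."""
    mean = sum(utilization.values()) / len(utilization)
    cutoff = mean * threshold_pct / 100.0
    over = [(osd, use) for osd, use in utilization.items() if use > cutoff]
    return [osd for osd, _ in sorted(over, key=lambda t: -t[1])]
```

Fed the %USE column from `ceph osd df tree`, this picks out the same hot OSDs that show up as backfillfull/nearfull above.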
[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?
Hi Joshua,

I'll dig into this output a bit more later, but here are my thoughts right now. I'll preface this by saying that I've never had to clean up from unrecoverable incomplete PGs, so some of what I suggest may not work/apply or be the ideal fix in your case.

Correct me if I'm wrong, but you are willing to throw away all of the data on this pool? That should make this easier, because we don't have to worry about recovering any lost data. If so, I think the general strategy would be:

1) Identify and remove any files/directories in cephfs that are located on this pool (based on ceph.file.layout.pool=claypool and ceph.dir.layout.pool=claypool). Use 'unlink' instead of 'rm' to remove the files; it should be less prone to hanging.

2) Wait a bit for ceph to clean up any unreferenced objects. Watch the output of 'ceph df' to see how many objects are listed for the pool.

3) Use 'rados -p claypool ls' to identify the remaining objects. Use the OID identifier to calculate the inode number of each file, then search cephfs to identify which files these belong to. I would expect there to be none, as you already deleted the files in step 1.

4) With nothing in the cephfs metadata referring to the objects anymore, it should be safe to remove them with 'rados -p claypool rm'.

5) Remove the now-empty pool from cephfs.

6) Remove the now-empty pool from ceph.

Can you also include the output of 'ceph df'?

--Mike

On 4/9/21 7:31 AM, Joshua West wrote:
> Thank you Mike! This is honestly a way more detailed reply than I was expecting. You've equipped me with new tools to work with. Thank you!
>
> I don't actually have any unfound pgs... only "incomplete" ones, which limits the usefulness of:
>
>   `grep recovery_unfound`
>   `ceph pg $pg list_unfound`
>   `ceph pg $pg mark_unfound_lost delete`
>
> I don't seem to see equivalent commands for incomplete pgs, save for grep of course. This does make me slightly more hopeful that recovery might be possible if the pgs are incomplete and stuck, but not unfound..? Not going to get my hopes too high.
>
> Going to attach a few items just to keep from bugging me; if anyone can take a glance, it would be appreciated.
>
> In the meantime, in the absence of the above commands, what's the best way to clean this up under the assumption that the data is lost?
>
> ~Joshua
>
> Joshua West
> President
> 403-456-0072
> CAYK.ca
>
> On Thu, Apr 8, 2021 at 6:15 PM Michael Thomas wrote:
>> Hi Joshua,
>>
>> I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost some OSDs. In all cases I ended up with degraded PGs with unfound objects that could not be recovered. Here's how I recovered from the situation. Note that this will permanently remove the affected files from ceph. Restoring them from backup is an exercise left to the reader.
>>
>> * Make a list of the affected PGs:
>>
>>   ceph pg dump_stuck | grep recovery_unfound > pg.txt
>>
>> * Make a list of the affected objects (OIDs):
>>
>>   cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > oid.txt
>>
>> * Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and put the results in a file called 'inum.txt'
>>
>> * On a ceph client, find the files that correspond to the affected inodes:
>>
>>   cat inum.txt | while read inum ; do echo -n "${inum} " ; find /ceph/frames/O3/raw -inum ${inum} ; done > files.txt
>>
>> * It may be helpful to put this table of PG, OID, inum, and files into a spreadsheet to keep track of what's been done.
>>
>> * On the ceph client, use 'unlink' to remove the files from the filesystem. Do not use 'rm', as it will hang while calling 'stat()' on each file. Even unlink may hang when you first try it. If it does hang, do the following to get it unstuck:
>>
>>   - Reboot the client
>>   - Restart each mon and the mgr. I rebooted each mon/mgr, but it may be sufficient to restart the services without a reboot.
>>   - Try using 'unlink' again
>>
>> * After all of the affected files have been removed, go through the list of PGs and remove the unfound OIDs:
>>
>>   ceph pg $pgid mark_unfound_lost delete
>>
>>   ...or if you're feeling brave, delete them all at once:
>>
>>   cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg mark_unfound_lost delete ; done
>>
>> * Watch the output of 'ceph -s' to see the health of the pools/pgs recover.
>>
>> * Restore the deleted files from backup, or decide that you don't care about them and don't do anything.
>>
>> This procedure lets you fix the problem without deleting the affected pool. To be honest, the first time it happened, my solution was to first copy all of the data off of the affected pool and onto a new pool. I later found this to be unnecessary. But if you want to pursue this, here's what I suggest:
>>
>> * Follow the steps above to get rid of the affected files. I feel this should still be done even though you don't care
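Step 1 of the strategy above (finding every file whose layout points at the doomed pool) can be scripted by walking the tree and reading the cephfs virtual xattr ceph.file.layout.pool. A sketch with the xattr reader injected so the logic can be exercised off-cluster; on a real cephfs mount you would pass `os.getxattr` (the default):

```python
import os
from typing import Callable, Iterator

def files_on_pool(root: str, pool: str,
                  getxattr: Callable[[str, str], bytes] = os.getxattr
                  ) -> Iterator[str]:
    """Yield paths under `root` whose ceph.file.layout.pool equals `pool`.
    `getxattr` is injectable for testing; on cephfs, os.getxattr works."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                value = getxattr(path, "ceph.file.layout.pool")
            except OSError:
                continue  # not on cephfs, or xattr unavailable
            if value.decode() == pool:
                yield path
```

The same idea with ceph.dir.layout.pool on directories catches layouts that would send *new* files to the pool.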
[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?
Hi Joshua,

I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost some OSDs. In all cases I ended up with degraded PGs with unfound objects that could not be recovered. Here's how I recovered from the situation. Note that this will permanently remove the affected files from ceph. Restoring them from backup is an exercise left to the reader.

* Make a list of the affected PGs:

  ceph pg dump_stuck | grep recovery_unfound > pg.txt

* Make a list of the affected objects (OIDs):

  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > oid.txt

* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and put the results in a file called 'inum.txt'

* On a ceph client, find the files that correspond to the affected inodes:

  cat inum.txt | while read inum ; do echo -n "${inum} " ; find /ceph/frames/O3/raw -inum ${inum} ; done > files.txt

* It may be helpful to put this table of PG, OID, inum, and files into a spreadsheet to keep track of what's been done.

* On the ceph client, use 'unlink' to remove the files from the filesystem. Do not use 'rm', as it will hang while calling 'stat()' on each file. Even unlink may hang when you first try it. If it does hang, do the following to get it unstuck:

  - Reboot the client
  - Restart each mon and the mgr. I rebooted each mon/mgr, but it may be sufficient to restart the services without a reboot.
  - Try using 'unlink' again

* After all of the affected files have been removed, go through the list of PGs and remove the unfound OIDs:

  ceph pg $pgid mark_unfound_lost delete

  ...or if you're feeling brave, delete them all at once:

  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg mark_unfound_lost delete ; done

* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.

* Restore the deleted files from backup, or decide that you don't care about them and don't do anything.

This procedure lets you fix the problem without deleting the affected pool. To be honest, the first time it happened, my solution was to first copy all of the data off of the affected pool and onto a new pool. I later found this to be unnecessary. But if you want to pursue this, here's what I suggest:

* Follow the steps above to get rid of the affected files. I feel this should still be done even though you don't care about saving the data, to prevent corruption in the cephfs metadata.

* Go through the entire filesystem and look for:

  - files that are located on the pool (ceph.file.layout.pool = $pool_name)
  - directories that are set to write files to the pool (ceph.dir.layout.pool = $pool_name)

* After you confirm that no files or directories are pointing at the pool anymore, run 'ceph df' and look at the number of objects in the pool. Ideally, it would be zero. But more than likely it isn't. This could be a simple mismatch in the object count in cephfs (harmless), or there could be clients with open filehandles on files that have been removed. Such objects will still appear in the rados listing of the pool[1]:

  rados -p $pool_name ls

  for obj in $(rados -p $pool_name ls); do echo $obj; rados -p $pool_name getxattr $obj parent | strings; done

* To check for clients with access to these stray objects, dump the mds cache:

  ceph daemon mds.ceph1 dump cache /tmp/cache.txt

* Look for lines that refer to the stray objects, like this:

  [inode 0x1020fbc [2,head] ~mds0/stray6/1020fbc auth v7440537 s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863 1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 | caps=1 authpin=0 0x563a7e52a000]

* The 'caps' field in the output above contains the client session id (eg 9541437). Search the MDS for sessions that match to identify the client:

  ceph daemon mds.ceph1 session ls > session.txt

  Search through 'session.txt' for matching entries. This will give you the IP address of the client:

  "id": 9541437,
  "entity": {
      "name": {
          "type": "client",
          "num": 9541437
      },
      "addr": {
          "type": "v1",
          "addr": "10.13.5.48:0",
          "nonce": 2011077845
      }
  },

* Restart the client's connection to ceph to get it to drop the cap. I did this by rebooting the client, but there may be gentler ways to do it.

* Once you've done this cleanup, it should be safe to remove the pool from cephfs:

  ceph fs rm_data_pool $fs_name $pool_name

* Once the pool has been detached from cephfs, you can remove it from ceph altogether:

  ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it

Hope this helps,

--Mike

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html

On 4/8/21 5:41 PM,
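Matching stray inodes to client sessions can be scripted: pull the session ids out of the `caps={...}` field of the `dump cache` inode line, then look those ids up in `session ls`. A sketch over the line format shown above (the regex is a best guess at that format, not a stable API):

```python
import re

CAPS_RE = re.compile(r'caps=\{([^}]*)\}')

def cache_line_clients(line: str) -> list:
    """Extract client session ids from the caps={...} field of an
    `mds dump cache` inode line."""
    m = CAPS_RE.search(line)
    if not m:
        return []
    return [int(entry.split('=')[0]) for entry in m.group(1).split(',') if entry]
```

Run over /tmp/cache.txt, this gives the list of session ids to search for in session.txt.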
[ceph-users] Re: Removing secondary data pool from mds
Hi Frank,

I finally got around to removing the data pool. It went without a hitch.

Ironically, about a week before I got around to removing the pool, I suffered the same problem as before, except this time it wasn't a power glitch that took out the OSDs; it was my own careless self, who decided to reboot too many OSD hosts at the same time. Multiple OSDs went down while I was copying a lot of data into ceph. As before, this left a bunch of corrupted files that caused stat() and unlink() to hang.

I recovered the same way as before: by removing the files from the filesystem, then removing the lost objects from the PGs. Unlike last time, I did not try to copy the good files into a new pool. Fortunately, this cleanup process worked fine.

For those watching from home, here are the steps I took to clean up:

* Restart all mons (I rebooted all of them, but it may have been enough to simply restart the mds). Reboot the client that is experiencing the hang. This didn't fix the problem with stat() hanging, but did allow unlink() (and /usr/bin/unlink) to remove the files without hanging. I'm not sure which of these steps is the necessary one, as I did all of them before I was able to proceed.

* Make a list of the affected PGs:

  ceph pg dump_stuck | grep recovery_unfound > pg.txt

* Make a list of the affected OIDs:

  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > oid.txt

* Convert the OID numbers to inodes:

  cat oid.txt | awk '{print $2}' | sed -e 's/\..*//' | while read oid ; do printf "%d\n" 0x${oid} ; done > inum.txt

* Find the filenames corresponding to the affected inodes (requires the /ceph filesystem to be mounted):

  cat inum.txt | while read inum ; do echo -n "${inum} " ; find /ceph/frames/O3/raw -inum ${inum} ; done > files.txt

* Call /usr/bin/unlink on each of the files in files.txt. Don't use /usr/bin/rm, as it will hang when calling stat() before unlink().

* Remove the unfound objects:

  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg $pg mark_unfound_lost delete ; done

* Watch the output of 'ceph -s' to see the cluster become healthy again.

--Mike

On 2/12/21 4:55 PM, Frank Schilder wrote:
> Hi Michael,
>
> I also think it would be safe to delete. The object count might be an incorrect reference count of lost objects that didn't get decremented. This might be fixed by running a deep scrub over all PGs in that pool. I don't know rados well enough to find out where such an object count comes from. However, ceph df is known to be imperfect. Maybe it's just an accounting bug there. I think there were a couple of cases where people deleted all objects in a pool and ceph df would still report non-zero usage.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: Michael Thomas
> Sent: 12 February 2021 22:35:25
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Removing secondary data pool from mds
>
> Hi Frank,
>
> We're not using snapshots.
>
> I was able to run:
>
>   ceph daemon mds.ceph1 dump cache /tmp/cache.txt
>
> ...and scan for the stray object to find the cap id that was accessing the object. I matched this with the entity name in:
>
>   ceph daemon mds.ceph1 session ls
>
> ...to determine the client host. The strays went away after I rebooted the offending client.
>
> With all access to the objects now cleared, I ran:
>
>   ceph pg X.Y mark_unfound_lost delete
>
> ...on any remaining rados objects. At this point (at long last) the pool was able to return to the 'HEALTHY' status.
>
> However, there is one remaining bit that I don't understand. 'ceph df' returns 355 objects for the pool (fs.data.archive.frames): https://pastebin.com/vbZLhQmC
>
> ...but 'rados -p fs.data.archive.frames ls --all' returns no objects. So I'm not sure what these 355 objects were. Because of that, I haven't removed the pool from cephfs quite yet, even though I think it would be safe to do so.
>
> --Mike
>
> On 2/10/21 4:20 PM, Frank Schilder wrote:
>> Hi Michael,
>>
>> out of curiosity, did the pool go away or did it put up a fight?
>>
>> I don't remember exactly, it's a long time ago, but I believe stray objects on fs pools come from files still in snapshots that were deleted on the fs level. Such files are moved to special stray pools until the snapshot containing them is deleted as well. Not sure if this applies here though; there might be other occasions when objects go to stray.
>>
>> I updated the case concerning the underlying problem, but not too much progress either: https://tracker.ceph.com/issues/46847#change-184710 . I had PG degradation even using the recovery technique with before- and after-crush-maps. I was just lucky that I lost only 1 shard per object and ordinary recovery could fix it.
>>
>> Best regards,
>> =
>> Frank Schilder
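The pg.txt step in the recipe above can also be done without awk: pull the pg ids whose state contains recovery_unfound out of the plain-text `ceph pg dump_stuck` output. A sketch that assumes the usual column layout (pg id first, state second):

```python
def unfound_pgs(dump_stuck_output: str) -> list:
    """Return pg ids whose state column contains 'recovery_unfound',
    given `ceph pg dump_stuck` plain-text output."""
    pgs = []
    for line in dump_stuck_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "recovery_unfound" in fields[1]:
            pgs.append(fields[0])
    return pgs
```

The resulting list is what then gets fed to `ceph pg <pgid> list_unfound` and, eventually, `mark_unfound_lost delete`.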
[ceph-users] Re: Removing secondary data pool from mds
Hi Frank, We're not using snapshots. I was able to run: ceph daemon mds.ceph1 dump cache /tmp/cache.txt ...and scan for the stray object to find the cap id that was accessing the object. I matched this with the entity name in: ceph daemon mds.ceph1 session ls ...to determine the client host. The strays went away after I rebooted the offending client. With all access to the objects now cleared, I ran: ceph pg X.Y mark_unfound_lost delete ...on any remaining rados objects. At this point (at long last) the pool was able to return to the 'HEALTHY' status. However, there is one remaining bit that I don't understand. 'ceph df' returns 355 objects for the pool (fs.data.archive.frames): https://pastebin.com/vbZLhQmC ...but 'rados -p fs.data.archive.frames ls --all' returns no objects. So I'm not sure what these 355 objects were. Because of that, I haven't removed the pool from cephfs quite yet, even though I think it would be safe to do so. --Mike On 2/10/21 4:20 PM, Frank Schilder wrote: Hi Michael, out of curiosity, did the pool go away or did it put up a fight? I don't remember exactly, its a long time ago, but I believe stray objects on fs pools come from files still in snapshots but were deleted on the fs level. Such files are moved to special stray pools until the snapshot containing them is deleted as well. Not sure if this applies here though, there might be other occasions when objects go to stray. I updated the case concerning the underlying problem, but not too much progress either: https://tracker.ceph.com/issues/46847#change-184710 . I had PG degradation even using the recovery technique with before- and after crush maps. I was just lucky that I lost only 1 shard per object and ordinary recovery could fix it. 
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Michael Thomas
Sent: 21 December 2020 23:12:09
To: ceph-users@ceph.io
Subject: [ceph-users] Removing secondary data pool from mds

I have a cephfs secondary (non-root) data pool with unfound and degraded objects that I have not been able to recover[1]. I created an additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a very long rsync to move the files off of the degraded pool and onto the new pool. This has completed, and using find + 'getfattr -n ceph.file.layout.pool', I verified that no files are using the old pool anymore. No ceph.dir.layout.pool attributes point to the old pool either.

However, the old pool still reports that there are objects in the old pool, likely the same ones that were unfound/degraded from before: https://pastebin.com/qzVA7eZr

Based on an old message from the mailing list[2], I checked the MDS for stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray file.txt) and found 36 stray entries in the cache: https://pastebin.com/MHkpw3DV. However, I'm not certain how to map these stray cache objects to clients that may be accessing them.

'rados -p fs.data.archive.frames ls' shows 145 objects. Looking at the parent of each object shows 2 strays:

   for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p fs.data.archive.frames getxattr $obj parent | strings ; done
   [...]
   1020fa1.
   1020fa1
   stray6
   1020fbc.
   1020fbc
   stray6
   [...]

...before getting stuck on one object for over 5 minutes (then I gave up): 105b1af.0083

What can I do to make sure this pool is ready to be safely deleted from cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?
--Mike

[1]https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF
[2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html
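The stray-hunting procedure above condenses into a short shell recipe. The mds name (mds.ceph1), the placeholder PG id (X.Y), and the sample cache-dump line are illustrative, not real cluster output; the ceph commands need a live cluster, so they are left as comments and only the client-id extraction actually runs:

```shell
# Steps from the message, as comments (require a running cluster):
#   ceph daemon mds.ceph1 dump cache /tmp/cache.txt   # dump the MDS cache to a file
#   grep -i stray /tmp/cache.txt                      # stray entries carry client cap ids
#   ceph daemon mds.ceph1 session ls                  # match cap/client id -> client host
#   ceph pg X.Y mark_unfound_lost delete              # once no client holds the objects
# The one scriptable piece is pulling client ids out of a cache-dump line;
# this sample line is made up to resemble real output:
sample='[inode 0x1020fa1 ... caps={client.7524484=pAsLsXsFsc@1} ... stray6]'
echo "$sample" | grep -o 'client\.[0-9]*' | sort -u
```

On a real cluster, each extracted client id can then be matched against the 'session ls' output to find which host to evict or reboot.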
[ceph-users] Removing secondary data pool from mds
I have a cephfs secondary (non-root) data pool with unfound and degraded objects that I have not been able to recover[1]. I created an additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a very long rsync to move the files off of the degraded pool and onto the new pool. This has completed, and using find + 'getfattr -n ceph.file.layout.pool', I verified that no files are using the old pool anymore. No ceph.dir.layout.pool attributes point to the old pool either.

However, the old pool still reports that there are objects in the old pool, likely the same ones that were unfound/degraded from before: https://pastebin.com/qzVA7eZr

Based on an old message from the mailing list[2], I checked the MDS for stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray file.txt) and found 36 stray entries in the cache: https://pastebin.com/MHkpw3DV. However, I'm not certain how to map these stray cache objects to clients that may be accessing them.

'rados -p fs.data.archive.frames ls' shows 145 objects. Looking at the parent of each object shows 2 strays:

   for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p fs.data.archive.frames getxattr $obj parent | strings ; done
   [...]
   1020fa1.
   1020fa1
   stray6
   1020fbc.
   1020fbc
   stray6
   [...]

...before getting stuck on one object for over 5 minutes (then I gave up): 105b1af.0083

What can I do to make sure this pool is ready to be safely deleted from cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?

--Mike

[1]https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF
[2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html
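A hedged sketch of the layout-based migration described above. Pool and directory names follow the message; the setfattr/rsync/getfattr steps need a mounted CephFS, so they are shown as comments, and only the final summary step runs here on sample output:

```shell
# Point a new directory tree at the new pool, copy, then verify
# (needs a mounted CephFS; shown as comments):
#   setfattr -n ceph.dir.layout.pool -v fs.data.archive.newframes /ceph/frames.new
#   rsync -a /ceph/frames/ /ceph/frames.new/
#   find /ceph -type f \
#     -exec getfattr -n ceph.file.layout.pool --only-values {} \; -exec echo \; \
#     | sort | uniq -c      # no line should name the old pool
# (--only-values prints no trailing newline, hence the extra -exec echo.)
# Summarising sample getfattr output the same way:
printf 'fs.data.archive.newframes\nfs.data.archive.newframes\n' | sort | uniq -c
```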
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank,

I was able to migrate the data off of the "broken" pool (fs.data.archive.frames) and onto the new one (fs.data.archive.newframes). I verified that no useful data is left on the "broken" pool:

* 'find + getfattr -n ceph.file.layout.pool' shows no files on the bad pool
* 'find + getfattr -n ceph.dir.layout.pool' shows no future files will land on the bad pool
* 'ceph -s' shows some misplaced/degraded/unfound objects on the bad pool:

   data:
     pools:   14 pools, 3492 pgs
     objects: 111.94M objects, 425 TiB
     usage:   587 TiB used, 525 TiB / 1.1 PiB avail
     pgs:     68/893408279 objects degraded (0.000%)
              35/893408279 objects misplaced (0.000%)
              24/111943463 objects unfound (0.000%)
              3480 active+clean
              5    active+recovery_unfound+degraded+remapped
              4    active+clean+scrubbing+deep
              2    active+recovery_unfound+undersized+degraded+remapped
              1    active+recovery_unfound+degraded

* 'rados ls --pool fs.data.archive.frames' shows these orphaned objects. I extracted the first component of the rados object names (eg 1020fa1.0030) and ran 'find /ceph -inum XXX' to verify that none of these objects maps back to a known file in the cephfs filesystem.

Here are the next steps that I plan to perform:

* 'rados rm --pool fs.data.archive.frames ' on a couple of objects to see how ceph handles it.
* 'rados purge fs.data.archive.frames' to purge all objects in the "broken" pool
* ceph fs rm_data_pool fs.data.archive.frames

Is there anything else you think I ought to check before finalizing the removal of this broken pool?

--Mike

On 11/22/20 1:59 PM, Frank Schilder wrote:
Dear Michael, yes, your plan will work if the temporary space requirement can be addressed. Good luck!
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

________
From: Michael Thomas
Sent: 22 November 2020 20:14:09
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Hi Frank,

From my understanding, with my current filesystem layout, I should be able to remove the "broken" pool once the data has been moved off of it. This is because the "broken" pool is not the default data pool. According to the documentation[1]:

fs rm_data_pool "This command removes the specified pool from the list of data pools for the file system. If any files have layouts for the removed data pool, the file data will become unavailable. The default data pool (when creating the file system) cannot be removed."

My default data pool (triply replicated on SSD) is still healthy. The "broken" pool is EC on HDD, and while it holds a majority of the filesystem data (~400TB), it is not the root of the filesystem. My plan would be:

* Create a new data pool matching the "broken" pool
* Create a parallel directory tree matching the directories that are mapped to the "broken" pool. eg Broken: /ceph/frames/..., New: /ceph/frames.new/...
* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree to map the content to the new data pool
* Use parallel+rsync to copy data from the broken pool to the new pool.
* After each directory gets filled in the new pool, mv/rename the old and new directories so that users start accessing the data from the new pool.
* Delete data from the renamed old pool directories as they are replaced, to keep the OSDs from filling up
* After all data is moved off of the old pool (verified by checking ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs, as well as rados ls, ceph df), remove the pool from the fs.
This is effectively the same strategy I did when moving frequently accessed directories from the EC pool to a replicated SSD pool, except that in the previous situation I didn't need to remove any pools at the end. It's time consuming, because every file on the "broken" pool needs to be copied, but it minimizes downtime. Being able to add some temporary new OSDs to the new pool (but not the "broken" pool) would reduce some pressure of filling up the OSDs. If the old and new pools use the same crush rule, would disabling backfilling+rebalancing keep the OSDs from being used in the old pool until the old pool is deleted (with the exception of the occasional new file)? --Mike [1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems On 11/22/20 12:19 PM, Frank Schilder wrote: Dear Michael, I was also wondering whether deleting the broken pool could clean up everything. The difficulty is, that while migrating a pool to new devices is easy via a crush rule change, migrating data between pools is not so easy. In particular, if you can't afford downtime. In case you can afford some downtime, it might be possible to migrate fas
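The 'find -inum' orphan check mentioned earlier in this thread can be scripted: cephfs rados object names have the form <hex inode>.<offset>, so the hex prefix converts to the decimal inode number that find expects. Pool name and mount point are the ones from the thread; the cluster-touching loop is a comment, and only the conversion (bash arithmetic) runs here:

```shell
# Orphan check (needs a live cluster and CephFS mount; comments only):
#   rados ls --pool fs.data.archive.frames > rados.ls.txt
#   while read -r obj; do
#     ino=$(( 16#${obj%%.*} ))       # hex inode prefix -> decimal
#     find /ceph -inum "$ino"        # no output => nothing maps to this object
#   done < rados.ls.txt
# The conversion itself is plain bash arithmetic:
obj='1020fa1.0030'
echo $(( 16#${obj%%.*} ))
```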
[ceph-users] Re: Whether removing device_health_metrics pool is ok or not
On 12/3/20 6:47 PM, Satoru Takeuchi wrote:
Hi, Could you tell me whether it's ok to remove the device_health_metrics pool after disabling the device monitoring feature? I don't use the device monitoring feature because I capture hardware information another way. However, after disabling this feature, the device_health_metrics pool still exists. I don't want to risk a HEALTH_WARN caused by problems in PGs of this pool. As a result of reading the source code of the device monitoring module, it seems to be safe to remove this pool. Is my understanding correct?

On my Octopus cluster I was running with a broken device_health_metrics pool (due to PG issues) for over a month with no other obvious ill effects. I finally removed and recreated it to make ceph stop complaining about it. Since then I have not seen any data written to the pool. I know this doesn't answer your question directly, but in my case removing the pool temporarily did not cause any harm.

--Mike
[ceph-users] Prometheus monitoring
I am gathering prometheus metrics from my (unhealthy) Octopus (15.2.4) cluster and notice a discrepancy (or misunderstanding) with the ceph dashboard. In the dashboard, and with ceph -s, it reports 807 million objects:

   pgs: 169747/807333195 objects degraded (0.021%)
        78570293/807333195 objects misplaced (9.732%)
        24/101158245 objects unfound (0.000%)

But in the prometheus metrics (and in ceph df), it reports almost a factor of 10 fewer objects (dominated by pool 7):

   # HELP ceph_pool_objects DF pool objects
   # TYPE ceph_pool_objects gauge
   ceph_pool_objects{pool_id="4"} 3920.0
   ceph_pool_objects{pool_id="5"} 372743.0
   ceph_pool_objects{pool_id="7"} 86972464.0
   ceph_pool_objects{pool_id="8"} 9287431.0
   ceph_pool_objects{pool_id="13"} 8961.0
   ceph_pool_objects{pool_id="15"} 0.0
   ceph_pool_objects{pool_id="17"} 4.0
   ceph_pool_objects{pool_id="18"} 206.0
   ceph_pool_objects{pool_id="19"} 8.0
   ceph_pool_objects{pool_id="20"} 7.0
   ceph_pool_objects{pool_id="21"} 22.0
   ceph_pool_objects{pool_id="22"} 203.0
   ceph_pool_objects{pool_id="23"} 4415522.0

Why are these two values different? How can I get the total number of objects (807 million) from the prometheus metrics?

--Mike
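The factor-of-ten gap is most likely counting semantics: the denominator in the 'x/807333195 objects degraded' lines counts object copies (replicas/EC shards), while ceph_pool_objects, like ceph df, counts logical objects (note the unfound denominator 101158245 roughly matches the prometheus sum). This is an inference from the numbers shown, not something confirmed in the thread. The logical total is sum(ceph_pool_objects) in PromQL; the same aggregation over the text exposition, with sample values from the message, looks like:

```shell
# Sum ceph_pool_objects over all pools (sample values from the message):
cat <<'EOF' | awk '/^ceph_pool_objects/ {sum += $2} END {print sum}'
ceph_pool_objects{pool_id="4"} 3920.0
ceph_pool_objects{pool_id="7"} 86972464.0
ceph_pool_objects{pool_id="8"} 9287431.0
EOF
```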
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank,

From my understanding, with my current filesystem layout, I should be able to remove the "broken" pool once the data has been moved off of it. This is because the "broken" pool is not the default data pool. According to the documentation[1]:

fs rm_data_pool "This command removes the specified pool from the list of data pools for the file system. If any files have layouts for the removed data pool, the file data will become unavailable. The default data pool (when creating the file system) cannot be removed."

My default data pool (triply replicated on SSD) is still healthy. The "broken" pool is EC on HDD, and while it holds a majority of the filesystem data (~400TB), it is not the root of the filesystem. My plan would be:

* Create a new data pool matching the "broken" pool
* Create a parallel directory tree matching the directories that are mapped to the "broken" pool. eg Broken: /ceph/frames/..., New: /ceph/frames.new/...
* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree to map the content to the new data pool
* Use parallel+rsync to copy data from the broken pool to the new pool.
* After each directory gets filled in the new pool, mv/rename the old and new directories so that users start accessing the data from the new pool.
* Delete data from the renamed old pool directories as they are replaced, to keep the OSDs from filling up
* After all data is moved off of the old pool (verified by checking ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs, as well as rados ls, ceph df), remove the pool from the fs.

This is effectively the same strategy I did when moving frequently accessed directories from the EC pool to a replicated SSD pool, except that in the previous situation I didn't need to remove any pools at the end. It's time consuming, because every file on the "broken" pool needs to be copied, but it minimizes downtime.
Being able to add some temporary new OSDs to the new pool (but not the "broken" pool) would reduce some of the pressure of filling up the OSDs. If the old and new pools use the same crush rule, would disabling backfilling+rebalancing keep the OSDs from being used in the old pool until the old pool is deleted (with the exception of the occasional new file)?

--Mike

[1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems

On 11/22/20 12:19 PM, Frank Schilder wrote:
Dear Michael, I was also wondering whether deleting the broken pool could clean up everything. The difficulty is that, while migrating a pool to new devices is easy via a crush rule change, migrating data between pools is not so easy, in particular if you can't afford downtime.

In case you can afford some downtime, it might be possible to migrate fast by creating a new pool and using the pool copy command to migrate the data (rados cppool ...). It's important that the FS is shut down (no MDS active) during this copy process. After the copy, one could either rename the pools to have the copy match the fs data pool name, or change the data pool at the top level directory. You might need to set some pool meta data by hand, notably the fs tag. Having said that, I have no idea how a ceph fs reacts if presented with a replacement data pool. Although I don't believe that meta data contains the pool IDs, I cannot exclude that complication. The copy pool variant should be tested with an isolated FS first.

The other option is what you describe: create a new data pool, make the fs root placed on this pool and copy every file onto itself. This should also do the trick. However, with this method you will not be able to get rid of the broken pool. After the copy, you could, however, reduce the number of PGs to below the unhealthy one and the broken PG(s) might get deleted cleanly. Then you still have a surplus pool, but at least all PGs are clean. I hope one of these will work. Please post your experience here.
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 22 November 2020 18:29:16 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/23/20 3:07 AM, Frank Schilder wrote: Hi Michael. I still don't see any traffic to the pool, though I'm also unsure how much traffic is to be expected. Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted. That osdmaptool crashes indicates that your cluster runs with corrupted internal data. I tested your crush map and you should get complete PGs for the fs data pool. That you don't and that osdmaptool crashes points at a corruption of internal data. I'm afraid this is the point where you need support from ceph developers and should file a tracker report (https://tracker.cep
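On the question raised earlier in this thread about disabling backfilling+rebalancing while draining the old pool: the usual knobs are the cluster-wide norebalance and nobackfill flags. Note they pause data movement for all pools, including legitimate recovery, so this is a sketch of the idea rather than a recommendation; the loop below only prints the commands instead of running them:

```shell
# Pause rebalancing/backfill while emptying the old pool, then re-enable:
#   ceph osd set norebalance
#   ceph osd set nobackfill
#   ... drain and delete the old pool ...
#   ceph osd unset nobackfill
#   ceph osd unset norebalance
# Dry-run print of the 'set' half:
for f in norebalance nobackfill; do echo "ceph osd set $f"; done
```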
[ceph-users] Re: multiple OSD crash, unfound objects
On 10/23/20 3:07 AM, Frank Schilder wrote: Hi Michael. I still don't see any traffic to the pool, though I'm also unsure how much traffic is to be expected. Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted. That osdmaptool crashes indicates that your cluster runs with corrupted internal data. I tested your crush map and you should get complete PGs for the fs data pool. That you don't and that osdmaptool crashes points at a corruption of internal data. I'm afraid this is the point where you need support from ceph developers and should file a tracker report (https://tracker.ceph.com/projects/ceph/issues). A short description of the origin of the situation with the osdmaptool output and a reference to this thread linked in should be sufficient. Please post a link to the ticket here. https://tracker.ceph.com/issues/48059 In parallel, you should probably open a new thread focussed on the osd map corruption. Maybe there are low-level commands to repair it. Will do. You should wait with trying to clean up the unfound objects until this is resolved. Not sure about adding further storage either. To me, this sounds quite serious. Another approach that I'm considering is to create a new pool using the same set of OSDs, adding it to the set of cephfs data pools, and migrating the data from the "broken" pool to the new pool. I have some additional unused storage that I could add to this new pool, if I can figure out the right crush rules to make sure they don't get used for the "broken" pool too. --Mike ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: safest way to re-crush a pool
Yes, of course this works. For some reason I recall having trouble when I tried this on my first ceph install, but I think in that case I had changed the device classes without changing the crush tree. In any case, the re-crush worked fine.

--Mike

On 11/10/20 4:20 PM, dhils...@performair.com wrote:
Michael; I run a Nautilus cluster, but all I had to do was change the rule associated with the pool, and ceph moved the data. Thank you, Dominic L. Hilsbos, MBA Director - Information Technology Perform Air International Inc. dhils...@performair.com www.PerformAir.com

-Original Message-
From: Michael Thomas [mailto:w...@caltech.edu]
Sent: Tuesday, November 10, 2020 1:32 PM
To: ceph-users@ceph.io
Subject: [ceph-users] safest way to re-crush a pool

I'm setting up a radosgw for my ceph Octopus cluster. As soon as I started the radosgw service, I noticed that it created a handful of new pools. These pools were assigned the 'replicated_data' crush rule automatically. I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush rule spans all device types. I would like radosgw to use a replicated SSD pool and avoid the HDDs. What is the recommended way to change the crush device class for these pools without risking the loss of any data in the pools? I will note that I have not yet written any user data to the pools. Everything in them was added by the radosgw process automatically.

--Mike
[ceph-users] safest way to re-crush a pool
I'm setting up a radosgw for my ceph Octopus cluster. As soon as I started the radosgw service, I noticed that it created a handful of new pools. These pools were assigned the 'replicated_data' crush rule automatically. I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush rule spans all device types. I would like radosgw to use a replicated SSD pool and avoid the HDDs. What is the recommended way to change the crush device class for these pools without risking the loss of any data in the pools? I will note that I have not yet written any user data to the pools. Everything in them was added by the radosgw process automatically.

--Mike
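A sketch of the rule change suggested in the reply: create a replicated rule pinned to the ssd device class, then point each rgw pool at it and let ceph move the data. The rule name and pool list are illustrative (actual rgw pool names vary by zone); the loop only prints the commands rather than running them:

```shell
# Create an ssd-only replicated rule, then retarget the rgw pools:
#   ceph osd crush rule create-replicated replicated_ssd default host ssd
#   ceph osd pool set <pool> crush_rule replicated_ssd   # per pool; data migrates automatically
# Dry-run print over a hypothetical pool list:
for pool in .rgw.root default.rgw.log default.rgw.control default.rgw.meta; do
  echo "ceph osd pool set $pool crush_rule replicated_ssd"
done
```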
[ceph-users] Re: multiple OSD crash, unfound objects
On 10/22/20 3:22 AM, Frank Schilder wrote: Could you also execute (and post the output of) # osdmaptool osd.map --test-map-pgs-dump --pool 7 osdmaptool dumped core. Here is stdout: https://pastebin.com/HPtSqcS1 The PG map for 7.39d matches the pg dump, with the expected difference of 2147483647 -> NONE. ...and here is stderr: https://pastebin.com/CrtwE54r Regards, --Mike with the osd map you pulled out (pool 7 should be the fs data pool)? Please check what mapping is reported for PG 7.39d? Just checking if osd map and pg dump agree here. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 22 October 2020 09:32:07 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Sounds good. Did you re-create the pool again? If not, please do to give the devicehealth manager module its storage. In case you can't see any IO, it might be necessary to restart the MGR to flush out a stale rados connection. I would probably give the pool 10 PGs instead of 1, but that's up to you. I hope I find time today to look at the incomplete PG. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 21 October 2020 22:58:47 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/21/20 6:47 AM, Frank Schilder wrote: Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign, the crush rule is OK. That pg query says it doesn't have PG 1.0 points in the right direction. There is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 are: # osdmaptool osd.map --test-map-pgs-dump --pool 1 https://pastebin.com/seh6gb7R As I suspected, it thinks that OSDs 0, 41 are the acting set. 
or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm. In principle, it looks like the idea to delete and recreate the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes and some surgery on some OSDs was necessary. However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly.

That is correct. The original OSDs 0 and 41 were removed and redeployed on new disks.

In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is # ceph daemon mon.MON-ID sessions that needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module. If you google "device_health_metric pool" you should find descriptions of similar cases. It looks solvable.

Unfortunately, in Octopus you cannot disable the devicehealth manager module, and the manager is required for operation. So I just went ahead and removed the pool with everything still running. Fortunately, this did not appear to cause any problems, and the single unknown PG has disappeared from the ceph health output.

I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.
The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :). I found the fs client and rebooted it. The MDS still reports the slow OPs, but according to the mds logs the offending ops were established before the client was rebooted, and the offending client session (now defunct) has been blacklisted. I'll check back later to see if the slow OPS get cleared from 'ceph status'. Regards, --Mike ____ From: Michael Thomas Sent: 20 October 2020 23:48:36 To: F
[ceph-users] Re: multiple OSD crash, unfound objects
Done. I gave it 4 PGs (I read somewhere that PG counts should be powers of 2), and restarted the mgr. I still don't see any traffic to the pool, though I'm also unsure how much traffic is to be expected.

--Mike

On 10/22/20 2:32 AM, Frank Schilder wrote:
Sounds good. Did you re-create the pool again? If not, please do, to give the devicehealth manager module its storage. In case you can't see any IO, it might be necessary to restart the MGR to flush out a stale rados connection. I would probably give the pool 10 PGs instead of 1, but that's up to you. I hope I find time today to look at the incomplete PG. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Michael Thomas
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign, the crush rule is OK. That pg query says it doesn't have PG 1.0 points in the right direction. There is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 are: # osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm. In principle, it looks like the idea to delete and recreate the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes and some surgery on some OSDs was necessary.
However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly. That is correct. The original OSDs 0 and 41 were removed and redeployed on new disks. In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is # ceph daemon mon.MON-ID sessions that needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module. If you google "device_health_metric pool" you should find descriptions of similar cases. It looks solvable. Unfortunately, in Octopus you can not disable the devicehealth manager module, and the manager is required for operation. So I just went ahead and removed the pool with everything still running. Fortunately, this did not appear to cause any problems, and the single unknown PG has disappeared from the ceph health output. I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg query didn't complain :) I have OSDs ready to add to the pool, in case you think we should try. The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :). I found the fs client and rebooted it. 
The MDS still reports the slow OPs, but according to the mds logs the offending ops were established before the client was rebooted, and the offending client session (now defunct) has been blacklisted. I'll check back later to see if the slow OPS get cleared from 'ceph status'. Regards, --Mike ____ From: Michael Thomas Sent: 20 October 2020 23:48:36 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/20/20 1:18 PM, Frank Schilder wrote: Dear Michael, Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? I meant here with crush rule replicated_host_nvme. Sorry, forgot. Seems to have worked fine: https://pastebin.com/PFgDE4J1 Yes, the OSD was still out when the previous health report was created. Hmm, this is odd. If this is correct, then it did report a slow op even though it was out of the cluster: from https://pastebin.co
[ceph-users] Re: multiple OSD crash, unfound objects
On 10/21/20 6:47 AM, Frank Schilder wrote: Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign, the crush rule is OK. That pg query says it doesn't have PG 1.0 points in the right direction. There is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 are: # osdmaptool osd.map --test-map-pgs-dump --pool 1 https://pastebin.com/seh6gb7R As I suspected, it thinks that OSDs 0, 41 are the acting set. or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm. In principle, it looks like the idea to delete and recreate the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes and some surgery on some OSDs was necessary. However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly. That is correct. The original OSDs 0 and 41 were removed and redeployed on new disks. In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is # ceph daemon mon.MON-ID sessions that needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module.
If you google "device_health_metric pool" you should find descriptions of similar cases. It looks solvable. Unfortunately, in Octopus you can not disable the devicehealth manager module, and the manager is required for operation. So I just went ahead and removed the pool with everything still running. Fortunately, this did not appear to cause any problems, and the single unknown PG has disappeared from the ceph health output. I will look at the incomplete PG issue. I hope this is just some PG tuning. At least pg query didn't complain :) I have OSDs ready to add to the pool, in case you think we should try. The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :). I found the fs client and rebooted it. The MDS still reports the slow OPs, but according to the mds logs the offending ops were established before the client was rebooted, and the offending client session (now defunct) has been blacklisted. I'll check back later to see if the slow OPS get cleared from 'ceph status'. Regards, --Mike ____ From: Michael Thomas Sent: 20 October 2020 23:48:36 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/20/20 1:18 PM, Frank Schilder wrote: Dear Michael, Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? I meant here with crush rule replicated_host_nvme. Sorry, forgot. Seems to have worked fine: https://pastebin.com/PFgDE4J1 Yes, the OSD was still out when the previous health report was created. Hmm, this is odd. 
If this is correct, then it did report a slow op even though it was out of the cluster: from https://pastebin.com/3G3ij9ui: [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have slow ops. Not sure what to make of that. It looks almost like you have a ghost osd.41. I think (some of) the slow ops you are seeing are directed to the health_metrics pool and can be ignored. If it is too annoying, you could try to find out who runs the client with IDs client.7524484 and disable it. Might be an MGR module. I'm also pretty certain that the slow ops are related to the health metrics pool, which is why I've been ignoring them. What I'm not sure about is whether re-creating the device_health_metrics pool will cause any problems in the ceph cluster. Looking at the data you provided and also some older threads of yours (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I start considering that we are looking at the fall-out of a past admin
[ceph-users] Re: multiple OSD crash, unfound objects
On 10/20/20 1:18 PM, Frank Schilder wrote: Dear Michael, Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? I meant here with crush rule replicated_host_nvme. Sorry, forgot. Seems to have worked fine: https://pastebin.com/PFgDE4J1 Yes, the OSD was still out when the previous health report was created. Hmm, this is odd. If this is correct, then it did report a slow op even though it was out of the cluster: from https://pastebin.com/3G3ij9ui: [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons [osd.0,osd.41] have slow ops. Not sure what to make of that. It looks almost like you have a ghost osd.41. I think (some of) the slow ops you are seeing are directed to the health_metrics pool and can be ignored. If it is too annoying, you could try to find out who runs the client with ID client.7524484 and disable it. Might be an MGR module. I'm also pretty certain that the slow ops are related to the health metrics pool, which is why I've been ignoring them. What I'm not sure about is whether re-creating the device_health_metrics pool will cause any problems in the ceph cluster. Looking at the data you provided and also some older threads of yours (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I start considering that we are looking at the fall-out of a past admin operation. A possibility is that an upmap for PG 1.0 exists that conflicts with the crush rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. The result is an empty set. So far I've been unable to locate the client with the ID 7524484. It's not showing up in the manager dashboard -> Filesystems page, nor in the output of 'ceph tell mds.ceph1 client ls'. I'm digging through the compressed logs for the past week to see if I can find the culprit. I couldn't really find a simple command to list up-maps.
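Since the mystery client is not an FS client, the mon session listings ('ceph daemon mon.<ID> sessions' on each mon host) are the place to look. A quick way to sift those listings is to grep the client IDs out of the raw text; this is only a sketch, as the exact session output shape varies between releases, and the sample below is invented for illustration.

```python
import re

def client_ids(sessions_text):
    """Extract unique client IDs (e.g. 'client.7524484') from the raw
    output of 'ceph daemon mon.<ID> sessions'.  Rather than depend on a
    version-specific JSON layout, just regex over the text."""
    return sorted(set(re.findall(r"client\.\d+", sessions_text)))

# Hypothetical, abridged sample output:
sample = '''
[
  "MonSession(client.7524484 10.0.0.5:0/123 is open)",
  "MonSession(mds.? 10.0.0.2:0/456 is open)",
  "MonSession(client.451432 10.0.0.9:0/789 is open)"
]
'''
print(client_ids(sample))  # ['client.451432', 'client.7524484']
```

Running this over the concatenated sessions output of all mons would give the full RADOS client list, including MGR-module clients that 'ceph tell mds.* client ls' cannot see.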
The only non-destructive way seems to be to extract the osdmap and create a clean-up command file. The cleanup file should contain a command for every PG with an upmap. To check this, you can execute (see also https://docs.ceph.com/en/latest/man/8/osdmaptool/) # ceph osd getmap > osd.map # osdmaptool osd.map --upmap-cleanup cleanup.cmd If you do this, could you please post as usual the contents of cleanup.cmd? It was empty: [root@ceph1 ~]# ceph osd getmap > osd.map got osdmap epoch 52833 [root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd osdmaptool: osdmap file 'osd.map' writing upmap command output to: cleanup.cmd checking for upmap cleanups [root@ceph1 ~]# wc cleanup.cmd 0 0 0 cleanup.cmd Also, with the OSD map of your cluster, you can simulate certain admin operations and check resulting PG mappings for pools and other things without having to touch the cluster; see https://docs.ceph.com/en/latest/man/8/osdmaptool/. To dig a little bit deeper, could you please post as usual the output of: - ceph pg 1.0 query - ceph pg 7.39d query Oddly, it claims that it doesn't have pgid 1.0. https://pastebin.com/pHh33Dq7 It would also be helpful if you could post the decoded crush map. You can get the map as a txt-file as follows: # ceph osd getcrushmap -o crush-orig.bin # crushtool -d crush-orig.bin -o crush.txt and post the contents of file crush.txt. https://pastebin.com/EtEGpWy3 Did the slow MDS request complete by now? Nope. --Mike ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
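The cleanup.cmd file written by osdmaptool --upmap-cleanup can be checked with more than wc: each non-empty line should be a 'ceph osd rm-pg-upmap-items <pgid>' command (an assumption about the format, consistent with the osdmaptool man page), so the PGs that actually carry upmaps can be listed directly. A minimal sketch:

```python
def pgs_with_upmaps(cleanup_cmd_text):
    """List PGs mentioned in the cleanup file that
    'osdmaptool osd.map --upmap-cleanup cleanup.cmd' writes.
    Assumed line format: 'ceph osd rm-pg-upmap-items <pgid>'.
    An empty file (as in the output above) means no upmaps exist."""
    pgs = []
    for line in cleanup_cmd_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[:3] == ["ceph", "osd", "rm-pg-upmap-items"]:
            pgs.append(parts[3])
    return pgs

print(pgs_with_upmaps(""))                                     # [] - no upmaps
print(pgs_with_upmaps("ceph osd rm-pg-upmap-items 7.39d\n"))   # ['7.39d']
```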
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank, I'll give both of these a try and let you know what happens. Thanks again for your help, --Mike On 10/16/20 12:35 PM, Frank Schilder wrote: Dear Michael, this is a bit of a nut. I can't see anything obvious. I have two hypotheses that you might consider testing. 1) Problem with 1 incomplete PG. In the shadow hierarchy for your cluster I can see quite a lot of nodes like { "id": -135, "name": "node229~hdd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] }, I would have expected that hosts without a device of a certain device class are *excluded* completely from a tree instead of having weight 0. I'm wondering if this could lead to the crush algorithm fail in the way described here: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon . This might be a long shot, but could you export your crush map and play with the tunables as described under this link to see if more tries lead to a valid mapping? Note that testing this is harmless and does not change anything on the cluster. > The hypothesis here is that buckets with weight 0 are not excluded from drawing a-priori, but a-posteriori. If there are too many draws of an empty bucket, a mapping fails. Allowing more tries should then lead to success. We should at least rule out this possibility. 2) About the incomplete PG. I'm wondering if the problem is that the pool has exactly 1 PG. I don't have a test pool with Nautilus and cannot try this out. Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? If not, can you then increase pg_num and pgp_num to, say, 10 and see if this has any effect? I'm wondering here if there needs to be a minimum number >1 of PGs in a pool. Again, this is more about ruling out a possibility than expecting success. As an extension to this test, you could increase pg_num and pgp_num of the pool device_health_metrics to see if this has any effect. 
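The weight-0 bucket hypothesis above can be illustrated with a toy calculation: if an empty bucket is only rejected *after* being drawn, and each draw is uniform over all buckets, then a mapping fails exactly when every allowed try lands on an empty bucket. This deliberately ignores how straw2 actually weights draws; it only shows why raising choose_total_tries could rescue a mapping. The bucket counts and the "default-ish" tries value are made up for the example.

```python
def p_all_draws_empty(n_empty, n_total, tries):
    """Toy model: probability that every one of 'tries' uniform draws
    lands on one of the n_empty weight-0 buckets, i.e. the mapping
    fails.  This is (n_empty / n_total) ** tries."""
    return (n_empty / n_total) ** tries

# Many weight-0 host buckets in the shadow tree, few real ones:
few  = p_all_draws_empty(n_empty=30, n_total=36, tries=19)
many = p_all_draws_empty(n_empty=30, n_total=36, tries=100)
print(f"P(mapping fails) with 19 tries:  {few:.4f}")
print(f"P(mapping fails) with 100 tries: {many:.2e}")
```

The failure probability drops geometrically with the number of tries, which is exactly the behavior the "crush gives up too soon" tunables test probes for.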
The crush rules and crush tree look OK to me. I can't really see why the missing OSDs are not assigned to the two PGs 1.0 and 7.39d. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:41:29 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, Please mark OSD 41 as "in" again and wait for some slow ops to show up. I forgot. "wait for some slow ops to show up" ... and then what? Could you please go to the host of the affected OSD and look at the output of "ceph daemon osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and check what type of operations get stuck? I'm wondering if it's administrative, like peering attempts. Best regards, ========= Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:09:20 To: Michael Thomas; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, thanks for this initial work. I will need to look through the files you posted in more detail. In the meantime: Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far as I can see, marking it "out" might have cleared hanging slow ops (there were 1000 before), but they then started piling up again. From the OSD log it looks like an operation that is sent to/from PG 1.0, which doesn't respond because it is inactive. Hence, getting PG 1.0 active should resolve this issue (later). It's a bit strange that I see slow ops for OSD 41 in the latest health detail (https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report was created? I think we might have misunderstood my question 6. My question was whether or not each host bucket corresponds to a physical host and vice versa, that is, each physical host has exactly 1 host bucket.
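To check what type of operations get stuck without reading the ops JSON by eye, the dump can be grouped by the leading token of each op description. The JSON shape ({'ops': [{'description': ...}]}) and the sample descriptions below are assumptions for illustration, not verified against a live cluster.

```python
import json

def summarize_ops(ops_json_text):
    """Group the ops reported by 'ceph daemon osd.<ID> ops' (or
    dump_historic_slow_ops) by the leading token of their description,
    e.g. 'osd_op' vs administrative traffic like peering queries."""
    counts = {}
    for op in json.loads(ops_json_text).get("ops", []):
        kind = op["description"].split("(")[0]
        counts[kind] = counts.get(kind, 0) + 1
    return counts

# Hypothetical sample dump:
sample = json.dumps({"ops": [
    {"description": "osd_op(client.7524484.0:42 1.0 ...)", "age": 8133.0},
    {"description": "osd_op(client.7524484.0:43 1.0 ...)", "age": 8120.0},
    {"description": "pg_query(7.39d ...)", "age": 12.0},
]})
print(summarize_ops(sample))  # {'osd_op': 2, 'pg_query': 1}
```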
I'm asking because it is possible to have multiple host buckets assigned to a single physical host and this has implications on how to manage things. Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I can see), the problem is that is has no OSDs assigned. I need to look a bit longer at the data you uploaded to find out why. I can't see anything obvious. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 16 October 2020 02:08:01 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD
[ceph-users] Re: multiple OSD crash, unfound objects
On 10/14/20 3:49 PM, Frank Schilder wrote: Hi Michael, it doesn't look too bad. All degraded objects are due to the undersized PG. If this is an EC pool with m>=2, data is currently not in danger. I see a few loose ends to pick up, let's hope this is something simple. For any of the below, before attempting the next step, please wait until all induced recovery IO has completed before continuing. 1) Could you please paste the output of the following commands to pastebin (bash syntax): ceph osd pool get device_health_metrics all https://pastebin.com/6D83mjsV ceph osd pool get fs.data.archive.frames all https://pastebin.com/7XAaQcpC ceph pg dump |& grep -i -e PG_STAT -e "^7.39d" https://pastebin.com/tBLaq63Q ceph osd crush rule ls https://pastebin.com/6f5B778G ceph osd erasure-code-profile ls https://pastebin.com/uhAaMH1c ceph osd crush dump # this is a big one, please be careful with copy-paste (see point 3 below) https://pastebin.com/u92D23jV 2) I don't see any IO reported (neither user nor recovery). Could you please confirm that the command outputs were taken during a zero-IO period? That's correct, there was no activity at this time. Access to the cephfs filesystem is very bursty, varying from completely idle to multiple GB/s (read). 3) Something is wrong with osd.41. Can you check its health status with smartctl? If it is reported healthy, give it one more clean restart. If the slow ops do not disappear, it could be a disk fail that is not detected by health monitoring. You could set it to "out" and see if the cluster recovers to a healthy state (modulo the currently degraded objects) with no slow ops. If so, I would replace the disk. smartctl reports no problems. osd.41 (and osd.0) was one of the original OSDs used for the device_health_metrics pool. Early on, before I knew better, I had removed this OSD (and osd.0) from the cluster, and the OSD ids got recycled when new disks were later added. 
This is when the slow ops on osd.0 and osd.41 started getting reported. On advice from another user on ceph-users, I updated my crush map to remap the device_health_metrics pool to a different set of OSDs (and the slow ops persisted). osd.0 usually also shows slow ops. I was a little surprised that it didn't when I took this snapshot, but now it does. I have now run 'ceph osd out 41', and the recovery I/O has finished. With the exception of one less OSD marked in, the output of 'ceph status' looks the same. The last few lines of the osd.41 logfile are here: https://pastebin.com/k06aArW4 How long does it take for ceph to clear the slow ops status? 4) In the output of "df tree" node141 shows up twice. Could you confirm that this is a copy-paste error or is this node indeed twice in the output? This is easiest to see in the pastebin when switching to "raw" view. This was a copy/paste error. 5) The crush tree contains an empty host bucket (node308). Please delete this host bucket (ceph osd crush rm node308) for now and let me know if this caused any data movements (recovery IO). This did not cause any data movement, according to 'ceph status'. 6) The crush tree looks a bit exotic. Do the nodes with a single OSD correspond to a physical host with 1 OSD disk? If not, could you please state how the host buckets are mapped onto physical hosts? Each OSD corresponds to a single physical disk. Hosts may have 1, 2 or 3 OSDs of varying types (HDD, SSD, or SSD+NVME). There are a few different crush types used in the cluster: 3 x replicated nvme - used for cephfs metadata 3 x replicated SSD - used for ovirt block storage EC HDD - used for the bulk of the experiment data EC SSD - used for frequently accessed experiment data 7) In case there was a change to the health status, could you please include an updated "ceph health detail"? 
Looks like the only difference is a new slow MDS op, and one PG that hasn't been deep scrubbed in the last week: https://pastebin.com/3G3ij9ui --Mike I don't expect to get the incomplete PG resolved with the above, but it will move some issues out of the way before proceeding. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 14 October 2020 20:52:10 To: Andreas John; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Hello, The original cause of the OSD instability has already been fixed. It was due to user jobs (via condor) consuming too much memory and causing the machine to swap. The OSDs didn't actually crash, but weren't responding in time and were being flagged as down. In most cases, the problematic OSD servers were also not responding on the console and had to be physically power cycled to recover. Since adding additional memory limits to user jobs, we have only had 1 or
[ceph-users] Re: multiple OSD crash, unfound objects
Hello, The original cause of the OSD instability has already been fixed. It was due to user jobs (via condor) consuming too much memory and causing the machine to swap. The OSDs didn't actually crash, but weren't responding in time and were being flagged as down. In most cases, the problematic OSD servers were also not responding on the console and had to be physically power cycled to recover. Since adding additional memory limits to user jobs, we have only had 1 or 2 unstable OSDs that were fixed by killing the remaining rogue user jobs. Regards, --Mike On 10/10/20 9:22 AM, Andreas John wrote: > Hello Mike, > > do your OSDs go down from time to time? I once had an issue with > unrecoverable objects, because I had only n+1 (size 2) redundancy and > ceph wasn't able to decide what's the correct copy of the object. In my > case there were half-deleted snapshots in one of the copies. I used > ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD > logs? Do the OSDs go down with an obscure stacktrace (and maybe they are > restarted by systemd ...) > > rgds, > > j. > > > > On 09.10.20 22:33, Michael Thomas wrote: >> Hi Frank, >> >> That was a good tip. I was able to move the broken files out of the >> way and restore them for users. However, after 2 weeks I'm still left >> with unfound objects. Even more annoying, I now have 82k objects >> degraded (up from 74), which hasn't changed in over a week. >> >> I'm ready to claim that the auto-repair capabilities of ceph are not >> able to fix my particular issues, and will have to continue to >> investigate alternate ways to clean this up, including a pg >> export/import (as you suggested) and perhaps a mds backward scrub >> (after testing in a junk pool first). >> >> I have other tasks I need to perform on the filesystem (removing OSDs, >> adding new OSDs, increasing PG count), but I feel like I need to >> address these degraded/lost objects before risking any more damage.
>> >> One particular PG is in a curious state: >> >> 7.39d 82163 82165 246734 1 344060777807 0 >> 0 2139 active+recovery_unfound+undersized+degraded+remapped 23m >> 50755'112549 50766:960500 [116,72,122,48,45,131,73,81]p116 >> [71,109,99,48,45,90,73,NONE]p71 2020-08-13T23:02:34.325887-0500 >> 2020-08-07T11:01:45.657036-0500 >> >> Note the 'NONE' in the acting set. I do not know which OSD this may >> have been, nor how to find out. I suspect (without evidence) that >> this is part of the cause of no action on the degraded and misplaced >> objects. >> >> --Mike >> >> On 9/18/20 11:26 AM, Frank Schilder wrote: >>> Dear Michael, >>> >>> maybe there is a way to restore access for users and solve the issues >>> later. Someone else with a lost/unfound object was able to move the >>> affected file (or directory containing the file) to a separate >>> location and restore the now missing data from backup. This will >>> "park" the problem of cluster health for later fixing. >>> >>> Best regards, >>> = >>> Frank Schilder >>> AIT Risø Campus >>> Bygning 109, rum S14 >>> >>> >>> From: Frank Schilder >>> Sent: 18 September 2020 15:38:51 >>> To: Michael Thomas; ceph-users@ceph.io >>> Subject: [ceph-users] Re: multiple OSD crash, unfound objects >>> >>> Dear Michael, >>> >>>> I disagree with the statement that trying to recover health by deleting >>>> data is a contradiction. In some cases (such as mine), the data in >>>> ceph >>>> is backed up in another location (eg tape library). Restoring a few >>>> files from tape is a simple and cheap operation that takes a minute, at >>>> most. >>> >>> I would agree with that if the data was deleted using the appropriate >>> high-level operation. Deleting an unfound object is like marking a >>> sector on a disk as bad with smartctl. How should the file system >>> react to that? Purging an OSD is like removing a disk from a raid >>> set. Such operations increase inconsistencies/degradation rather than >>> resolving them.
Cleaning this up also requires executing other >>> operations to remove all references to the object and, finally, the >>> file inode itself. >>> >>> The ls on a dir with corrupted file(s) hangs if ls calls stat on >>> every file. For example, when coloring is enabled, ls will
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank, Thanks for taking the time to help out with this. Here is the output you requested: ceph status: https://pastebin.com/v8cJJvjm ceph health detail: https://pastebin.com/w9wWLGiv ceph osd pool stats: https://pastebin.com/dcJTsXE1 ceph osd df tree: https://pastebin.com/LaZcBemC I removed one object following a troubleshooting guide, and removed one OSD (weeks ago) as part of a server upgrade. I have not removed any PGs. A couple of notes about some of the output: The '1 pg inactive' is from the 'device_health_metrics' pool. It has been broken since the very beginning of my ceph deployment. Fixing this would be nice, but not the focus of my current issues. The 8 PGs that have not been deep-scrubbed are the same ones that are marked "recovery_unfound". I suspect that ceph won't deep scrub these until they are active+clean. I have restarted (systemctl restart ceph-osd@XXX) and rebooted (init 6) all OSDs (one at a time) for PG 7.39d, with no change in the number of degraded objects. The only difference in the 'ceph status' output before and after the object removal is the number of degraded objects (went down by 1) and degraded PGs (went down by 1). Regards, --Mike On 10/10/20 5:14 AM, Frank Schilder wrote: > Dear Michael, > >> I have other tasks I need to perform on the filesystem (removing OSDs, >> adding new OSDs, increasing PG count), but I feel like I need to address >> these degraded/lost objects before risking any more damage. > > I would probably not attempt any such maintenance before there was a period > of at least 1 day with HEALTH_OK. The reason is that certain historical > information is not trimmed unless the cluster is in HEALTH_OK. The more such > information is accumulated, the more risk one runs that a cluster becomes > unstable. > > Can you post the output of ceph status, ceph health detail, ceph osd pool > stats and ceph osd df tree (on pastebin.com)? If I remember correctly, you > removed OSDs/PGs following a trouble-shooting guide? 
I suspect that the > removal has left something in an inconsistent state that requires manual > clean up for recovery to proceed. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Michael Thomas > Sent: 09 October 2020 22:33:46 > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects > > Hi Frank, > > That was a good tip. I was able to move the broken files out of the way > and restore them for users. However, after 2 weeks I'm still left with > unfound objects. Even more annoying, I now have 82k objects degraded > (up from 74), which hasn't changed in over a week. > > I'm ready to claim that the auto-repair capabilities of ceph are not > able to fix my particular issues, and will have to continue to > investigate alternate ways to clean this up, including a pg > export/import (as you suggested) and perhaps a mds backward scrub (after > testing in a junk pool first). > > I have other tasks I need to perform on the filesystem (removing OSDs, > adding new OSDs, increasing PG count), but I feel like I need to address > these degraded/lost objects before risking any more damage. > > One particular PG is in a curious state: > > 7.39d 82163 82165 246734 1 344060777807 0 > > 0 2139 active+recovery_unfound+undersized+degraded+remapped > 23m 50755'112549 50766:960500 [116,72,122,48,45,131,73,81]p116 > [71,109,99,48,45,90,73,NONE]p71 2020-08-13T23:02:34.325887-0500 > 2020-08-07T11:01:45.657036-0500 > > Note the 'NONE' in the acting set. I do not know which OSD this may > have been, nor how to find out. I suspect (without evidence) that this > is part of the cause of no action on the degraded and misplaced objects. > > --Mike > > On 9/18/20 11:26 AM, Frank Schilder wrote: >> Dear Michael, >> >> maybe there is a way to restore access for users and solve the issues later.
>> Someone else with a lost/unfound object was able to move the affected file >> (or directory containing the file) to a separate location and restore the >> now missing data from backup. This will "park" the problem of cluster health >> for later fixing. >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Frank Schilder >> Sent: 18 September 2020 15:38:51 >> To: Michael Thomas; ceph-users@ceph.io >> Subject: [ceph-users] Re: multiple OSD crash, unfound objects >> >> Dear Michael, >> >>> I
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank, That was a good tip. I was able to move the broken files out of the way and restore them for users. However, after 2 weeks I'm still left with unfound objects. Even more annoying, I now have 82k objects degraded (up from 74), which hasn't changed in over a week. I'm ready to claim that the auto-repair capabilities of ceph are not able to fix my particular issues, and will have to continue to investigate alternate ways to clean this up, including a pg export/import (as you suggested) and perhaps a mds backward scrub (after testing in a junk pool first). I have other tasks I need to perform on the filesystem (removing OSDs, adding new OSDs, increasing PG count), but I feel like I need to address these degraded/lost objects before risking any more damage. One particular PG is in a curious state: 7.39d 82163 82165 246734 1 344060777807 0 0 2139 active+recovery_unfound+undersized+degraded+remapped 23m 50755'112549 50766:960500 [116,72,122,48,45,131,73,81]p116 [71,109,99,48,45,90,73,NONE]p71 2020-08-13T23:02:34.325887-0500 2020-08-07T11:01:45.657036-0500 Note the 'NONE' in the acting set. I do not know which OSD this may have been, nor how to find out. I suspect (without evidence) that this is part of the cause of no action on the degraded and misplaced objects. --Mike On 9/18/20 11:26 AM, Frank Schilder wrote: Dear Michael, maybe there is a way to restore access for users and solve the issues later. Someone else with a lost/unfound object was able to move the affected file (or directory containing the file) to a separate location and restore the now missing data from backup. This will "park" the problem of cluster health for later fixing.
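For anyone wanting to script the check for holes like this, the acting set can be pulled out of a 'ceph pg dump' row and scanned for NONE entries. A sketch, using the row quoted above (whitespace normalized); the regex is an assumption about the bracketed-set notation, which has been stable in recent releases:

```python
import re

def acting_set(pg_dump_row):
    """Pull the acting set out of a 'ceph pg dump' row, where it appears
    as e.g. '[71,109,99,48,45,90,73,NONE]p71'.  Returns the list of
    entries plus the shard positions that have no OSD assigned."""
    sets = re.findall(r"\[([0-9A-Za-z,]+)\]p\S+", pg_dump_row)
    acting = sets[-1].split(",")  # last bracketed set is the acting set
    holes = [i for i, osd in enumerate(acting) if osd == "NONE"]
    return acting, holes

row = ("7.39d 82163 82165 246734 1 344060777807 0 0 2139 "
       "active+recovery_unfound+undersized+degraded+remapped 23m "
       "50755'112549 50766:960500 [116,72,122,48,45,131,73,81]p116 "
       "[71,109,99,48,45,90,73,NONE]p71")
acting, holes = acting_set(row)
print(acting, holes)  # holes == [7]: the last EC shard has no OSD assigned
```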
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 18 September 2020 15:38:51 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (eg tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most. I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a raid set. Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself. The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls. An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave? Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" does probably just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer.
On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that (some of) these operations have very low priority and will not start at least as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet. The output of "pg query" contains the IDs of the missing objects (in mimic) and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s). I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you executed and giving the cluster time to complete the
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Frank, Yes, it does sounds similar to your ticket. I've tried a few things to restore the failed files: * Locate a missing object with 'ceph pg $pgid list_unfound' * Convert the hex oid to a decimal inode number * Identify the affected file with 'find /ceph -inum $inode' At this point, I know which file is affected by the missing object. As expected, attempts to read the file simply hang. Unexpectedly, attempts to 'ls' the file or its containing directory also hang. I presume from this that the stat() system call needs some information that is contained in the missing object, and is waiting for the object to become available. Next I tried to remove the affected object with: * ceph pg $pgid mark_unfound_lost delete Now 'ceph status' shows one fewer missing objects, but attempts to 'ls' or 'rm' the affected file continue to hang. Finally, I ran a scrub over the part of the filesystem containing the affected file: ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive Nothing seemed to come up during the scrub: 2020-09-17T14:56:15.208-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 2020-09-17T14:58:58.013-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub start {path=/frames/postO3/hoft,prefix=scrub start,scrubops=[recursive]} (starting...) 2020-09-17T14:58:58.013-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: active 2020-09-17T14:58:58.014-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub queued for path: /frames/postO3/hoft 2020-09-17T14:58:58.014-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: active [paths:/frames/postO3/hoft] 2020-09-17T14:59:02.535-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 2020-09-17T15:00:12.520-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 
2020-09-17T15:02:32.944-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: idle 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a' 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub completed for path: /frames/postO3/hoft 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: idle After the scrub completed, access to the file (ls or rm) continue to hang. The MDS reports slow reads: 2020-09-17T15:11:05.654-0500 7f39b9a1e700 0 log_channel(cluster) log [WRN] : slow request 481.867381 seconds old, received at 2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309 getattr pAsLsXsFs #0x105b1c0 2020-09-17T15:03:03.787602-0500 caller_uid=0, caller_gid=0{}) currently dispatched Does anyone have any suggestions on how else to clean up from a permanently lost object? --Mike On 9/16/20 2:03 AM, Frank Schilder wrote: Sounds similar to this one: https://tracker.ceph.com/issues/46847 If you have or can reconstruct the crush map from before adding the OSDs, you might be able to discover everything with the temporary reversal of the crush map method. Not sure if there is another method, i never got a reply to my question in the tracker. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 16 September 2020 01:27:19 To: ceph-users@ceph.io Subject: [ceph-users] multiple OSD crash, unfound objects Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added more OSDs to the cluster. After bringing all of the OSDs back up, I have 25 unfound objects and 75 degraded objects. 
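The hex-oid-to-inode step from the procedure above is simple enough to script: CephFS data objects are named '<inode-hex>.<block-hex>', so the inode number is just the first component parsed as hex. The object name below is a made-up example, not one of the actual unfound objects:

```python
def oid_to_inode(oid):
    """Convert a CephFS data object name ('<inode-hex>.<block-hex>',
    as listed by 'ceph pg <pgid> list_unfound') to the decimal inode
    number usable with 'find /ceph -inum <inode>'."""
    return int(oid.split(".")[0], 16)

# Hypothetical object name:
print(oid_to_inode("10000000abc.00000000"))  # 1099511630524
print(hex(1099511630524))                    # 0x10000000abc, the reverse check
```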
There are other problems reported, but I'm primarily concerned with these unfound/degraded objects. The pool with the missing objects is a cephfs pool. The files stored in the pool are backed up on tape, so I can easily restore individual files as needed (though I would not want to restore the entire filesystem). I tried following the guide at https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. I found a number of OSDs that are still 'not queried'. Restarting a sampling of these OSDs changed the state from 'not queried' to 'already probed', but that did not recover any of the unfound or degraded objects. I have also tried 'ceph pg deep-scrub' on the affected PGs, but never saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on the affected PGs, but only one seems to have been tagged accordingly (see ceph -s output below). The guide also says "Sometimes it simply takes some time for the cluster to query possible locations." I'm no
[ceph-users] multiple OSD crash, unfound objects
Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added more OSDs to the cluster.

After bringing all of the OSDs back up, I have 25 unfound objects and 75 degraded objects. There are other problems reported, but I'm primarily concerned with these unfound/degraded objects. The pool with the missing objects is a cephfs pool. The files stored in the pool are backed up on tape, so I can easily restore individual files as needed (though I would not want to restore the entire filesystem).

I tried following the guide at https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. I found a number of OSDs that are still 'not queried'. Restarting a sampling of these OSDs changed the state from 'not queried' to 'already probed', but that did not recover any of the unfound or degraded objects.

I have also tried 'ceph pg deep-scrub' on the affected PGs, but never saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on the affected PGs, but only one seems to have been tagged accordingly (see ceph -s output below).

The guide also says "Sometimes it simply takes some time for the cluster to query possible locations." I'm not sure how long "some time" might take, but it hasn't changed after several hours.

My questions are:

* Is there a way to force the cluster to query the possible locations sooner?

* Is it possible to identify the files in cephfs that are affected, so that I could delete only the affected files and restore them from backup tapes?
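On the second question: in a cephfs data pool, each object's name encodes the owning file's inode number in hex (`<inode-hex>.<stripe-index>`), so unfound object ids can be mapped back to files. A sketch under that assumption; the object names and mount point below are made up, and on a real cluster the names come from `ceph pg <pgid> list_unfound`:

```shell
# Sketch: map unfound RADOS object names back to cephfs inode numbers.
# In a cephfs data pool an object is named "<inode-hex>.<stripe-index>".
# These sample names are hypothetical stand-ins for `ceph pg <pgid>
# list_unfound` output.
unfound="105b1c0.00000000
105b1c0.00000001
10000a2.00000003"

# Strip the stripe suffix and de-duplicate: one file owns many objects.
inodes=$(echo "$unfound" | cut -d. -f1 | sort -u)

for hex in $inodes; do
    # Convert the hex inode to decimal for find(1).
    ino=$(printf '%d' "0x$hex")
    # find resolves an inode number to a path on a mounted filesystem.
    echo "find /mnt/cephfs -inum $ino"
done
```

Running the emitted `find` commands against the cephfs mount point yields the paths of the affected files, which can then be deleted and restored from tape individually.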
--Mike

ceph -s:

  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_ERR
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            25/78520351 objects unfound (0.000%)
            2 nearfull osd(s)
            Reduced data availability: 1 pg inactive
            Possible data damage: 9 pgs recovery_unfound
            Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
            1013 pgs not deep-scrubbed in time
            1013 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.

  services:
    mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
    mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
    mds: archive:1 {0=ceph4=up:active} 3 up:standby
    osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

  task status:
    scrub status:
        mds.ceph4: idle

  data:
    pools:   9 pools, 2433 pgs
    objects: 78.52M objects, 298 TiB
    usage:   412 TiB used, 545 TiB / 956 TiB avail
    pgs:     0.041% pgs unknown
             75/626645098 objects degraded (0.000%)
             135224/626645098 objects misplaced (0.022%)
             25/78520351 objects unfound (0.000%)
             2421 active+clean
             5    active+recovery_unfound+degraded
             3    active+recovery_unfound+degraded+remapped
             2    active+clean+scrubbing+deep
             1    unknown
             1    active+forced_recovery+recovery_unfound+degraded

  progress:
    PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
      []
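The 'not queried' peers mentioned in the post show up in the `might_have_unfound` section of `ceph pg <pgid> query` output. A small sketch for pulling them out so they can be restarted one by one; the fragment below is a trimmed, hypothetical sample of that output, not from this cluster:

```shell
# Sketch: list the peers a PG's primary has still "not queried" about
# unfound objects.  $query is a trimmed, hypothetical fragment of the
# "might_have_unfound" section of `ceph pg <pgid> query`.
query='
        "might_have_unfound": [
            { "osd": "12(3)", "status": "already probed" },
            { "osd": "37(5)", "status": "not queried" },
            { "osd": "88(1)", "status": "osd is down" }
        ],
'

# Keep only entries whose status is "not queried", then extract the OSD id.
not_queried=$(echo "$query" | grep '"not queried"' | sed 's/.*"osd": "\([^"]*\)".*/\1/')
echo "$not_queried"
```

Restarting (or briefly marking down) exactly those OSDs is a more targeted way to force probing than restarting a random sampling.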
[ceph-users] Re: pg stuck in unknown state
On 8/11/20 2:52 AM, Wido den Hollander wrote:

On 11/08/2020 00:40, Michael Thomas wrote:

On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state. It appears to belong to the device_health_metrics pool, which was created automatically by the mgr daemon(?). The OSDs that the PG maps to are all online and serving other PGs. But when I list the PGs that belong to the OSDs from 'ceph pg map', the offending PG is not listed.

# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0 0 0 0 0 0 0 0 0 0 0 unknown 2020-08-08T09:30:33.251653-0500 0'0 0:0 [] -1 [] -1 0'0 2020-08-08T09:30:33.251653-0500 0'0 2020-08-08T09:30:33.251653-0500 0

# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
  nothing is going on

# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]

What can be done to fix the PG? I tried doing a 'ceph pg repair 1.0', but that didn't seem to do anything. Is it safe to try to update the crush_rule for this pool so that the PG gets mapped to a fresh set of OSDs?

Yes, it would be. But still, it's weird, mainly as the acting set is so different from the up set. You have different CRUSH rules, I think?

Marking those OSDs down might work, but otherwise change the crush_rule and see how that goes.

Yes, I do have different crush rules to help map certain types of data to different classes of hardware (EC HDDs, replicated SSDs, replicated nvme). The default crush rule for the device_health_metrics pool was to use replication across any storage device. I changed it to use the replicated nvme crush rule, and now the map looks different:

# ceph pg map 1.0
osdmap e7256 pg 1.0 (1.0) -> up [24,22,12] acting [41,0]

However, the acting set of OSDs has not changed.

--Mike
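Wido's "marking those OSDs down might work" can be scripted. Marking an OSD down (not out) makes it immediately re-assert itself and forces the PG to re-peer against the current up set, without moving any data. A dry-run sketch using the acting set from the thread (commands are echoed rather than executed):

```shell
# Sketch: nudge PG 1.0's stale acting set by marking its acting OSDs down.
# A down (not out) OSD re-asserts itself right away, which triggers
# re-peering; no data is rebalanced.  DRY_RUN echoes instead of executing.
DRY_RUN=echo

# Acting set reported by `ceph pg map 1.0` in the thread
acting_osds="41 0"

for osd in $acting_osds; do
    $DRY_RUN ceph osd down "$osd"
done
```

If the PG still refuses to peer after that, restarting the primary OSD daemon itself is the heavier-handed variant of the same idea.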
[ceph-users] pg stuck in unknown state
On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state. It appears to belong to the device_health_metrics pool, which was created automatically by the mgr daemon(?). The OSDs that the PG maps to are all online and serving other PGs. But when I list the PGs that belong to the OSDs from 'ceph pg map', the offending PG is not listed.

# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0 0 0 0 0 0 0 0 0 0 0 unknown 2020-08-08T09:30:33.251653-0500 0'0 0:0 [] -1 [] -1 0'0 2020-08-08T09:30:33.251653-0500 0'0 2020-08-08T09:30:33.251653-0500 0

# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
  nothing is going on

# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]

What can be done to fix the PG? I tried doing a 'ceph pg repair 1.0', but that didn't seem to do anything. Is it safe to try to update the crush_rule for this pool so that the PG gets mapped to a fresh set of OSDs?

--Mike
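The giveaway for this state in `ceph pg dump pgs` output is a state of 'unknown' combined with empty up and acting sets (`[] -1 [] -1`). A sketch for spotting such PGs; the dump lines below are trimmed, hypothetical samples (the real output has many more columns, so the field positions would need adjusting):

```shell
# Sketch: find never-peered PGs in (trimmed, hypothetical) `ceph pg dump pgs`
# output: state "unknown" plus an empty up set.  Real dump lines have more
# columns, so $6/$7 are only correct for this trimmed sample.
dump='1.0 0 0 0 0 unknown [] -1 [] -1
7.1a 0 0 0 0 active+clean [41,40,2] 41 [41,40,2] 41'

stuck=$(echo "$dump" | awk '$6 == "unknown" && $7 == "[]" {print $1}')
echo "$stuck"
```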