Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:
In "ceph -s", they "2 osds down" message disappears, and the number of degraded objects steadily decreases. However, after some time the number of degraded objects starts going up and down again, and osds appear to be down (and then up again). After 5 minutes the OSDs are kicked out from the cluster, and the ceph-osd daemons stop Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0 Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt *** Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***
Do you have enough memory on your host? You might want to look for OOM-killer messages in dmesg / the journal and monitor your memory usage throughout the recovery.
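For example, something along these lines (osd.32 and the date are taken from your log, adjust to your hosts; these are just the checks I would try, nothing exhaustive):

    # look for OOM-killer activity in the kernel log
    dmesg -T | grep -i -e "out of memory" -e "killed process"
    journalctl -k --since "2021-12-01" | grep -i -e oom -e "killed process"

    # watch overall memory use on the OSD host while recovery runs
    free -h

    # per-daemon memory, as reported by the OSD itself (run on the OSD host)
    ceph daemon osd.32 dump_mempools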
If the OSD processes are indeed killed by the OOM killer, you have a few options. Adding more memory would probably be best to future-proof the system. You could also tune some Ceph config settings, e.g. lowering osd_max_backfills to reduce recovery pressure (although I'm definitely not an expert on which parameters would give you the best result). Adding swap will most likely only create other problems, but it might be a method of last resort.
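A rough sketch of what I mean, assuming a release with the central config database (Mimic or later); please double-check the defaults for your version before changing anything:

    # see what osd.32 is currently running with
    ceph config show osd.32 osd_max_backfills
    ceph config show osd.32 osd_memory_target

    # lower the number of concurrent backfills per OSD, cluster-wide
    ceph config set osd osd_max_backfills 1

    # my suggestion only: if RAM per OSD is tight, the memory target can
    # also be lowered (value in bytes, here 3 GiB)
    ceph config set osd osd_memory_target 3221225472

The osd_memory_target part is just an idea of mine, not something you mentioned; whether it helps depends on how much RAM per OSD you actually have.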
Cheers,
Sebastian