Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:
In "ceph -s", they "2 osds down" message disappears, and the number of degraded objects steadily decreases. However, after some time the number of degraded objects starts going up and down again, and osds appear to be down (and then up again). After 5 minutes the OSDs are kicked out from the cluster, and the ceph-osd daemons stop Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0 Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt *** Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***
Do you have enough memory on your host? You might want to look for OOM-killer messages in dmesg / the journal and monitor your memory usage throughout the recovery.
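For example, something along these lines (osd.32 and the date are taken from your log, adjust to your hosts; these are just the checks I would try, nothing exhaustive):

    # look for OOM-killer activity in the kernel log
    dmesg -T | grep -i -e "out of memory" -e "killed process"
    journalctl -k --since "2021-12-01" | grep -i -e oom -e "killed process"

    # watch overall memory use on the OSD host while recovery runs
    free -h

    # per-daemon memory, as reported by the OSD itself (run on the OSD host)
    ceph daemon osd.32 dump_mempools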
If the OSD processes are indeed killed by the OOM killer, you have a few options. Adding more memory would probably be best to future-proof the system. You could also tune some Ceph config settings, e.g. lowering osd_max_backfills to reduce recovery pressure (although I'm definitely not an expert on which parameters would give you the best result). Adding swap will most likely only create other problems, but it might be a method of last resort.
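A rough sketch of what I mean, assuming a release with the central config database (Mimic or later); please double-check the defaults for your version before changing anything:

    # see what osd.32 is currently running with
    ceph config show osd.32 osd_max_backfills
    ceph config show osd.32 osd_memory_target

    # lower the number of concurrent backfills per OSD, cluster-wide
    ceph config set osd osd_max_backfills 1

    # my suggestion only: if RAM per OSD is tight, the memory target can
    # also be lowered (value in bytes, here 3 GiB)
    ceph config set osd osd_memory_target 3221225472

The osd_memory_target part is just an idea of mine, not something you mentioned; whether it helps depends on how much RAM per OSD you actually have.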
Cheers,
Sebastian