Hi,

These are the defaults set by cephadm in Octopus and Pacific:

---snip---
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash {data_dir}/{fsid}/%i/unit.run
ExecStop=-{container_path} stop ceph-{fsid}-%i
ExecStopPost=-/bin/bash {data_dir}/{fsid}/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
---snip---

So there are StartLimit options.
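
If you want different values on a specific host without touching the unit file cephadm generates, a per-instance systemd drop-in should do it. Just a sketch, untested: the fsid, OSD id and the numbers are placeholders, and note that newer systemd wants these options in the [Unit] section under the names StartLimitIntervalSec/StartLimitBurst:

---snip---
# systemctl edit ceph-<fsid>@osd.3.service
[Unit]
# allow at most 3 restarts within 30 minutes, then give up
StartLimitIntervalSec=30min
StartLimitBurst=3

[Service]
# wait a bit longer between restart attempts
RestartSec=30s
---snip---

The drop-in ends up in /etc/systemd/system/ceph-<fsid>@osd.3.service.d/, separate from the generated unit, so it should survive a redeploy, though I haven't verified that across cephadm upgrades.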

> What are other options to prevent OSD containers from trying to restart after a valid crash?

The question is how you determine a "valid" crash. I wouldn't want the first crash to result in an OSD being marked out; first I would try to get to the bottom of the root cause of the crash. Of course, if there are signs of a disk failure, it's only a matter of time until the OSD won't recover. But since a lot of other things can kill a process, I would want Ceph to try to bring the OSDs back online. I think the defaults are a reasonable compromise, although one might argue about the specific values, of course.
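
For what it's worth, this is roughly what I would look at first before deciding to fail an OSD out for good (crash id, fsid, OSD id and device are placeholders):

---snip---
# crash reports collected by the mgr crash module
ceph crash ls
ceph crash info <crash-id>

# journal of the failing OSD container
journalctl -u ceph-<fsid>@osd.3.service --since "-1h"

# health of the backing device
smartctl -a /dev/sdX
---snip---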

Regards,
Eugen


Quoting "Frank de Bot (lists)" <li...@searchy.net>:

Hi,

I've got a small containerized Ceph cluster rolled out with ceph-ansible. The WAL and DB for each drive are on a separate NVMe drive, and the data is on spinning SAS disks. The cluster is running 16.2.7. Today a disk failed, but not quite catastrophically: the block device is present and the LVM metadata is good, but reading certain blocks gives 'Sense: Unrecovered read error' in the syslog (SMART indicates the drive is failing). The OSD crashes on reading/writing.

But the container kept restarting and crashing until I intervened manually. Because of this, the faulty OSD was flapping up and down, so it was never marked out and the cluster never rebalanced. I could set StartLimitIntervalSec and StartLimitBurst in the OSD service file, but they're not there by default and I'd like to keep everything as standard as possible. What are other options to prevent OSD containers from trying to restart after a valid crash?

Regards,

Frank de Bot

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
