Hi,

These are the defaults set by cephadm in Octopus and Pacific:

---snip---
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash {data_dir}/{fsid}/%i/unit.run
ExecStop=-{container_path} stop ceph-{fsid}-%i
ExecStopPost=-/bin/bash {data_dir}/{fsid}/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
---snip---

So there are StartLimit options.
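
If you want different values on a specific host without touching the unit file cephadm generates, a per-instance systemd drop-in should do it. Just a sketch, untested: the fsid, OSD id and the numbers are placeholders, and note that newer systemd wants these options in the [Unit] section under the names StartLimitIntervalSec/StartLimitBurst:

---snip---
# systemctl edit ceph-<fsid>@osd.3.service
[Unit]
# allow at most 3 restarts within 30 minutes, then give up
StartLimitIntervalSec=30min
StartLimitBurst=3

[Service]
# wait a bit longer between restart attempts
RestartSec=30s
---snip---

The drop-in ends up in /etc/systemd/system/ceph-<fsid>@osd.3.service.d/, separate from the generated unit, so it should survive a redeploy, though I haven't verified that across cephadm upgrades.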

> What are other options to prevent OSD containers from trying to restart after a valid crash?

The question is how you determine a "valid" crash. I wouldn't want the first crash to result in an OSD being marked out; first I would try to get to the bottom of the root cause of the crash. Of course, if there are signs of a disk failure, it's only a matter of time until the OSD won't recover. But since a lot of other things can kill a process, I would want Ceph to try to bring the OSDs back online. I think the defaults are a reasonable compromise, although one might argue about the specific values, of course.
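
For what it's worth, this is roughly what I would look at first before deciding to fail an OSD out for good (crash id, fsid, OSD id and device are placeholders):

---snip---
# crash reports collected by the mgr crash module
ceph crash ls
ceph crash info <crash-id>

# journal of the failing OSD container
journalctl -u ceph-<fsid>@osd.3.service --since "-1h"

# health of the backing device
smartctl -a /dev/sdX
---snip---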

Regards,
Eugen


Quoting "Frank de Bot (lists)" <li...@searchy.net>:

Hi,

I've got a small containerized Ceph cluster rolled out with ceph-ansible. The WAL and DB for each drive are on a separate NVMe drive, and the data is on spinning SAS disks. The cluster is running 16.2.7. Today a disk failed, but not quite catastrophically: the block device is present and the LVM metadata is good, but reading certain blocks gives 'Sense: Unrecovered read error' in the syslog (SMART indicates the drive is failing). The OSD crashes on reading/writing.

But the container kept restarting and crashing until I intervened manually. Because of this, the faulty OSD was flapping up and down, so it was never marked out and the cluster never rebalanced. I could set StartLimitIntervalSec and StartLimitBurst in the OSD service file, but they're not there by default and I'd like to keep everything as standard as possible. What are other options to prevent OSD containers from trying to restart after a valid crash?

Regards,

Frank de Bot

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
