[ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Manuel Lausch
Hi, I see an issue with systemd's restart behaviour and disk I/O errors. If a disk fails with I/O errors, ceph-osd stops running. Systemd detects this and starts the daemon again. In our cluster I saw some loops of OSD crashes caused by disk failure, followed by restarts triggered by systemd. Every time w
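The crash-and-restart loop described above can be illustrated with a small simulation. This is only a sketch of a generic sliding-window restart limiter, similar in spirit to systemd's start-rate limiting but not its actual implementation; the names `RestartLimiter` and `simulate_failing_disk` and all numeric values are hypothetical and are not taken from the ceph-osd unit file.

```python
from collections import deque


class RestartLimiter:
    """Allow at most `burst` restarts within any window of `interval` seconds.

    Hypothetical helper for illustration only; not systemd's real logic.
    """

    def __init__(self, burst, interval):
        self.burst = burst
        self.interval = interval
        self._starts = deque()  # timestamps of restarts that were granted

    def allow_restart(self, now):
        # Forget restarts that have fallen out of the sliding window.
        while self._starts and now - self._starts[0] >= self.interval:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False  # limit hit: the daemon stays down
        self._starts.append(now)
        return True


def simulate_failing_disk(limiter, crash_every, duration):
    """Count the restarts granted to a daemon that crashes every `crash_every` seconds."""
    restarts = 0
    t = 0.0
    while t < duration:
        if not limiter.allow_restart(t):
            break  # the supervisor gives up, so the loop ends
        restarts += 1
        t += crash_every
    return restarts


if __name__ == "__main__":
    # Wide window: the loop is cut off once the burst is used up.
    print(simulate_failing_disk(RestartLimiter(burst=3, interval=1800), 20, 3600))
    # Narrow window: old restarts expire before the burst fills, so it never stops.
    print(simulate_failing_disk(RestartLimiter(burst=3, interval=30), 20, 100))
```

The second case shows the behaviour reported in the thread: if the window is short relative to the time between crashes, a daemon on a failing disk keeps being restarted indefinitely.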

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Wido den Hollander
> On 19 September 2017 at 10:02, Manuel Lausch wrote:
>
>
> Hi,
>
> I see an issue with systemd's restart behaviour and disk IO-errors
> If a disk fails with IO-errors ceph-osd stops running. Systemd detects
> this and starts the daemon again. In our cluster I did see some loops
> with osd cr

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Adrian Saul
> I understand what you mean and it's indeed dangerous, but see:
> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>
> Looking at the systemd docs it's difficult though:
> https://www.freedesktop.org/software/systemd/man/systemd.service.html
>
> If the OSD crashes due to ano

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Manuel Lausch
On Tue, 19 Sep 2017 08:24:48 + Adrian Saul wrote:
> > I understand what you mean and it's indeed dangerous, but see:
> > https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
> >
> > Looking at the systemd docs it's difficult though:
> > https://www.freedesktop.org/software/s

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Wido den Hollander
> On 19 September 2017 at 10:24, Adrian Saul wrote:
>
>
> > I understand what you mean and it's indeed dangerous, but see:
> > https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
> >
> > Looking at the systemd docs it's difficult though:
> > https://www.freedesktop.org/soft

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Stanley Zhang
I like this; there are some similar ideas we can probably borrow from Cassandra on disk failure:

# policy for data disk failures:
# die: shut down gossip and Thrift and kill the JVM for any fs errors or
#      single-sstable errors, so the node can be replaced.
# stop_paranoid: shut down gossip a

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-20 Thread Matthew Vernon
On 19/09/17 10:40, Wido den Hollander wrote:
>
>> On 19 September 2017 at 10:24, Adrian Saul wrote:
>>
>>
>>> I understand what you mean and it's indeed dangerous, but see:
>>> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>>>
>>> Looking at the systemd docs it's diffi