Re: [ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-23 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> So, first question is: why didn't that OSD get detected as failing
> much earlier?

We have noticed that "mon osd adjust heartbeat grace" made the cluster "realize" OSDs going down _much_ later than the MONs / OSDs themselves. Setting this
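For reference (not from the thread itself), a minimal ceph.conf sketch of the options involved, assuming the stock Luminous option names and defaults:

    [global]
    # Base window an OSD may miss peer heartbeats before it is reported
    # failed to the mons (upstream default: 20 seconds).
    osd heartbeat grace = 20

    [mon]
    # When true (the default), the mons stretch the effective grace based
    # on laggy-OSD history; false keeps the grace fixed at the value above.
    mon osd adjust heartbeat grace = false

The value actually in effect on a running daemon can be read back over the admin socket, e.g.:

    ceph daemon osd.0 config get osd_heartbeat_grace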

Re: [ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-23 Thread Gregory Farnum
On Mon, Jan 22, 2018 at 8:46 PM, Dan van der Ster wrote:
> Here's a bit more info as I read the logs. Firstly, these are in fact
> Filestore OSDs... I was confused, but I don't think it makes a big
> difference.
>
> Next, all the other OSDs had indeed noticed that osd.2 had

Re: [ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-22 Thread Dan van der Ster
Here's a bit more info as I read the logs. Firstly, these are in fact Filestore OSDs... I was confused, but I don't think it makes a big difference.

Next, all the other OSDs had indeed noticed that osd.2 had failed:

2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check: no reply
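The log excerpt above shows a peer OSD (osd.0) noticing the missed heartbeats; whether osd.2 then got marked down depends on those failure reports actually reaching the monitors. A quick way to cross-check both sides (a sketch, assuming the default log paths, not commands taken from the thread):

    # On an OSD host: did this daemon stop getting heartbeat replies from osd.2?
    grep 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.0.log | grep 'osd.2'

    # On a mon host: did the failure reports arrive and lead to a mark-down?
    grep 'osd.2' /var/log/ceph/ceph.log | grep -i fail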

[ceph-users] Luminous: example of a single down osd taking out a cluster

2018-01-22 Thread Dan van der Ster
Hi all,

We just saw an example of one single down OSD taking down a whole (small) Luminous 12.2.2 cluster. The cluster has only 5 OSDs, on 5 different servers. Three of those servers also run a mon/mgr combo. First, we had one server (mon+osd) go down legitimately [1] -- I can tell when it went
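For anyone following the investigation on their own cluster, the state of the mon quorum and of the individual OSDs after such an event can be checked with the standard status commands (a sketch, not output from this cluster):

    # Overall health, mons in quorum, and OSD up/in counts.
    ceph -s

    # Which monitors are currently in quorum.
    ceph quorum_status --format json-pretty

    # Per-OSD up/down state and placement in the CRUSH tree.
    ceph osd tree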