Quoting Dan van der Ster (d...@vanderster.com):
>
> So, first question is: why didn't that OSD get detected as failing
> much earlier?
We have noticed that "mon osd adjust heartbeat grace" made the cluster
"realize" OSDs going down _much_ later than the MONs / OSDs themselves.
Setting this
On Mon, Jan 22, 2018 at 8:46 PM, Dan van der Ster wrote:
> Here's a bit more info as I read the logs. Firstly, these are in fact
> Filestore OSDs... I was confused, but I don't think it makes a big
> difference.
>
> Next, all the other OSDs had indeed noticed that osd.2 had
Here's a bit more info as I read the logs. Firstly, these are in fact
Filestore OSDs... I was confused, but I don't think it makes a big
difference.
Next, all the other OSDs had indeed noticed that osd.2 had failed:
2018-01-22 18:37:20.456535 7f831728e700 -1 osd.0 598 heartbeat_check: no reply
Hi all,
We just saw an example of a single down OSD taking down a whole
(small) Luminous 12.2.2 cluster.
The cluster has only 5 OSDs, on 5 different servers. Three of those
servers also run a mon/mgr combo.
First, we had one server (mon+osd) go down legitimately [1] -- I can
tell when it went