Thanks Frank.

I see it the same way. I’ll be sure to create a ticket with all the details and 
steps to reproduce the issue.

Denis

> On 24 Nov 2023, at 10:24, Frank Schilder <fr...@dtu.dk> wrote:
> 
> Hi Denis,
> 
> I would agree with you that a single misconfigured host should not take out 
> healthy hosts under any circumstances. I'm not sure your incident is actually 
> covered by the devs' comments; it is quite possible that you observed an 
> unintended side effect, i.e. a bug in how the connection error is handled. I 
> think the intention is to quickly shut down the OSDs that refuse connections 
> (where no timeouts are required), and not other OSDs.
> 
> A bug report in the tracker seems warranted.
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Denis Krienbühl <de...@href.ch>
> Sent: Friday, November 24, 2023 9:01 AM
> To: ceph-users
> Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered
> 
> Hi
> 
> We’ve recently had a serious outage at work, after a host had a network 
> problem:
> 
> - We rebooted a single host in a cluster of fifteen hosts across three racks.
> - The single host had a bad network configuration after booting, causing it 
> to send some packets to the wrong network.
> - One network still worked and offered a connection to the mons.
> - The other network connection was bad. Packets were refused, not dropped.
> - Due to osd_fast_fail_on_connection_refused=true, the broken host caused the 
> mons to mark all other OSDs down immediately, with no grace period (see the 
> config sketch after this list).
> - Only after shutting down the faulty host was it possible to restart the 
> downed OSDs and restore the cluster.
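> 
> A minimal sketch of the knob in question, assuming a release with the 
> centralized config store (and bearing in mind that turning it off also delays 
> detection of genuinely dead daemons):
> 
>   # show the value currently applied to OSDs
>   ceph config get osd osd_fast_fail_on_connection_refused
>   # fall back to the normal heartbeat grace handling
>   ceph config set osd osd_fast_fail_on_connection_refused false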
> 
> We have since solved the problem by removing the default route that caused 
> the packets to end up in the wrong network, where they were summarily 
> rejected by a firewall. That is, we made sure that packets would be dropped 
> in the future, not rejected.
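> 
> To illustrate the distinction, a hypothetical firewall rule on the cluster 
> network (the port range and chain are just placeholders for the usual OSD 
> ports):
> 
>   # rejected: peers get ECONNREFUSED immediately, triggering the fast fail
>   iptables -A INPUT -p tcp --dport 6800:7300 -j REJECT --reject-with tcp-reset
>   # dropped: peers simply time out, so the grace/reporter logic applies
>   iptables -A INPUT -p tcp --dport 6800:7300 -j DROP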
> 
> Still, I figured I'd share this experience of ours on this mailing list, as 
> it seems to be something others might encounter as well.
> 
> In the following PR, which introduced osd_fast_fail_on_connection_refused, 
> there’s this description:
> 
>> This changeset adds additional handler (handle_refused()) to the dispatchers
>> and code that detects when connection attempt fails with ECONNREFUSED error
>> (connection refused) which is a clear indication that host is alive, but
>> daemon isn't, so daemons can instantly mark the other side as undoubtly
>> downed without the need for grace timer.
> 
> And this comment:
> 
>> As for flapping, we discussed it on ceph-devel ml
>> and came to conclusion that it requires either broken firewall or network
>> configuration to cause this, and these are more serious issues that should
>> be resolved first before worrying about OSDs flapping (either way, flapping
>> OSDs could be good for getting someone's attention).
> 
> https://github.com/ceph/ceph/pull/8558
> 
> It has left us wondering whether these are the right assumptions. An 
> ECONNREFUSED condition can bring down a whole cluster, and I wonder if there 
> should be some kind of safeguard to ensure that this is avoided. One badly 
> configured host should generally not be able to do that; if the packets are 
> dropped instead of refused, the cluster notices that the OSD down reports come 
> from only one host and acts accordingly.
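> 
> For reference, the reporter-based safeguard that applies in the dropped-packet 
> case is tunable; a quick way to inspect it (option names as in recent 
> releases):
> 
>   # distinct reporters required before the mons mark an OSD down
>   ceph config get mon mon_osd_min_down_reporters
>   # CRUSH subtree level the reporters are grouped by (typically "host")
>   ceph config get mon mon_osd_reporter_subtree_level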
> 
> What do you think? Does this warrant a change in Ceph? I’m happy to provide 
> details and create a ticket.
> 
> Cheers,
> 
> Denis
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
