> On 24 Nov 2023, at 11:49, Burkhard Linke 
> <burkhard.li...@computational.bio.uni-giessen.de> wrote:
> This should not be the case in the reported situation unless setting 
> osd_fast_fail_on_connection_refused=true
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused)
> changes this behaviour.

In our tests it does change the behavior. Normally, the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account 
before marking an OSD down. In our tests, this is the case when an OSD 
heartbeat is dropped and the OSD is still able to talk to the mons.
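
As a rough illustration of that normal path (a toy model, not Ceph's actual 
code, and ignoring the grace-period logic): failure reports are grouped by the 
reporter's subtree at mon_osd_reporter_subtree_level, and the OSD is only 
marked down once at least mon_osd_min_down_reporters distinct subtrees have 
reported it. The function name, the host names, and the threshold of 2 are 
assumptions made up for this sketch.

```python
def enough_reporters(reporter_hosts, min_down_reporters=2):
    """Toy check: count distinct reporter subtrees (here: host names)."""
    # Several reporting OSDs on the same host collapse into one subtree.
    distinct_subtrees = set(reporter_hosts)
    return len(distinct_subtrees) >= min_down_reporters


# Two reporting OSDs on the same host count as a single reporter.
print(enough_reporters(["host-a", "host-a"]))  # False
# Reports from two different hosts reach the threshold.
print(enough_reporters(["host-a", "host-b"]))  # True
```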

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
ceph/src/osd/OSD.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · ceph/ceph

The mon then propagates that failure without taking any other reports into 
account:
ceph/src/mon/OSDMonitor.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · 

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption is 
presumably that if a host can answer the OSD heartbeat with a rejection, only 
the OSD itself is affected.

In our case, however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safeguards it usually does.
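
The difference between the two paths can be sketched as a toy model. This is 
not Ceph's implementation; FailureReport, should_mark_down, the host names, 
and the threshold of 2 are hypothetical and only mirror the behavior described 
in this mail: a report flagged as immediate (connection refused) is acted on 
alone, while ordinary reports must come from enough distinct hosts.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    target_osd: int
    reporter_host: str
    immediate: bool  # True when the heartbeat connection was refused


def should_mark_down(reports, min_down_reporters=2):
    """Toy decision for one target OSD; not Ceph's actual logic."""
    # Fast path: a single immediate report is enough on its own.
    if any(r.immediate for r in reports):
        return True
    # Normal path: require reports from enough distinct hosts.
    return len({r.reporter_host for r in reports}) >= min_down_reporters


# A dropped heartbeat seen from one host only: the threshold is not met.
dropped = [FailureReport(7, "host-a", immediate=False)]
# A rejection (for example sent by a gateway on the path): one report suffices.
refused = [FailureReport(7, "host-a", immediate=True)]

print(should_mark_down(dropped))  # False
print(should_mark_down(refused))  # True
```

In this model the rejection does not have to originate from the target OSD's 
host at all, which is the situation described above.
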
ceph-users mailing list -- ceph-users@ceph.io