Hi, I’m currently playing around with a small Ceph test cluster and trying to understand why a down OSD won’t get marked out under certain conditions. It’s a three-node cluster with three OSDs per node, and mon_osd_down_out_interval is set to 120 seconds. I’m running version 16.2.7, and there are only replicated pools using the default CRUSH rules.
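For reference, this is roughly how I applied and verified the setting (a sketch via the `ceph config` runtime database; the monitor name in the last command is just an example from my setup):

    # Check the current down->out interval (Ceph's default is 600 s):
    ceph config get mon mon_osd_down_out_interval

    # Set it to 120 s for all monitors, as in this test cluster:
    ceph config set mon mon_osd_down_out_interval 120

    # Confirm the value a running monitor actually uses
    # ("mon.ceph-test-01" is an example daemon name, adjust to yours):
    ceph config show mon.ceph-test-01 mon_osd_down_out_interval
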
When I shut down a server, its OSDs are first marked down and then out after two minutes, as expected. But when I then stop another OSD on one of the remaining nodes, that OSD is never marked out. The tree looks like this:

    ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
    -1         0.08817  root default
    -5         0.02939      host ceph-test-01
     2    hdd  0.00980          osd.2            up     1.00000  1.00000
     5    hdd  0.00980          osd.5            up     1.00000  1.00000
     6    hdd  0.00980          osd.6          down     1.00000  1.00000
    -3         0.02939      host ceph-test-02
     0    hdd  0.00980          osd.0          down           0  1.00000
     3    hdd  0.00980          osd.3          down           0  1.00000
     7    hdd  0.00980          osd.7          down           0  1.00000
    -7         0.02939      host ceph-test-03
     1    hdd  0.00980          osd.1            up     1.00000  1.00000
     4    hdd  0.00980          osd.4            up     1.00000  1.00000
     8    hdd  0.00980          osd.8            up     1.00000  1.00000

When I bring ceph-test-02 back up, osd.6 is marked out immediately. I also tried setting mon_osd_min_down_reporters to 1, but that didn’t change anything.

I suspect this is working as intended and I’m just missing something, so I hope somebody can clarify…

Regards,
Julian

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io