Hi,

I'm investigating an issue where 4 or 5 OSDs in a rack aren't marked as down when the network to that rack is cut.

Situation:

- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack

We performed a test where we turned off the Top-of-Rack network for each rack in turn. This worked as expected for two racks, but with the third something odd happened.

Of the 40 OSDs that were supposed to be marked down, only 36 actually were.

In the end it took 15 minutes for all 40 OSDs to be marked as down.

$ ceph config set mon mon_osd_reporter_subtree_level rack

That setting is in place to make sure we only count failure reports coming from OSDs in other racks.
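
For reference, and assuming mon_osd_min_down_reporters was left at its default of 2, reports from two distinct racks should be enough to get an OSD marked down. Both values can be read back from the MONs:

$ ceph config get mon mon_osd_reporter_subtree_level
$ ceph config get mon mon_osd_min_down_reporters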

What we saw in the logs for example:

2020-10-29T03:49:44.409-0400 7fbda185e700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.51 has 54 reporters, 239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since 2020-10-29T03:47:22.374857-0400

But osd.51 was still not marked down, even after 54 reporters had reported that it was actually down.

I checked: no ping or any other traffic to osd.51 was possible. The host was simply unreachable.
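
If I read that grace breakdown right, it is osd_heartbeat_grace (the 20s) plus an adaptive addition based on osd.51's laggy history, which is why it ended up at ~240 seconds instead of 20. Assuming I'm looking at the right place, the laggy history (laggy_probability / laggy_interval) should be visible in the osd_xinfo section of the osdmap dump, and the adaptive part can be switched off to rule it in or out:

$ ceph osd dump -f json-pretty | jq '.osd_xinfo[] | select(.osd == 51)'
$ ceph config set mon mon_osd_adjust_heartbeat_grace false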

Another OSD was marked down, but that took a couple of minutes as well:

2020-10-29T03:50:54.455-0400 7fbda185e700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.37 has 48 reporters, 221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since 2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700 1 mon.CEPH2-MON1-206-U39@0(leader).osd e107102 we have enough reporters to mark osd.37 down

In the end osd.51 was marked down, but only once the MON itself decided to do so via the no-beacon check:

2020-10-29T03:53:44.631-0400 7fbda185e700 0 log_channel(cluster) log [INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1 mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since 2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago. marking down
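
If I'm not mistaken, those ~900 seconds line up with mon_osd_report_timeout, which defaults to 900s, so this looks like the no-beacon fallback kicking in rather than the normal reporter-based path. On the MON host itself this can be confirmed via the admin socket:

$ ceph daemon mon.CEPH2-MON1-206-U39 config get mon_osd_report_timeout
{
    "mon_osd_report_timeout": "900"
}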

I haven't seen this happen before in any cluster. It's also strange that this only happens with this one rack; the other two racks behave fine.

ID     CLASS  WEIGHT      TYPE NAME
  -1          1545.35999  root default
-206           515.12000      rack 206
  -7            27.94499          host CEPH2-206-U16
...
-207           515.12000      rack 207
 -17            27.94499          host CEPH2-207-U16
...
-208           515.12000      rack 208
 -31            27.94499          host CEPH2-208-U16
...

That's what the CRUSH map looks like: straightforward, with 3x replication across the 3 racks.
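
For the record, the replicated rule is the usual chooseleaf-over-rack one. The snippet below is a sketch from memory of what such a rule typically looks like (rule name and id are made up), not a verbatim dump from this cluster:

rule replicated_racks {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}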

This issue only occurs in rack *207*.

Has anybody seen this before, or does anybody know where to start?

Wido