I have had this happen during large data movements.  It stopped happening after
I went to 10Gb networking (from 1Gb).  What I had done in the meantime was inject a
setting at runtime (and adjust the configs to match) to give the OSDs more time
before being marked down:

 

osd heartbeat grace = 200

mon osd down out interval = 900

 

For injecting runtime values/settings, see the "Runtime Changes" section here:

http://docs.ceph.com/docs/luminous/rados/configuration/ceph-conf/
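
For reference, injecting those two values at runtime looks roughly like this
(syntax per the Luminous doc above; I inject the grace into both the OSDs and
the mons, since both sides use it when deciding to mark an OSD down):

ceph tell osd.* injectargs '--osd_heartbeat_grace 200'
ceph tell mon.* injectargs '--osd_heartbeat_grace 200 --mon_osd_down_out_interval 900'

Put the same values in ceph.conf as well if you want them to survive a daemon
restart.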

 

You should probably check the logs before doing anything, though, to make sure
the OSDs or the host itself are not actually failing.
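
A rough sketch of the kind of checks I mean (assumes systemd-managed OSDs;
osd.18 is just an example id taken from the log below):

ceph health detail                               # which OSDs/PGs are affected right now
ceph osd tree                                    # map the down OSDs to their hosts
journalctl -u ceph-osd@18 --since "1 hour ago"   # that OSD's own log, run on its host
dmesg | grep -i error                            # disk/controller errors on that host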

 

-Brent

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
CUZA Frédéric
Sent: Tuesday, July 31, 2018 5:06 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Whole cluster flapping

 

Hi Everyone,

 

I just upgraded our cluster to Luminous 12.2.7 and deleted a rather large
pool that we had (120 TB).

Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD); we
have SSDs for the journals.

 

After I deleted the large pool, my cluster started flapping on all OSDs.

OSDs are marked down and then marked up again, as follows:

 

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
172.29.228.72:6800/95783 boot

2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
degraded, 317 pgs undersized (PG_DEGRADED)

2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)

2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
172.29.228.72:6803/95830 boot

2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
osds down (OSD_DOWN)

2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
degraded, 223 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
172.29.228.246:6812/3144542 boot

2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
osds down (OSD_DOWN)

2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
degraded, 220 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
(root=default,room=xxxx,host=xxxx) (8 reporters from different host after
54.650576 >= grace 54.300663)

2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
osds down (OSD_DOWN)

2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
degraded, 201 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78
slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4
osds down (OSD_DOWN)

2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18
172.29.228.5:6812/14996 boot

2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update:
5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update:
Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update:
Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs
degraded, 201 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11
slow requests are blocked > 32 sec (REQUEST_SLOW)
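
These down/boot transitions can be followed live with something like the
following (osd.97 is just taken from the log above):

ceph -w            # stream the cluster log as it happens
ceph osd stat      # quick count of up/in OSDs
ceph osd find 97   # report which host a flapping OSD lives on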

 

On the OSDs that failed, the logs are full of this kind of message:

2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
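
As far as I can tell, these messages mean the OSD op worker threads are
exceeding osd_op_thread_timeout (default 15 seconds, hence the "timed out
after 15"), i.e. the OSDs look busy or stuck on work rather than unreachable.
A rough sketch of what can be checked on an affected host, plus a temporary
bump if needed (osd.18 is just an example id):

ceph daemon osd.18 dump_ops_in_flight                     # what the op threads are currently working on
ceph daemon osd.18 config get osd_op_thread_timeout       # confirm the current value
ceph tell osd.* injectargs '--osd_op_thread_timeout 60'   # temporary relief while the pool removal churns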

 

At first it looked like a network issue, but we haven't changed anything on the
network and this cluster has been fine for months.

 

I can't figure out what is happening at the moment; any help would be
greatly appreciated!

 

Regards,

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
