Re: [ceph-users] Whole cluster flapping

CUZA Frédéric Tue, 07 Aug 2018 06:48:20 -0700

The pool is already deleted and no longer appears in the stats.

Regards,


From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de 
Souza Lima
Sent: 07 August 2018 15:08
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

Frédéric,

See whether the number of objects in the pool is decreasing with `ceph df [detail]`.
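
A minimal sketch of that check in Python, comparing two snapshots of the per-pool object counts. It assumes the JSON printed by `ceph df detail --format json` has a top-level "pools" list whose entries carry "name" and a "stats" dict with an "objects" field; the function names are illustrative only, so verify the layout against your own cluster's output first.

```python
import json
import subprocess


def pool_object_counts(df_report):
    """Map pool name -> object count from a parsed `ceph df detail` report.

    Assumption: the report has a "pools" list whose entries look like
    {"name": ..., "stats": {"objects": ...}}.
    """
    return {p["name"]: p["stats"]["objects"] for p in df_report.get("pools", [])}


def object_deltas(before, after):
    """Return pool name -> change in object count between two snapshots.

    A pool that is missing from `after` (for example, one that no longer
    shows up in the stats) is reported as None.
    """
    return {
        name: (after[name] - count if name in after else None)
        for name, count in before.items()
    }


def read_df_report():
    """Run the ceph CLI and parse its JSON output (needs a working cluster)."""
    out = subprocess.check_output(["ceph", "df", "detail", "--format", "json"])
    return json.loads(out)
```

Calling `pool_object_counts(read_df_report())` twice, a few minutes apart, and passing both results to `object_deltas` gives a negative number for any pool whose object count is going down.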

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric 
<frederic.c...@sib.fr> wrote:
It’s been over a week now and the whole cluster keeps flapping; it is never the 
same OSDs that go down.
Is there a way to see the progress of this recovery? (The pool that I deleted 
has not been present for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de 
Souza Lima
Sent: 31 July 2018 16:25
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of I/O operations on the disks, and 
the OSD processes might be too busy to respond to heartbeats, so the mons mark 
them as down due to no response.
Also check the OSD logs to see whether they are actually crashing and 
restarting, and check disk I/O usage (e.g. with iostat).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
<frederic.c...@sib.fr> wrote:
Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a fairly large pool 
that we had (120 TB).
Our cluster is made up of 14 nodes, each with 12 OSDs (1 HDD -> 1 OSD); 
we have SSDs for the journals.

After I deleted the large pool, all the OSDs in my cluster started flapping.
OSDs are marked down and then marked up, as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 
172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 
5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs 
degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 
172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 
5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs 
degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 
172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 
5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs 
degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 
5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 
5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs 
degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed 
(root=default,room=xxxx,host=xxxx) (8 reporters from different host after 
54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds 
down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 
5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 
slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds 
down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 
172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 
5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: 
Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: 
Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs 
degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 
slow requests are blocked > 32 sec (REQUEST_SLOW)
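
To see which OSDs flap most often, the "boot" and "failed" events in a log like the one above can be tallied per OSD. The following is a rough sketch, assuming the monitor log is read from the original file with one event per line (the excerpt above is wrapped by the mail client); the regular expressions are modelled only on the lines quoted here, and the function name is illustrative.

```python
import re
from collections import Counter

# Patterns modelled on the quoted monitor log lines, e.g.
#   "... [INF] osd.97 172.29.228.72:6800/95783 boot"
#   "... [INF] osd.18 failed (root=default,...) (8 reporters ...)"
BOOT_RE = re.compile(r"\[INF\] (osd\.\d+) \S+ boot")
FAILED_RE = re.compile(r"\[INF\] (osd\.\d+) failed ")


def count_flaps(lines):
    """Return (boots, failures): per-OSD counts of boot and failed events."""
    boots, failures = Counter(), Counter()
    for line in lines:
        match = BOOT_RE.search(line)
        if match:
            boots[match.group(1)] += 1
            continue
        match = FAILED_RE.search(line)
        if match:
            failures[match.group(1)] += 1
    return boots, failures
```

Passing an open log file to `count_flaps` and printing `boots.most_common(10)` would show whether the flapping is spread evenly or concentrated on a few OSDs or hosts.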

On the OSDs that failed, the logs are full of this kind of message:
2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
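
For a quick summary of messages like these, the timeouts can be counted per thread. This is only a sketch based on the message format quoted above, assuming each message sits on a single line in the real OSD log (the wrapping here comes from the mail client); the function name is illustrative.

```python
import re
from collections import Counter

# Modelled on the quoted OSD log lines, e.g.
#   "... heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700'
#    had timed out after 15"
TIMEOUT_RE = re.compile(
    r"heartbeat_map is_healthy '(\S+) thread (0x[0-9a-f]+)' "
    r"had timed out after (\d+)"
)


def count_timeouts(lines):
    """Count timeout messages per (thread pool, thread id, timeout value)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match.group(1), match.group(2), int(match.group(3)))
            counts[key] += 1
    return counts
```

Running it over each OSD log shows whether a single op thread is stuck for a long time or many different threads are timing out.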

At first it looked like a network issue, but we haven’t changed anything on the 
network and this cluster has been fine for months.

I can’t figure out what is happening at the moment; any help would be greatly 
appreciated!

Regards,
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
