The pool has already been deleted and is no longer present in the stats.

Regards,
From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de Souza Lima
Sent: 07 August 2018 15:08
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

Frédéric,

see if the number of objects is decreasing in the pool with `ceph df [detail]`

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:

It’s been over a week now and the whole cluster keeps flapping; it is never the same OSDs that go down.
Is there a way to get the progress of this recovery? (The pool that I deleted has not been present for a while now.)
In fact, there is a lot of I/O activity on the servers where OSDs go down.

Regards,

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Webert de Souza Lima
Sent: 31 July 2018 16:25
To: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Whole cluster flapping

The pool deletion might have triggered a lot of I/O operations on the disks, and the OSD processes might be too busy to respond to heartbeats, so the mons mark them as down due to no response.
Also check the OSD logs to see if they are actually crashing and restarting, and check disk I/O usage (e.g. with iostat).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric <frederic.c...@sib.fr> wrote:

Hi Everyone,

I just upgraded our cluster to Luminous 12.2.7 and deleted a quite large pool that we had (120 TB).
Our cluster is made of 14 nodes, each composed of 12 OSDs (1 HDD -> 1 OSD); we have SSDs for the journals.
After I deleted the large pool, my cluster started flapping on all OSDs.
OSDs are marked down and then marked up as follows:

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot
2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)
2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot
2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs degraded, 223 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot
2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs degraded, 220 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=xxxx,host=xxxx) (8 reporters from different host after 54.650576 >= grace 54.300663)
2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)
2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)
2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot
2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)
2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)
2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 slow requests are blocked > 32 sec (REQUEST_SLOW)

On the OSDs that failed, the logs are full of this kind of message:

2018-07-31 03:41:28.789681 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.945710 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.946263 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994397 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:28.994443 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023356 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.023415 7f698be6d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102909 7f698ae6b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15
2018-07-31 03:41:29.102917 7f698b66c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

At first it seemed like a network issue, but we haven’t changed anything on the network and this cluster has been okay for months.
I can’t figure out what is happening at the moment; some help would be greatly appreciated!

Regards,
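To check whether it really is never the same OSDs that flap, and whether the degraded object count is shrinking over time, the mon log lines quoted above could be summarized with a small script. This is only a sketch: `summarize`, `SAMPLE`, and the regular expressions are illustrative names made up for this example, not part of Ceph, and they assume the log lines have exactly the shape of the ones pasted in this thread.

```python
import re
from collections import Counter

# Assumption: these patterns match only the mon log line shapes quoted in
# this thread; other Ceph releases may format the lines differently.
BOOT_RE = re.compile(r"\[INF\] (osd\.\d+) \S+ boot$")
FAILED_RE = re.compile(r"\[INF\] (osd\.\d+) failed ")
DEGRADED_RE = re.compile(r"(\d+)/(\d+) objects degraded")


def summarize(lines):
    """Count boot/failed events per OSD and collect degraded-object samples."""
    boots, failures, degraded = Counter(), Counter(), []
    for line in lines:
        line = line.strip()
        m = BOOT_RE.search(line)
        if m:
            boots[m.group(1)] += 1
            continue
        m = FAILED_RE.search(line)
        if m:
            failures[m.group(1)] += 1
            continue
        m = DEGRADED_RE.search(line)
        if m:
            # First 26 characters are the "YYYY-MM-DD HH:MM:SS.ffffff" timestamp.
            degraded.append((line[:26], int(m.group(1)), int(m.group(2))))
    return boots, failures, degraded


# A few lines copied from the excerpt in this thread.
SAMPLE = [
    "2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot",
    "2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)",
    "2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=xxxx,host=xxxx) (8 reporters from different host after 54.650576 >= grace 54.300663)",
    "2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot",
]

boots, failures, degraded = summarize(SAMPLE)
for osd, count in boots.most_common():
    print(f"{osd}: {count} boot(s), {failures[osd]} failure report(s)")
for stamp, bad, total in degraded:
    print(f"{stamp}: {bad}/{total} objects degraded")
```

Fed with the whole mon log instead of `SAMPLE` (for example `summarize(open(path))`, where `path` is wherever the cluster log lives on the monitor host), the per-OSD counters show whether a few OSDs dominate the flapping, and the degraded samples give a rough view of recovery progress alongside `ceph df [detail]`.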
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com