Hello, I have a Ceph cluster that has been running for over 2 years, and the monitors began crashing yesterday. I have had some OSDs flapping up and down occasionally, and sometimes I need to rebuild an OSD. I found 3 OSDs down yesterday; they may or may not be the cause of this issue.
Ceph version: 12.2.12 (upgrading from 12.2.8 did not fix the issue). I have 5 mon nodes. When I start the mon service on the first 2 nodes, they are fine. Once I start the service on the third node, all 3 nodes keep going up and down (flapping) due to an Abort in OSDMonitor::build_incremental. I also tried to recover the monitor from a single node (removing the other 4 nodes from the monmap) by injecting a modified monmap, but that node keeps crashing as well.
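For reference, this is roughly the procedure I used for the single-node recovery attempt (a sketch of the standard monmap-editing steps; it assumes the default mon data directory and that ctlr101 is the surviving mon):

    # stop the mon and extract its current monmap
    systemctl stop ceph-mon@ctlr101
    ceph-mon -i ctlr101 --extract-monmap /tmp/monmap

    # remove the other four monitors from the map and verify
    monmaptool --rm ctlr201 --rm ctlr301 --rm ceph101 --rm ceph201 /tmp/monmap
    monmaptool --print /tmp/monmap

    # inject the edited map and restart
    ceph-mon -i ctlr101 --inject-monmap /tmp/monmap
    systemctl start ceph-mon@ctlr101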
See the crash log from the mon below. The same backtrace is printed twice more further down in the log (once with a timestamp, once in the recent-events dump); I trimmed the repeats.

May 31 02:26:09 ctlr101 systemd[1]: Started Ceph cluster monitor daemon.
May 31 02:26:09 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:09.345533 7fe250321080 -1 compacting monitor store ...
May 31 02:26:11 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:11.320926 7fe250321080 -1 done compacting
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2019-05-31 02:26:16.497933 7fe242925700 -1 log_channel(cluster) log [ERR] : overall HEALTH_ERR 13 osds down; 1 host (6 osds) down; 74266/2566020 objects misplaced
May 31 02:26:16 ctlr101 ceph-mon[2632098]: *** Caught signal (Aborted) **
May 31 02:26:16 ctlr101 ceph-mon[2632098]: in thread 7fe24692d700 thread_name:ms_dispatch
May 31 02:26:16 ctlr101 ceph-mon[2632098]: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 1: (()+0x9e6334) [0x558c5f2fb334]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 2: (()+0x11390) [0x7fe24f6ce390]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 3: (gsignal()+0x38) [0x7fe24dc14428]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 4: (abort()+0x16a) [0x7fe24dc1602a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 5: (OSDMonitor::build_incremental(unsigned int, unsigned int, unsigned long)+0x9c5) [0x558c5ee80455]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 6: (OSDMonitor::send_incremental(unsigned int, MonSession*, bool, boost::intrusive_ptr<MonOpRequest>)+0xcf) [0x558c5ee80b3f]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 7: (OSDMonitor::check_osdmap_sub(Subscription*)+0x22d) [0x558c5ee8622d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 8: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x1082) [0x558c5ecdb0b2]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x9f4) [0x558c5ed05114]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 10: (Monitor::_ms_dispatch(Message*)+0x6db) [0x558c5ed061ab]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 11: (Monitor::ms_dispatch(Message*)+0x23) [0x558c5ed372c3]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 12: (DispatchQueue::entry()+0xf4a) [0x558c5f2a205a]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x558c5f035dcd]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 14: (()+0x76ba) [0x7fe24f6c46ba]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: 15: (clone()+0x6d) [0x7fe24dce641d]
May 31 02:26:16 ctlr101 ceph-mon[2632098]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[... identical backtrace repeated twice, trimmed ...]
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Main process exited, code=killed, status=6/ABRT
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Unit entered failed state.
May 31 02:26:16 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Failed with result 'signal'.
May 31 02:26:26 ctlr101 systemd[1]: ceph-mon@ctlr101.service: Service hold-off time over, scheduling restart.
May 31 02:26:26 ctlr101 systemd[1]: Stopped Ceph cluster monitor daemon.
May 31 02:26:26 ctlr101 systemd[1]: Started Ceph cluster monitor daemon.
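If a more verbose log would help, I can raise the mon debug levels and capture the crash again. This is a sketch of what I would add to ceph.conf on the mon node (standard debug settings; the exact levels are just my guess at what is useful here):

    [mon]
    debug mon = 20
    debug paxos = 20
    debug ms = 1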
For the command ceph -s, most of the time it times out. Sometimes, when I have 3+ mon services up, I can get a result, but the mon services go down again very quickly.

root@ctlr101:~# ceph -s
  cluster:
    id:     53264466-680b-42e6-899d-d042c3a8334a
    health: HEALTH_ERR
            6 osds down
            1 host (6 osds) down
            74266/2566020 objects misplaced (2.894%)
            Reduced data availability: 446 pgs inactive, 440 pgs peering
            Degraded data redundancy: 108173/2566020 objects degraded (4.216%), 142 pgs degraded, 330 pgs undersized
            18600 slow requests are blocked > 32 sec. Implicated osds 8,21,27,29,32,41,63,91,96,98,100
            27371 stuck requests are blocked > 4096 sec. Implicated osds 14,25,26,34,37,46,48,50,51,58,59,60,61,66,67,69,73,74,75,90,95,99
            2/5 mons down, quorum ctlr101,ctlr201,ctlr301

  services:
    mon: 5 daemons, quorum ctlr101,ctlr201,ctlr301, out of quorum: ceph101, ceph201
    mgr: ceph101(active), standbys: ceph301, ctlr201, ctlr301, ceph201, ctlr101
    mds: cephfs-1/1/1 up {0=ceph101=up:active}, 2 up:standby
    osd: 52 osds: 46 up, 52 in; 22 remapped pgs
    rgw: 3 daemons active

  data:
    pools:   20 pools, 2528 pgs
    objects: 855.34k objects, 3.69TiB
    usage:   11.4TiB used, 28.3TiB / 39.7TiB avail
    pgs:     0.237% pgs unknown
             17.445% pgs not active
             108173/2566020 objects degraded (4.216%)
             74266/2566020 objects misplaced (2.894%)
             1667 active+clean
             413  peering
             198  active+undersized
             141  active+undersized+degraded
             60   active+remapped+backfill_wait
             27   remapped+peering
             12   active+clean+remapped
             6    unknown
             2    active+undersized+remapped
             1    active+undersized+degraded+remapped+backfilling
             1    remapped

  io:
    client:  5.65MiB/s rd, 81.1KiB/s wr, 143op/s rd, 43op/s wr

Note: the above io data is stale; the values have not changed for a day.
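When the cluster does not answer, I can still talk to an individual mon, and I give ceph -s a timeout so it does not hang forever. A sketch with my node name (both are standard ceph CLI options, as far as I know):

    # ask a specific monitor and give up after 15 seconds
    ceph -s -m ctlr101 --connect-timeout 15

    # the admin socket on the mon node answers even without quorum
    ceph daemon mon.ctlr101 mon_status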
If anyone can give me some hints on how to keep the mon service running, that would be great. Thanks in advance.

Best Regards,
Li JianYu

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com