Public bug reported:

In a recently juju-deployed Ceph 13.2.4 cluster (part of an OpenStack Rocky deployment) we hit a HEALTH_WARN event that appears to be associated with a short, planned network outage, but which did not clear without human intervention:
  health: HEALTH_WARN
          6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.

We can correlate this back to a known network event, but all OSDs are up and the cluster otherwise looks healthy:

ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       7.64076 root default
-13       0.90970     host happiny
  8   hdd 0.90970         osd.8        up  1.00000 1.00000
 -5       0.90970     host jynx
  9   hdd 0.90970         osd.9        up  1.00000 1.00000
 -3       1.63739     host piplup
  0   hdd 0.81870         osd.0        up  1.00000 1.00000
  3   hdd 0.81870         osd.3        up  1.00000 1.00000
 -9       1.63739     host raichu
  5   hdd 0.81870         osd.5        up  1.00000 1.00000
  6   hdd 0.81870         osd.6        up  1.00000 1.00000
-11       0.90919     host shinx
  7   hdd 0.90919         osd.7        up  1.00000 1.00000
 -7       1.63739     host sliggoo
  1   hdd 0.81870         osd.1        up  1.00000 1.00000
  4   hdd 0.81870         osd.4        up  1.00000 1.00000

ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282823",
            "age": 113953.696205,
            "duration": 113953.696225,
            "type_data": {
                "events": [
                    { "time": "2019-03-07 00:40:43.282823", "event": "initiated" },
                    { "time": "2019-03-07 00:40:43.282823", "event": "header_read" },
                    { "time": "0.000000", "event": "throttled" },
                    { "time": "0.000000", "event": "all_read" },
                    { "time": "0.000000", "event": "dispatched" },
                    { "time": "2019-03-07 00:40:43.283360", "event": "mon:_ms_dispatch" },
                    { "time": "2019-03-07 00:40:43.283360", "event": "mon:dispatch_op" },
                    { "time": "2019-03-07 00:40:43.283360", "event": "psvc:dispatch" },
                    { "time": "2019-03-07 00:40:43.283370", "event": "osdmap:preprocess_query" },
                    { "time": "2019-03-07 00:40:43.283371", "event": "osdmap:preprocess_failure" },
                    { "time": "2019-03-07 00:40:43.283386", "event": "osdmap:prepare_update" },
                    { "time": "2019-03-07 00:40:43.283386", "event": "osdmap:prepare_failure" }
                ],
                "info": {
                    "seq": 48576937,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282997",
            "age": 113953.696032,
            "duration": 113953.696127,
            "type_data": {
                "events": [
                    { "time": "2019-03-07 00:40:43.282997", "event": "initiated" },
                    { "time": "2019-03-07 00:40:43.282997", "event": "header_read" },
                    { "time": "0.000000", "event": "throttled" },
                    { "time": "0.000000", "event": "all_read" },
                    { "time": "0.000000", "event": "dispatched" },
                    { "time": "2019-03-07 00:40:43.284394", "event": "mon:_ms_dispatch" },
                    { "time": "2019-03-07 00:40:43.284395", "event": "mon:dispatch_op" },
                    { "time": "2019-03-07 00:40:43.284395", "event": "psvc:dispatch" },
                    { "time": "2019-03-07 00:40:43.284402", "event": "osdmap:preprocess_query" },
                    { "time": "2019-03-07 00:40:43.284403", "event": "osdmap:preprocess_failure" },
                    { "time": "2019-03-07 00:40:43.284416", "event": "osdmap:prepare_update" },
                    { "time": "2019-03-07 00:40:43.284417", "event": "osdmap:prepare_failure" }
                ],
                "info": {
                    "seq": 48576958,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
            "initiated_at": "2019-03-07 00:41:08.839840",
            "age": 113928.139188,
            "duration": 113928.139359,
            "type_data": {
                "events": [
                    { "time": "2019-03-07 00:41:08.839840", "event": "initiated" },
                    { "time": "2019-03-07 00:41:08.839840", "event": "header_read" },
                    { "time": "0.000000", "event": "throttled" },
                    { "time": "0.000000", "event": "all_read" },
                    { "time": "0.000000", "event": "dispatched" },
                    { "time": "2019-03-07 00:41:08.840040", "event": "mon:_ms_dispatch" },
                    { "time": "2019-03-07 00:41:08.840040", "event": "mon:dispatch_op" },
                    { "time": "2019-03-07 00:41:08.840040", "event": "psvc:dispatch" },
                    { "time": "2019-03-07 00:41:08.840058", "event": "osdmap:preprocess_query" },
                    { "time": "2019-03-07 00:41:08.840060", "event": "osdmap:preprocess_failure" },
                    { "time": "2019-03-07 00:41:08.840080", "event": "osdmap:prepare_update" },
                    { "time": "2019-03-07 00:41:08.840081", "event": "osdmap:prepare_failure" }
                ],
                "info": {
                    "seq": 48578207,
                    "src_is_mon": false,
                    "source": "osd.6 10.48.2.161:6800/499396",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 3
}

This looks remarkably like https://tracker.ceph.com/issues/24531.

I restarted the two affected mons in turn (a sketch of the commands used
is at the end of this report); the cluster returned to HEALTH_OK and the
issue did not recur.

Expected behaviour: ceph health should recover from a temporary network
event without user intervention.

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New

--
https://bugs.launchpad.net/bugs/1819437

Title:
  transient mon<->osd connectivity HEALTH_WARN events don't self clear
  in 13.2.4