There are two edge-cases in 12.2.11 where a worker thread's suicide_grace
value gets dropped:

[0] In the ThreadPool context, ThreadPool::worker() drops suicide_grace while
    waiting on an empty work queue.
[1] In the ShardedThreadPool context, OSD::ShardedOpWQ::_process() drops
    suicide_grace while opportunistically waiting for more work (to prevent
    additional lock contention).
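For background, both paths arm the per-thread heartbeat through
HeartbeatMap::reset_timeout(handle, grace, suicide_grace), where passing 0
disables the corresponding timeout. Below is a condensed sketch of the
v12.2.11 ThreadPool::worker() loop (paraphrased and simplified, not verbatim;
next_work_queue() is a hypothetical stand-in for the real round-robin queue
selection). Note how the queue's own intervals are re-applied before any item
is processed:

  // Condensed sketch of ThreadPool::worker() from v12.2.11 (paraphrased,
  // simplified). next_work_queue() is a hypothetical stand-in for the
  // round-robin work queue selection in the real code.
  while (!_stop) {
    WorkQueue_ *wq = next_work_queue();
    void *item = wq ? wq->_void_dequeue() : nullptr;
    if (item) {
      // Re-arm both timeouts with the queue's configured values before
      // driving any work -- this is the re-assignment referenced above.
      cct->get_heartbeat_map()->reset_timeout(
        hb, wq->timeout_interval, wq->suicide_interval);
      wq->_void_process(item, tp_handle);
      continue;
    }
    // Empty queue: relax the grace to threadpool_default_timeout and
    // disable suicide_grace (0) while waiting for work.
    cct->get_heartbeat_map()->reset_timeout(
      hb, cct->_conf->threadpool_default_timeout, 0);
    _cond.WaitInterval(_lock,
      utime_t(cct->_conf->threadpool_empty_queue_max_wait, 0));
  }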
The ThreadPool context always re-assigns suicide_grace before driving any
work. The ShardedThreadPool context does not follow this pattern: after
delaying to find additional work, the default sharded work queue timeouts are
not re-applied. This oversight exists from Luminous onwards. Mimic and
Nautilus have each reworked the ShardedOpWQ code path, but neither addressed
the problem.

[0] https://github.com/ceph/ceph/blob/v12.2.11/src/common/WorkQueue.cc#L137
[1] https://github.com/ceph/ceph/blob/v12.2.11/src/osd/OSD.cc#L10476

** Description changed:

- Multiple incidents have been seen where ops were blocked for various
- reasons and the suicide_grace timeout was not observed, meaning that the
- OSD failed to suicide as expected.
+ [Impact]
+ The Sharded OpWQ will opportunistically wait for more work when processing
+ an empty queue. While waiting, the heartbeat timeout and suicide_grace
+ values are modified. On Luminous, the `threadpool_default_timeout` grace is
+ left applied and suicide_grace is left disabled. On later releases, both
+ the grace and suicide_grace are left disabled.
+
+ After finding work, the original work queue grace/suicide_grace values are
+ not re-applied. This can result in hung operations that do not trigger an
+ OSD suicide recovery.
+
+ The missing suicide recovery was observed on Luminous 12.2.11. The
+ environment was consistently hitting a known authentication race condition
+ (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting
+ MCEs from a faulty DIMM.
+
+ The auth race condition would stall pg operations. In some cases, the hung
+ ops would persist for hours without suicide recovery.
+
+ [Test Case]
+ - In-Progress -
+ We have not yet landed on a reliable reproducer; the fix is currently being
+ tested by exercising I/O. Since the fix applies to all versions of Ceph,
+ the plan is to let it bake in the latest release before considering a
+ back-port.
+
+ [Regression Potential]
+ This fix improves suicide_grace coverage of the Sharded OpWQ.
+
+ This change is made in a critical code path that drives client I/O. An OSD
+ suicide will trigger a service restart, and repeated restarts (flapping)
+ will adversely impact cluster performance.
+
+ The fix mitigates risk by keeping the applied suicide_grace value
+ consistent with the value applied before entering
+ `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the
+ empty-queue edge-case that drops the suicide_grace timeout. The
+ suicide_grace value is only re-applied when work is found after waiting on
+ an empty queue.
+
+ - In-Progress -
+ The fix will bake upstream on later levels before back-port consideration.
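To make the ShardedOpWQ edge-case and the shape of the fix concrete, here is
a condensed sketch of the empty-queue path in OSD::ShardedOpWQ::_process()
from v12.2.11 (paraphrased and simplified, not verbatim). The final
reset_timeout() call is the re-application described under [Regression
Potential]; everything else approximates the existing flow:

  // Condensed sketch of the empty-queue path in OSD::ShardedOpWQ::_process()
  // (v12.2.11, paraphrased and simplified).
  sdata->sdata_op_ordering_lock.Lock();
  if (sdata->pqueue->empty()) {
    // Relax the grace and disable suicide_grace while opportunistically
    // waiting for more work (avoids additional lock contention).
    osd->cct->get_heartbeat_map()->reset_timeout(
      hb, osd->cct->_conf->threadpool_default_timeout, 0);
    sdata->sdata_lock.Lock();
    sdata->sdata_op_ordering_lock.Unlock();
    sdata->sdata_cond.WaitInterval(sdata->sdata_lock,
      utime_t(osd->cct->_conf->threadpool_empty_queue_max_wait, 0));
    sdata->sdata_lock.Unlock();
    sdata->sdata_op_ordering_lock.Lock();
    if (sdata->pqueue->empty()) {
      sdata->sdata_op_ordering_lock.Unlock();
      return;  // still no work
    }
    // Work arrived while waiting. Unfixed, execution falls through with the
    // relaxed grace still applied (Luminous) and suicide_grace disabled.
    // The fix re-applies the work queue's own intervals at this point,
    // matching the values in force before _process() was entered:
    osd->cct->get_heartbeat_map()->reset_timeout(
      hb, timeout_interval, suicide_interval);
  }
  // ... dequeue and process the item with the normal timeouts in force ...

Here timeout_interval and suicide_interval are assumed to be the
BaseShardedWQ members that ShardedOpWQ inherits; whether the actual fix uses
exactly these names is an assumption on my part.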
--
Title: Sharded OpWQ drops suicide_grace after waiting for work
https://bugs.launchpad.net/bugs/1840348