[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
@dbungert Apologies, got sidetracked and this version has been superseded. Can we rebase this debdiff on the 12.2.13-0ubuntu0.18.04.10 version? -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1840348 Title: Sharded OpWQ drops suicide_grace after waiting for work To manage notifications about this bug go to: https://bugs.launchpad.net/cloud-archive/+bug/1840348/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
Hello Kellen, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and what testing has been performed on the package, and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

** Changed in: ceph (Ubuntu Bionic) Status: In Progress => Fix Committed
** Tags added: verification-needed verification-needed-bionic
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
unsubscribing ubuntu sponsors team, as sponsorship for this is being requested from the openstack team
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
@dbungert - Taking over this one for hillpd. We would like to pursue this backport to Bionic/Queens (dropping Stein). We originally held back on it because there isn't a viable reproducer. Now that the change has been upstream since May 1, 2020, there is confidence that it doesn't introduce a regression.

** Changed in: cloud-archive/stein Status: In Progress => Won't Fix
** Changed in: ceph (Ubuntu Bionic) Assignee: Dan Hill (hillpd) => Kellen Renshaw (krenshaw)
** Changed in: cloud-archive/queens Assignee: Dan Hill (hillpd) => Kellen Renshaw (krenshaw)
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
@hillpd - should this still be in the sponsor queue? Are we still trying to SRU this to Bionic?
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Changed in: cloud-archive/train Importance: Undecided => Medium
** Changed in: cloud-archive/stein Importance: Undecided => Medium
** Changed in: cloud-archive/queens Importance: Undecided => Medium
** Changed in: cloud-archive Importance: Undecided => Medium
** Changed in: cloud-archive/rocky Importance: Undecided => Medium
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Changed in: cloud-archive/rocky Status: Invalid => Won't Fix
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Changed in: cloud-archive/train Status: New => Fix Released
** Changed in: cloud-archive/rocky Status: New => Invalid
** Changed in: cloud-archive/stein Status: New => In Progress
** Changed in: cloud-archive/train Assignee: (unassigned) => Dan Hill (hillpd)
** Changed in: cloud-archive/stein Assignee: (unassigned) => Dan Hill (hillpd)
** Changed in: cloud-archive/rocky Assignee: (unassigned) => Dan Hill (hillpd)
** Changed in: cloud-archive/queens Assignee: (unassigned) => Dan Hill (hillpd)
** Changed in: ceph (Ubuntu Bionic) Status: Confirmed => In Progress
** Changed in: cloud-archive/queens Status: New => In Progress
** Changed in: cloud-archive Status: New => Fix Released
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Also affects: cloud-archive/train Importance: Undecided Status: New
** Also affects: cloud-archive/rocky Importance: Undecided Status: New
** Also affects: cloud-archive/queens Importance: Undecided Status: New
** Also affects: cloud-archive/stein Importance: Undecided Status: New
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Also affects: cloud-archive Importance: Undecided Status: New
** Changed in: cloud-archive Assignee: (unassigned) => Dan Hill (hillpd)
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Changed in: ceph (Ubuntu) Status: Confirmed => Fix Released
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
The SRUs for 15.2.5 [0] and 14.2.11 [1] have been released and contain a fix for this issue. We are currently evaluating the need for a fix in Luminous (Bionic/Queens) and Mimic (Stein).

[0] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1898200
[1] https://bugs.launchpad.net/cloud-archive/+bug/1891077
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Changed in: ceph (Ubuntu Focal) Status: Confirmed => Fix Released
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
The Eoan Ermine has reached end of life, so this bug will not be fixed for that release.

** Changed in: ceph (Ubuntu Eoan) Status: Confirmed => Won't Fix
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Description changed: [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - - In-Progress - - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all version of Ceph, the plan is to let this bake in the latest release before considering a back-port. + I have not identified a reliable reproducer. Currently testing the fix by exercising I/O. + + Recommend letting this bake upstream before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWq. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - - The fix needs to bake upstream on later levels before back-port consideration. 
+ Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
+
+ [0] https://tracker.ceph.com/issues/37778
+ [1] https://tracker.ceph.com/issues/45076
+ [2] https://github.com/ceph/ceph/pull/34575
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Description changed: [Impact] - The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. + The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_default_timeout` grace is left applied and suicide_grace is disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all version of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWq. This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. 
The suicide_grace value is only re-applied when work is found after waiting on an empty queue. - In-Progress - The fix needs to bake upstream on later levels before back-port consideration.
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
The attachment "ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

** Tags added: patch
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Also affects: ceph (Ubuntu Focal) Importance: Medium Assignee: Dan Hill (hillpd) Status: Triaged ** Also affects: ceph (Ubuntu Bionic) Importance: Undecided Status: New ** Also affects: ceph (Ubuntu Eoan) Importance: Undecided Status: New ** Changed in: ceph (Ubuntu Bionic) Status: New => Confirmed ** Changed in: ceph (Ubuntu Bionic) Assignee: (unassigned) => Dan Hill (hillpd) ** Changed in: ceph (Ubuntu Eoan) Assignee: (unassigned) => Dan Hill (hillpd) ** Changed in: ceph (Ubuntu Bionic) Importance: Undecided => Medium ** Changed in: ceph (Ubuntu Eoan) Importance: Undecided => Medium ** Changed in: ceph (Ubuntu Eoan) Status: New => Confirmed ** Changed in: ceph (Ubuntu Focal) Status: Triaged => Confirmed ** Description changed: [Impact] The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled. After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery. The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM. The auth race condition would stall pg operations. In some cases, the hung ops would persist for hours without suicide recovery. [Test Case] - In-Progress - Haven't landed on a reliable reproducer. Currently testing the fix by exercising I/O. Since the fix applies to all version of Ceph, the plan is to let this bake in the latest release before considering a back-port. [Regression Potential] This fix improves suicide_grace coverage of the Sharded OpWq. 
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart and repeated restarts (flapping) will adversely impact cluster performance. The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix is also restricted to the empty queue edge-case that drops the suicide_grace timeout. The suicide_grace value is only re-applied when work is found after waiting on an empty queue.
  - In-Progress -
- The fix will bake upstream on later levels before back-port consideration.
+ The fix needs to bake upstream on later levels before back-port consideration.
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
Attaching the proposed fix for 12.2.13 that I am testing.

** Patch added: "ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1840348/+attachment/5351517/+files/ceph_12.2.13-0ubuntu0.18.04.1+20200409sf00238701b1.debdiff
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
There are two edge-cases in 12.2.11 where a worker thread's suicide_grace value gets dropped:

[0] In the ThreadPool context, ThreadPool::worker() drops suicide_grace while waiting on an empty work queue.
[1] In the ShardedThreadPool context, OSD::ShardedOpWQ::_process() drops suicide_grace while opportunistically waiting for more work (to prevent additional lock contention).

The ThreadPool context always re-assigns suicide_grace before driving any work. The ShardedThreadPool context does not follow this pattern. After delaying to find additional work, the default sharded work queue timeouts are not re-applied.

This oversight exists from Luminous onwards. Mimic and Nautilus have each reworked the ShardedOpWQ code path, but did not address the problem.

[0] https://github.com/ceph/ceph/blob/v12.2.11/src/common/WorkQueue.cc#L137
[1] https://github.com/ceph/ceph/blob/v12.2.11/src/osd/OSD.cc#L10476

** Description changed:

- Multiple incidents have been seen where ops were blocked for various
- reasons and the suicide_grace timeout was not observed, meaning that the
- OSD failed to suicide as expected.
+ [Impact]
+ The Sharded OpWQ will opportunistically wait for more work when processing an
+ empty queue. While waiting, the heartbeat timeout and suicide_grace values are
+ modified. On Luminous, the `threadpool_default_timeout` grace is left applied
+ and suicide_grace is left disabled. On later releases both the grace and
+ suicide_grace are left disabled.
+
+ After finding work, the original work queue grace/suicide_grace values are
+ not re-applied. This can result in hung operations that do not trigger an OSD
+ suicide recovery.
+
+ The missing suicide recovery was observed on Luminous 12.2.11. The environment
+ was consistently hitting a known authentication race condition (issue#37778
+ [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
+ faulty DIMM.
+
+ The auth race condition would stall pg operations. In some cases the hung ops
+ would persist for hours without suicide recovery.
+
+ [Test Case]
+ - In-Progress -
+ Haven't landed on a reliable reproducer. Currently testing the fix by
+ exercising I/O. Since the fix applies to all version of Ceph, the plan is to
+ let this bake in the latest release before considering a back-port.
+
+ [Regression Potential]
+ This fix improves suicide_grace coverage of the Sharded OpWq.
+
+ This change is made in a critical code path that drives client I/O. An OSD
+ suicide will trigger a service restart and repeated restarts (flapping) will
+ adversely impact cluster performance.
+
+ The fix mitigates risk by keeping the applied suicide_grace value consistent
+ with the value applied before entering `OSD::ShardedOpWQ::_process()`. The fix
+ is also restricted to the empty queue edge-case that drops the suicide_grace
+ timeout. The suicide_grace value is only re-applied when work is found after
+ waiting on an empty queue.
+
+ - In-Progress -
+ The fix will bake upstream on later levels before back-port consideration.

** Description changed:

  [Impact]
- The Sharded OpWQ will opportunistically wait for more work when processing an
- empty queue. While waiting, the heartbeat timeout and suicide_grace values are
- modified. On Luminous, the `threadpool_default_timeout` grace is left applied
- and suicide_grace is left disabled. On later releases both the grace and
- suicide_grace are left disabled.
+ The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. On Luminous, the `threadpool_default_timeout` grace is left applied and suicide_grace is left disabled. On later releases both the grace and suicide_grace are left disabled.

- After finding work, the original work queue grace/suicide_grace values are
- not re-applied. This can result in hung operations that do not trigger an OSD
- suicide recovery.
+ After finding work, the original work queue grace/suicide_grace values
+ are not re-applied. This can result in hung operations that do not
+ trigger an OSD suicide recovery.

- The missing suicide recovery was observed on Luminous 12.2.11. The environment
- was consistently hitting a known authentication race condition (issue#37778
- [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a
- faulty DIMM.
+ The missing suicide recovery was observed on Luminous 12.2.11. The
+ environment was consistently hitting a known authentication race
+ condition (issue#37778 [0]) due to repeated OSD service restarts on a
+ node exhibiting MCEs from a faulty DIMM.

- The auth race condition would stall pg operations. In some cases the hung ops
- would persist for hours without suicide recovery.
+ The auth race condition would stall pg operations. In some cases, the
+ hung ops would persist for hours without suicide recovery.

  [Test Case]
  - In-Progress -
- Haven't landed on a reliable reproducer.
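For readers following the comment above, here is a minimal Python sketch of the empty-queue edge case and of the fix as the description words it ("only re-applied when work is found after waiting on an empty queue"). It is a toy model, not Ceph code: `HeartbeatHandle`, `process_shard`, `would_suicide` and the three timeout constants are hypothetical names and illustrative values chosen for this example, and the condition-variable wait is replaced by a simple list refill.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatHandle:
    """Hypothetical stand-in for a worker thread's heartbeat state."""
    grace: float = 0.0          # seconds before the thread is reported unhealthy
    suicide_grace: float = 0.0  # seconds before the daemon aborts; 0 disables it

    def reset_timeout(self, grace, suicide_grace):
        self.grace = grace
        self.suicide_grace = suicide_grace


# Illustrative values only; these are assumptions, not Ceph defaults.
WQ_TIMEOUT = 15.0
WQ_SUICIDE_TIMEOUT = 150.0
THREADPOOL_DEFAULT_TIMEOUT = 60.0


def process_shard(hb, queue, refill, reapply_after_wait):
    """Model one worker pass; return the suicide_grace in effect while the op runs."""
    # Timeouts applied before entering the processing step.
    hb.reset_timeout(WQ_TIMEOUT, WQ_SUICIDE_TIMEOUT)
    if not queue:
        # Empty queue: relax the timeouts and wait briefly for more work.
        hb.reset_timeout(THREADPOOL_DEFAULT_TIMEOUT, 0.0)
        queue.extend(refill)  # stands in for waiting on a condition variable
        if not queue:
            return None       # still nothing to do
        if reapply_after_wait:
            # The fix: restore the work queue timeouts once work is found.
            hb.reset_timeout(WQ_TIMEOUT, WQ_SUICIDE_TIMEOUT)
    queue.pop(0)              # the op that would now be executed
    return hb.suicide_grace


def would_suicide(suicide_grace, stalled_for):
    """A stalled op triggers suicide only if suicide_grace is enabled and exceeded."""
    return suicide_grace > 0 and stalled_for > suicide_grace


buggy = process_shard(HeartbeatHandle(), [], ["op"], reapply_after_wait=False)
fixed = process_shard(HeartbeatHandle(), [], ["op"], reapply_after_wait=True)
print("suicide_grace while running op, without fix:", buggy)
print("suicide_grace while running op, with fix:", fixed)
print("op hung for 1h triggers suicide without fix:", would_suicide(buggy, 3600))
print("op hung for 1h triggers suicide with fix:", would_suicide(fixed, 3600))
```

In this model an op picked up after an empty-queue wait runs with suicide_grace disabled unless the timeouts are re-applied, which matches the reported symptom of ops hanging for hours without suicide recovery. An op taken from a non-empty queue never goes through the relaxed state, so the fix leaves that path unchanged.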
[Bug 1840348] Re: Sharded OpWQ drops suicide_grace after waiting for work
** Summary changed:

- Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace
+ Sharded OpWQ drops suicide_grace after waiting for work