On Mon, Oct 6, 2025 at 4:32 PM Doug Anderson <[email protected]> wrote: > > Hi, > > On Mon, Oct 6, 2025 at 2:30 PM Ian Rogers <[email protected]> wrote: > > > > On Tue, Sep 16, 2025 at 11:14 PM Jinchao Wang <[email protected]> > > wrote: > > > > > > On Tue, Sep 16, 2025 at 10:35:46PM -0700, Namhyung Kim wrote: > > > > Hello, > > > > > > > > On Tue, Sep 16, 2025 at 10:13:12PM -0700, Ian Rogers wrote: > > > > > On Tue, Sep 16, 2025 at 6:47 PM Jinchao Wang > > > > > <[email protected]> wrote: > > > > > > > > > > > > On Tue, Sep 16, 2025 at 05:03:48PM -0700, Ian Rogers wrote: > > > > > > > On Tue, Sep 16, 2025 at 7:51 AM Jinchao Wang > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > Currently, the hard lockup detector is selected at compile time > > > > > > > > via > > > > > > > > Kconfig, which requires a kernel rebuild to switch > > > > > > > > implementations. > > > > > > > > This is inflexible, especially on systems where a perf event > > > > > > > > may not > > > > > > > > be available or may be needed for other tasks. > > > > > > > > > > > > > > > > This commit refactors the hard lockup detector to replace a > > > > > > > > rigid > > > > > > > > compile-time choice with a flexible build-time and boot-time > > > > > > > > solution. > > > > > > > > The patch supports building the kernel with either detector > > > > > > > > independently, or with both. When both are built, a new boot > > > > > > > > parameter > > > > > > > > `hardlockup_detector="perf|buddy"` allows the selection at boot > > > > > > > > time. > > > > > > > > This is a more robust and user-friendly design. > > > > > > > > > > > > > > > > This patch is a follow-up to the discussion on the kernel > > > > > > > > mailing list > > > > > > > > regarding the preference and future of the hard lockup > > > > > > > > detectors. It > > > > > > > > implements a flexible solution that addresses the community's > > > > > > > > need to > > > > > > > > select an appropriate detector at boot time. > > > > > > > > > > > > > > > > The core changes are: > > > > > > > > - The `perf` and `buddy` watchdog implementations are separated > > > > > > > > into > > > > > > > > distinct functions (e.g., `watchdog_perf_hardlockup_enable`). > > > > > > > > - Global function pointers are introduced > > > > > > > > (`watchdog_hardlockup_enable_ptr`) > > > > > > > > to serve as a single API for the entire feature. > > > > > > > > - A new `hardlockup_detector=` boot parameter is added to allow > > > > > > > > the > > > > > > > > user to select the desired detector at boot time. > > > > > > > > - The Kconfig options are simplified by removing the complex > > > > > > > > `HARDLOCKUP_DETECTOR_PREFER_BUDDY` and allowing both > > > > > > > > detectors to be > > > > > > > > built without mutual exclusion. > > > > > > > > - The weak stubs are updated to call the new function pointers, > > > > > > > > centralizing the watchdog logic. > > > > > > > > > > > > > > What is the impact on /proc/sys/kernel/nmi_watchdog ? Is that > > > > > > > enabling and disabling whatever the boot time choice was? I'm not > > > > > > > sure > > > > > > > why this has to be a boot time option given the ability to > > > > > > > configure > > > > > > > via /proc/sys/kernel/nmi_watchdog. > > > > > > The new hardlockup_detector boot parameter and the existing > > > > > > /proc/sys/kernel/nmi_watchdog file serve different purposes. > > > > > > > > > > > > The boot parameter selects the type of hard lockup detector (perf > > > > > > or buddy). > > > > > > This choice is made once at boot. > > > > > > > > > > > > /proc/sys/kernel/nmi_watchdog, on the other hand, is only a simple > > > > > > on/off > > > > > > switch for the currently selected detector. It does not change the > > > > > > detector's > > > > > > type. > > > > > > > > > > So the name "nmi_watchdog" for the buddy watchdog is wrong for fairly > > > > > obvious naming reasons but also because we can't differentiate when a > > > > > perf event has been taken or not - this impacts perf that is choosing > > > > > not to group events in metrics because of it, reducing the metric's > > > > > accuracy. We need an equivalent "buddy_watchdog" file to the > > > > > "nmi_watchdog" file. If we have such a file then if I did "echo 1 > > > > > > /proc/sys/kernel/nmi_watchdog" I'd expect the buddy watchdog to be > > > > > disabled and the perf event one to be enabled. Similarly, if I did > > > > > "echo 1 > /proc/sys/kernel/buddy_watchdog" then I would expect the > > > > > perf event watchdog to be disabled and the buddy one enabled. If I did > > > > > "echo 0 > /proc/sys/kernel/nmi_watchdog; echo 0 > > > > > > /proc/sys/kernel/buddy_watchdog" then I'd expect neither to be > > > > > enabled. I don't see why choosing the type of watchdog implementation > > > > > at boot time is particularly desirable. It seems sensible to default > > > > > normal people to using the buddy watchdog (more perf events, power...) > > > > > and CONFIG_DEBUG_KERNEL type people to using the perf event one. As > > > > > the "nmi_watchdog" file may be assumed to control the buddy watchdog, > > > > > perhaps a compatibility option (where the "nmi_watchdog" file controls > > > > > the buddy watchdog) is needed so that user code has time to migrate. > > > > > > > > Sounds good to me. For perf tools, it'd be great if we can have a run- > > > > time check which watchdog is selected. > > > Considering backward compatibility, I prefer to keep > > > /proc/sys/kernel/nmi_watchdog and introduce a new file called > > > /proc/sys/kernel/hardlockup_detector_type, which only shows the default > > > string > > > or the boot parameter. > > > > Is there code using the buddy watchdog that cares about the > > /proc/sys/kernel/nmi_watchdog file? My assumption is that everything > > except Android is using the NMI watchdog, so a new > > /proc/sys/kernel/buddy_watchdog file doesn't impact them. > > Buddy watchdog has been out there for a few years. At Google, I know > it's used by everything except Android. AKA I believe it is used in > Google's servers and also in ChromeOS. Both of those (presumably) > enable the buddy watchdog via calling it "nmi_watchdog". It's possible > that some Android phones are using the buddy watchdog too but I'm not > aware of it. I don't believe Pixel is using it, though that could > change in the future.
I thought what motivated the buddy watchdog was patches implementing this approach on Android for ARM that lacked an NMI based hard lockup detector? Anyway, while the buddy watchdog is in use by Google servers the nmi_watchdog file has an actively detrimental effect on that. Specifically the nmi_watchdog file having the value of "1" disables certain event groups for certain metrics in the perf tool as it is assumed there are too few performance counters due to the NMI watchdog stealing one. We want groups of events so that events are scheduled together and metrics are more accurate, we don't want groups that fail to schedule because of a lack of counters. > IMO at this point "nmi watchdog" is simply a synonym for the > hardlockup detector. That was what it looked like in the code before I > started messing around and adding the buddy lockup detector and it's > how it is now. While it's unfortunate that there are two names for the > same thing, I don't personally think that should change at this point. > FWIW, even the "buddy" watchdog relies on NMIs to actually get stack > crawls on stuck cores, so NMI still means something even there. I think this is misguided. Currently the only use of nmi_watchdog I'm aware of is by perf where the buddy watchdog's use of it causes issues (as mentioned above). > If we want to tell between the perf detector or the buddy detector we > should add a separate file for it. We could with perf then having to read from two files rather than one. Presumably the lack of presence of one file will be sufficient to also avoid a kernel version check. > > On Android > > writing to /proc/sys/kernel/nmi_watchdog would switch from updating > > the buddy watchdog enable/disable to the NMI watchdog enable/disable, > > but it is a straightforward patch to make anything doing this update > > the buddy_watchdog file instead. > > Straightforward, but you've got to go find everyone that you break by > doing this. That's not something I want responsibility for. If you > want to convince others this is something worthwhile and you've got > someone signed up to deal with the fallout (if any) then I won't > object, but it's not something I'd be in support of. Stuff like this happens, check out this thread: https://lore.kernel.org/lkml/2025020304-chip-trench-4e56@gregkh/ Imo we shouldn't design in using an actively misleading file name and incurring extra overhead in perf. Having two files nmi_watchdog and buddy_watchdog is fine as the latter case currently isn't in mainstream distro use and people shouldn't care. It also maintains and correct's perf's behavior when the buddy and not nmi watchdog is in use. > > If we have to keep "nmi_watchdog" then we should deprecate it and > > create an equivalent file with a better name (ie without NMI in it). > > It'll be moderately annoying in perf to determine whether the NMI > > watchdog is enabled by having to read two files. > > Again, up to you if you want to try to do this, but I'm not really in > support of it. It doesn't seem terribly hard to make a new file that > says which hardlockup detector is in use. If that file exists then > read it. If not then you fallback to what you have today. I don't mind a file that also says what kind of lockup detector is in use. I'd like the meaning of nmi_watchdog to keep meaning the kernel stole a perf counter as this is the behavior long assumed by perf which I think is the majority or only user of the file. I think the buddy watchdog being controlled by this file is less than intention revealing. Thanks, Ian > -Doug
