> Date: Sun, 13 Aug 2023 06:16:51 -0400
> From: Greg Troxel <g...@lexort.com>
>
> Would it be useful for heartbeat to have a just-log-don't-panic option?

Worth considering, but...

> It feels like are in a state where we know there is a problem somewhere,
> and we don't know if it is in heartbeat, the kernel, or hardware.

...in this case it is already clear that under heavy disk I/O,
something is either holding onto a spin lock or starving softints and
threads at priority below softbio for much too long.

Holding a spin lock -- or otherwise running at raised IPL -- for 5sec
is already enough to violate the contract of the timecounter at
hz=100, which can lead to monotonic time going backwards, which breaks
all kinds of things but maybe only in subtle ways that are extremely
hard to diagnose retrospectively.

I thought all the uvm aiodone business was supposed to be deferred to
workqueue context (which would not hold up heartbeats), but it looks
like we have a path

	(softbio) biodone
	  -> biodone2
	  -> uvm_aio_aiodone
	  -> uvm_pagermapout
	  -> vm_map_lock
	  -> cv_wait

which is forbidden in softint context (and should really trip a
KASSERT).  This might not be the problem but it's evidence that the
code path is on shaky grounds.

> I would not want to run a watchdog that reboots the system unless the FP
> rate is well under once per year, and really under 0.2/year.  Having
> this logged instead of panicing would make it more comfortable to turn
> on.  Probably it should be default to not panic, if this turns into
> enough reports that it seems to have significantly non-zero probability.

So far the only reports I've seen have been true alarms about
something being broken.

Most of the problems that this will catch would otherwise manifest as
`NetBSD stopped responding and I wasn't able to get a core dump'
(leading to useless undiagnosable PRs), not as `huh, I saw this weird
detailed log message', so the diagnostic value of the heartbeat panic
in those circumstances is very high.

Note that a hardware watchdog timer is a little bit different: it will
usually just reset the machine, giving no opportunity for diagnostics
like a crash dump.

> (Presumably atf runs on real hw survive HEARTBEAT though, so whatever is
> happening seems low probability to start with.)

Right.  My guess is that this may be related to problems that we've
been trying to diagnose regarding extreme delays at shutdown after
heavy disk I/O, which we need more information to figure out.
Possibly related to the yamt-pagecache merge, possibly related to the
zfs pagedaemon thrashing.
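
For concreteness, a just-log-don't-panic knob of the kind asked about
above could look roughly like the sketch below.  This is not the
NetBSD HEARTBEAT code: hb_state, hb_check, hb_action, and
panic_on_miss are hypothetical names made up for illustration, and
time is modelled as a plain tick counter supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical outcome of one check; not an existing kernel interface. */
enum hb_action { HB_OK, HB_LOG, HB_PANIC };

/* Hypothetical per-CPU heartbeat bookkeeping. */
struct hb_state {
	uint64_t last_beat_ticks;	/* when the CPU last checked in */
	uint64_t max_period_ticks;	/* allowed silence before complaining */
	bool panic_on_miss;		/* false: just log, don't panic */
};

/*
 * Decide what to do at time now_ticks.  A real kernel would call
 * panic() in the HB_PANIC case so that a crash dump can be taken;
 * here we only report the decision to the caller.
 */
static enum hb_action
hb_check(const struct hb_state *hb, uint64_t now_ticks)
{
	uint64_t silent;

	assert(hb->max_period_ticks > 0);
	assert(now_ticks >= hb->last_beat_ticks);

	silent = now_ticks - hb->last_beat_ticks;
	if (silent <= hb->max_period_ticks)
		return HB_OK;

	/* Both policies log; only the follow-up action differs. */
	fprintf(stderr, "heartbeat: no progress for %llu ticks\n",
	    (unsigned long long)silent);
	return hb->panic_on_miss ? HB_PANIC : HB_LOG;
}
```

With panic_on_miss false, a missed heartbeat only leaves the log
line; with it true, the caller gets HB_PANIC, which is the case where
a crash dump is available for diagnosis.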