FYI: In 10.99.5, I just added a new kernel diagnostic subsystem, heartbeat(9), that makes the system crash rather than hang when CPUs are stuck in certain ways that hardware watchdog timers can't detect, or on systems that have no hardware watchdog timer at all.
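To make the idea concrete, here is a toy user-space model of the check, written as a POSIX sh function. It is not how heartbeat(9) is implemented. It only assumes that each CPU's progress can be summarized as "seconds since it last made progress", and it mirrors the kern.heartbeat.max_period knob described below, where 0 disables the check and offline CPUs are ignored. The function name and its arguments are hypothetical.

```sh
#!/bin/sh
# Toy user-space model of the progress check; NOT the heartbeat(9) code.
# stalled_cpu is a hypothetical name used only for illustration.
#
# Usage: stalled_cpu MAX_PERIOD AGE_CPU0 AGE_CPU1 ...
#   MAX_PERIOD  maximum seconds allowed without progress (0 = disabled)
#   AGE_CPUn    seconds since CPU n last made progress, or "off" if
#               CPU n is offline (offline CPUs are not checked)
# Prints the number of the first stalled CPU and returns 1 (the point
# where the kernel would panic); otherwise prints nothing and returns 0.
stalled_cpu() {
	max_period=$1
	shift
	# A max_period of 0 disables the check entirely.
	if [ "$max_period" -eq 0 ]; then
		return 0
	fi
	cpu=0
	for age in "$@"; do
		# Skip offline CPUs; flag any online CPU that is too far behind.
		if [ "$age" != off ] && [ "$age" -gt "$max_period" ]; then
			echo "$cpu"
			return 1
		fi
		cpu=$((cpu + 1))
	done
	return 0
}
```

For example, `stalled_cpu 15 2 30' prints 1 (CPU 1 has gone 30 seconds without progress), while `stalled_cpu 15 2 off' and `stalled_cpu 0 2 30' both succeed silently.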
It's optional for now, but it's small and I'd like to make it mandatory in the future. If you'd like to try it out, add the following two lines to your kernel config:

options HEARTBEAT
options HEARTBEAT_MAX_PERIOD_DEFAULT=15

You can disable it at runtime with `sysctl -w kern.heartbeat.max_period=0', or use that knob to change the maximum period before the system will crash if not all (online) CPUs have made progress.

Here are some manual tests that you can use to exercise it. They are manual rather than automatic because some of them deliberately crash the kernel to make sure the diagnostic works, and the others will also crash the kernel if the diagnostic is broken.

Notes:

- The magic numbers for debug.crashme.spl_spinout in the tests below are for evbarm (IPL_SOFTCLOCK=1, IPL_VM=5, IPL_SCHED=6). On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1. For other architectures, consult the source for the numbers to use.
- If you're on a single-CPU system, skip the cpuctl offline/online tests and just do (4) and (5).
- If you're on a system with more than two CPUs, then for the cpuctl offline/online tests, try offlining all CPUs but one at a time.

1. cpuctl offline 0
   sleep 20
   cpuctl online 0

2. cpuctl offline 1
   sleep 20
   cpuctl online 1

3. cpuctl offline 0
   sysctl -w kern.heartbeat.max_period=5
   sleep 10
   sysctl -w kern.heartbeat.max_period=0
   sleep 10
   sysctl -w kern.heartbeat.max_period=15
   sleep 20
   cpuctl online 0

4. sysctl -w debug.crashme_enable=1
   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
   # verify system panics after 15sec

5. sysctl -w debug.crashme_enable=1
   sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
   # verify system panics after 15sec

6. cpuctl offline 0
   sysctl -w debug.crashme_enable=1
   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
   # verify system panics after 15sec

7. cpuctl offline 0
   sysctl -w debug.crashme_enable=1
   sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
   # verify system panics after 15sec
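The per-architecture IPL numbers for debug.crashme.spl_spinout are easy to mix up between evbarm and x86, so here is a small lookup helper built only from the numbers given in the notes. It is a hypothetical convenience function, not something shipped with NetBSD; it covers only evbarm and x86 (amd64/i386), and for any other machine it fails rather than guessing, since those numbers have to come from the source.

```sh
#!/bin/sh
# Hypothetical helper (not part of NetBSD): print the value to pass to
# debug.crashme.spl_spinout for a given machine and IPL name, using the
# numbers from the notes.  Only evbarm and x86 are covered.
#
# Usage: spl_spinout_ipl MACHINE IPL
#   MACHINE  evbarm, or amd64/i386 for x86 (e.g. the output of `uname -m`)
#   IPL      IPL_SOFTCLOCK, IPL_VM, or IPL_SCHED
spl_spinout_ipl() {
	case "$1:$2" in
	evbarm:IPL_SOFTCLOCK|amd64:IPL_SOFTCLOCK|i386:IPL_SOFTCLOCK)
		echo 1 ;;
	evbarm:IPL_VM)
		echo 5 ;;
	evbarm:IPL_SCHED)
		echo 6 ;;
	amd64:IPL_VM|i386:IPL_VM)
		echo 6 ;;
	amd64:IPL_SCHED|i386:IPL_SCHED)
		echo 7 ;;
	*)
		# Other architectures: consult the source instead of guessing.
		echo "spl_spinout_ipl: unknown machine/IPL: $1 $2" >&2
		return 1 ;;
	esac
}
```

For example, `spl_spinout_ipl amd64 IPL_SCHED' prints 7, the x86 value to use in place of the 6 in test (5).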