FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
heartbeat(9) that will make the system crash rather than hang when
CPUs are stuck in certain ways that hardware watchdog timers can't
detect (or on systems without hardware watchdog timers).

It's optional for now, but it's small and I'd like to make it
mandatory in the future.  If you'd like to try it out, add the
following two lines to your kernel config:

options         HEARTBEAT
options         HEARTBEAT_MAX_PERIOD_DEFAULT=15

You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
runtime, or use that knob to change the maximum period before the
system will crash if not all (online) CPUs have made progress.
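
As a rough sketch of scripting that knob, the following sh function
sets the maximum period temporarily, runs a command, and then restores
the previous value.  The function name with_heartbeat_period and the
SYSCTL override variable are made up for illustration; only the
kern.heartbeat.max_period node comes from the text above.

```sh
# Hypothetical helper, not part of NetBSD: run a command with a
# temporary kern.heartbeat.max_period, then restore the old value.
# SYSCTL may be pointed at a stub for dry runs (an assumption of
# this sketch, not an existing convention).
: "${SYSCTL:=sysctl}"

with_heartbeat_period() {
    period=$1; shift
    # Remember the current setting so it can be restored afterwards.
    old=$($SYSCTL -n kern.heartbeat.max_period) || return 1
    $SYSCTL -w kern.heartbeat.max_period="$period" >/dev/null || return 1
    # Run the command, keeping its exit status without aborting early.
    status=0
    "$@" || status=$?
    $SYSCTL -w kern.heartbeat.max_period="$old" >/dev/null
    return $status
}
```

For example, `with_heartbeat_period 0 sleep 60' would disable the check
for one minute and then put the old period back.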


Here are some manual tests that you can use to exercise it.  They are
manual rather than automatic because some deliberately crash the
kernel to make sure the diagnostic works, and the others will also
crash the kernel if heartbeat(9) is broken.

Notes:
- The magic numbers for debug.crashme.spl_spinout are for evbarm.
  On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
  For other architectures, consult the source for the numbers to use.
- If you're on a single-CPU system, skip the cpuctl offline/online
  tests and just do (4) and (5).
- If you're on a >2-CPU system, then for the cpuctl offline/online
  tests, try offlining all CPUs but one, so that only one CPU is
  online at a time.
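
For systems with more CPUs, the offline/online cycles could be
generated rather than typed by hand.  The sketch below assumes that
"all CPUs but one" means leaving a single CPU online in each round,
which is my reading of the note and not something the note spells
out.  The function name is hypothetical, and it only prints the
commands instead of running them.

```sh
# Hypothetical helper: for each CPU index 0..ncpu-1, print commands
# that take every *other* CPU offline, wait, and bring them back.
# Nothing is executed here; the output is plain shell commands.
print_all_but_one_cycles() {
    ncpu=$1
    secs=${2:-20}          # sleep time per round, 20s as in tests 1-2
    keep=0
    while [ "$keep" -lt "$ncpu" ]; do
        i=0
        while [ "$i" -lt "$ncpu" ]; do
            [ "$i" -eq "$keep" ] || echo "cpuctl offline $i"
            i=$((i + 1))
        done
        echo "sleep $secs"
        i=0
        while [ "$i" -lt "$ncpu" ]; do
            [ "$i" -eq "$keep" ] || echo "cpuctl online $i"
            i=$((i + 1))
        done
        keep=$((keep + 1))
    done
}
```

With two CPUs this prints the same commands as tests (1) and (2)
below, in the opposite order.  To run them for real, something like
`print_all_but_one_cycles "$(sysctl -n hw.ncpu)" | sh -x' as root
should do.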

1.      cpuctl offline 0
        sleep 20
        cpuctl online 0

2.      cpuctl offline 1
        sleep 20
        cpuctl online 1

3.      cpuctl offline 0
        sysctl -w kern.heartbeat.max_period=5
        sleep 10
        sysctl -w kern.heartbeat.max_period=0
        sleep 10
        sysctl -w kern.heartbeat.max_period=15
        sleep 20
        cpuctl online 0

4.      sysctl -w debug.crashme_enable=1
        sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
        # verify system panics after 15sec

5.      sysctl -w debug.crashme_enable=1
        sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
        # verify system panics after 15sec

6.      cpuctl offline 0
        sysctl -w debug.crashme_enable=1
        sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
        # verify system panics after 15sec

7.      cpuctl offline 0
        sysctl -w debug.crashme_enable=1
        sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
        # verify system panics after 15sec
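
Since the spl_spinout values in tests 4-7 are the evbarm numbers, a
small lookup can help avoid typing the wrong one on x86.  This is only
a sketch: the function name is hypothetical, the table covers just the
values quoted in this message, and the "evbarm" and "x86" labels are
the ones used in the notes above, not uname(1) output.

```sh
# Hypothetical helper: map (architecture, IPL name) to the number
# to pass to debug.crashme.spl_spinout.  The evbarm values are the
# ones used in tests 4-7; the x86 values are from the notes.
ipl_number() {
    case "$1:$2" in
    evbarm:IPL_SOFTCLOCK|x86:IPL_SOFTCLOCK) echo 1 ;;
    evbarm:IPL_VM)    echo 5 ;;
    evbarm:IPL_SCHED) echo 6 ;;
    x86:IPL_VM)       echo 6 ;;
    x86:IPL_SCHED)    echo 7 ;;
    *)  # Anything else: consult the source, as the notes say.
        echo "ipl_number: no value known for $1 $2" >&2
        return 1 ;;
    esac
}
```

Test (5) on x86 would then read
`sysctl -w debug.crashme.spl_spinout="$(ipl_number x86 IPL_SCHED)"'.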
