> Date: Sat, 12 Aug 2023 17:28:27 +0200
> From: Thomas Klausner <w...@netbsd.org>
>
> I just got a new panic in 10.99.7 after running a pbulk for less than
> a day (after updating from 10.99.5, which was stable for weeks).
> ...
> vpanic() at netbsd:vpanic+0x173
> panic() at netbsd:panic+0x3c
> defibrillate() at netbsd:defibrillate+0xe3
> hardclock() at netbsd:hardclock+0x8b
> Xresume_lapic_ltimer() at netbsd:Xresume_lapic_ltimer+0x1e
> --- interrupt ---
> pmap_tlb_shootnow() at netbsd:pmap_tlb_shootnow+0x1f7
> ...

This panic means that one CPU has detected that another CPU has failed
to run either the hardclock interrupt handler or the SOFTINT_CLOCK
softints in over 15 seconds, and triggered an interprocessor interrupt
in an attempt to panic rather than stay stuck where it appears to be
stuck -- here, pmap_tlb_shootnow.

Normally the hardclock interrupt handler runs every 10ms (or 1/hz sec;
default hz=100), and softints run reasonably promptly, so failing to
do this for 15 sec is extremely unusual and likely indicates that a CPU
is wedged and unable to make progress. For example, something may be
stuck in an infinite loop with a spin lock held or spl raised, which
blocks interrupts.

(The HEARTBEAT option, this system where CPUs check one another for
progress, is new as of last month. The problems it uncovers would
likely have manifested as a silent, unresponsive hang before.)

1. Did you notice anything sluggish before the crash?

2. Can you start another bulk build and run the following dtrace
   script for a while and share the final output?

        dtrace -x cleanrate=50hz -n '
          fbt::pmap_tlb_shootnow:entry,
          fbt::uvm_pagermapout:entry
          {
            self->starttime[probefunc] = timestamp
          }
          fbt::pmap_tlb_shootnow:return,
          fbt::uvm_pagermapout:return
          /self->starttime[probefunc]/
          {
            @[probefunc] = quantize(timestamp - self->starttime[probefunc]);
            self->starttime[probefunc] = 0
          }
          tick-60s
          {
            printa(@)
          }
        '

   You may need to modload dtrace_fbt and dtrace_profile first.

   The tick-60s probe will print the current state of data collection
   once a minute, showing a histogram of the time spent in the
   functions pmap_tlb_shootnow and uvm_pagermapout.

   If it says something like

        dtrace: 429 dynamic variable drops with non-empty dirty list

   then just hit ^C and save the last output.

> Sorry, no crash dump available.

3. Do you just not have a dump device, or are crash dumps broken
   altogether? Can you test with sysctl debug.crashme?

        sysctl -w debug.crashme_enable=1
        sysctl -w debug.crashme.panic=1
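
P.S. In case it helps to picture the heartbeat check described above,
here is a toy model of the idea in Python. It is only a sketch under
stated assumptions, not the kernel code: every name in it (Cpu,
check_peer, HZ, MAX_PERIOD_SEC, ticks_until_detection) is made up for
illustration. It models only a hardclock counter and ignores the
SOFTINT_CLOCK side. The only numbers taken from the text above are
hz=100 and the 15-second limit.

```python
# Toy model of the heartbeat idea; NOT the NetBSD implementation.
# All names below are hypothetical and exist only for illustration.

HZ = 100                # hardclock ticks per second (default hz=100)
MAX_PERIOD_SEC = 15     # how long a peer may go without making progress


class Cpu:
    def __init__(self, name):
        self.name = name
        self.heartbeat_count = 0   # bumped each time hardclock runs here
        self.last_seen_count = 0   # value the checker saw on its last look
        self.stalled_ticks = 0     # consecutive checks with no progress

    def hardclock(self):
        """Model the periodic timer interrupt running on this CPU."""
        self.heartbeat_count += 1


def check_peer(peer):
    """Run once per tick by another CPU; True means the peer looks wedged."""
    if peer.heartbeat_count != peer.last_seen_count:
        # The peer made progress since the last check: reset the stall count.
        peer.last_seen_count = peer.heartbeat_count
        peer.stalled_ticks = 0
        return False
    peer.stalled_ticks += 1
    return peer.stalled_ticks >= HZ * MAX_PERIOD_SEC


def ticks_until_detection(healthy_ticks):
    """Peer runs normally for healthy_ticks, then wedges.

    Returns the tick number at which the checking CPU flags it.
    """
    peer = Cpu("cpu1")
    tick = 0
    while True:
        tick += 1
        if tick <= healthy_ticks:
            peer.hardclock()
        if check_peer(peer):
            return tick
```

With hz=100 and a 15-second limit, a peer in this model is flagged
after 1500 consecutive ticks without progress, so
ticks_until_detection(200) returns 1700. A peer that keeps running its
hardclock is never flagged.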