On Sat, Aug 12, 2023 at 04:03:59PM +0000, Taylor R Campbell wrote: > This panic means that one CPU has detected that another CPU has failed > to run either the hardclock interrupt handler or the SOFTINT_CLOCK > softints in over 15 seconds, and triggered an interprocessor interrupt > in an attempt to panic rather than stay stuck where it appears to be > stuck -- here, pmap_tlb_shootnow. > > Normally the hardclock interrupt handler runs every 10ms (or 1/hz sec; > default hz=100), and softints run reasonably promptly, so failing to > do this for 15 sec is extremely unusual and likely indicates a CPU is > wedged and unable to make progress. For example, something may be > stuck in an infinite loop with a spin lock held or spl raised, which > blocks interrupts. > > (The HEARTBEAT option, this system where CPUs check one another for > progress, is new as of last month. The problems it uncovers would > likely have manifested as silent unresponsive hang before.) > > 1. Did you notice anything sluggish before the crash?
I was active on the machine in a remote terminal and I didn't notice anything in particular. The machine has 32 (virtual) cores so I probably wouldn't notice one stuck - except that I would notice if pbulk never finished because one was stuck. I know that this machine is sometimes sluggish when all three parallel pbulk clients want to interact with the disk (e.g. libreoffice and rust unpacking at the same time, or something similar). > 2. Can you start another bulk build and run the following dtrace > script for a while and share the final output? > > dtrace -x cleanrate=50hz -n ' > fbt::pmap_tlb_shootnow:entry, > fbt::uvm_pagermapout:entry { > self->starttime[probefunc] = timestamp > } > fbt::pmap_tlb_shootnow:return, > fbt::uvm_pagermapout:return /self->starttime[probefunc]/ { > @[probefunc] = quantize(timestamp - > self->starttime[probefunc]); > self->starttime[probefunc] = 0 > } > tick-60s { > printa(@) > } > ' > > You may need to modload dtrace_fbt and dtrace_profile first. The > tick-60s probe will print the current state of data collection once a > minute, showing a histogram of the time spent in the functions > pmap_tlb_shootnow and uvm_pagermapout. > > If it says something like > > dtrace: 429 dynamic variable drops with non-empty dirty list > > then just hit ^C and save the last output. I left it running for a bit and this is what it said: # ./dtrace-script-20230812.sh dtrace: description ' fbt::pmap_tlb_shootnow:entry, fbt::uvm_pagermapout:entry ' matched 5 probes dtrace: aggregation size lowered to 2m dtrace: 312026 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 22746 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 255016 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 341982 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 273456 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 11589 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 303313 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 312509 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 345693 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 16682 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 333801 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 327437 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 412853 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 539988 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 471254 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 501274 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 475914 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 501722 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 400591 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 370924 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 395296 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 276777 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 255151 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 300495 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 263274 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 390073 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 316773 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 282110 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 371249 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 40912 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 289925 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 267312 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 293607 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 259538 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 290060 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 313933 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 10325 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 274132 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 297795 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 217786 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 272377 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 290762 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 427197 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 470297 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 434802 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 462612 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 475668 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 412965 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 461865 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 376524 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 26804 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 425786 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 396083 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 396933 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 399609 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 379957 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 469777 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 406434 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 416873 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 432938 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 463341 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 449017 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 501371 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 442401 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 459855 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 422771 dynamic variable drops with non-empty dirty list : Operation timed out CPU ID FUNCTION:NAME 6 58273 :tick-60s uvm_pagermapout value ------------- Distribution ------------- count 1024 | 0 2048 | 1 4096 |@@@@@@@@@@@@ 2841 8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 5659 16384 |@@@ 750 32768 | 48 65536 | 9 131072 | 0 pmap_tlb_shootnow value ------------- Distribution ------------- count 512 | 0 1024 |@@@@@@@@@@@@@@@@@@@@@@ 3770128 2048 |@@@@@@@@@@@@@@@@ 2846514 4096 |@@ 316604 8192 | 16533 16384 | 2766 32768 | 621 65536 | 261 131072 | 26 262144 | 1 524288 | 14 1048576 | 3 2097152 | 0 dtrace: 466696 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 452047 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 415726 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 433259 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 710584 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 600722 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 696242 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 751837 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 438584 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 420030 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 418731 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 380829 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 16487 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 370819 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 404504 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 450560 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 446036 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 376850 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 376100 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 384559 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 352970 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 11701 dynamic variable drops with non-empty rinsing list : Operation timed out dtrace: 361743 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 389721 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 322743 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 377573 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 386278 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 371971 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 378625 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 378767 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 368735 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 369729 dynamic variable drops with non-empty dirty list : Operation timed out dtrace: 432366 dynamic variable drops with non-empty dirty list : Operation timed out ^C dtrace: 518343 dynamic variable drops with non-empty dirty list : Operation timed out > > Sorry, no crash dump available. > > 3. Do you just not have a dump device, or are crash dumps broken > altogether? Can you test with sysctl debug.crashme? (sysctl -w > debug.crashme_enable=1, sysctl -w debug.crashme.panic=1) I tried 'reboot 0x104' (or 0x108, I'm not sure any longer) and it started the dump but didn't start counting down, so I reset the machine. Thomas