> On 9. Nov 2023, at 19:00, Christian Theune <c...@flyingcircus.io> wrote:
>
> And so, looking around I find:
>
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
> # CONFIG_SOFTLOCKUP_DETECTOR is not set
> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
> # CONFIG_HARDLOCKUP_DETECTOR is not set
> CONFIG_TEST_LOCKUP=m
>
> Reading the kernel docs about this seems like this might be an oversight from
> our side and we might be experiencing lockups that do not result in panics
> (which in turn thus won’t show up in the SEL).
>
> I guess we’ve found some more homework we can do on our side to get better
> visibility.
>
> AFAICT I can easily trigger an NMI from ipmi and then verify that this causes
> proper SEL entries …
>
> (In any case, the NMI lockup isn’t 100% convincing as it completely stopped
> happening since we attached the SOLs …)

So. I’ve been going through our kernel config to enable those options, and
through the BMC NMI handling you mentioned, and found something awkward: it
appears that our fleet is loading different watchdog drivers.

Newer machines seem to enable sp5100_tco:

    watchdog               24576  1 sp5100_tco
    ipmi_watchdog          32768  1

Whereas older machines seem to enable iTCO_wdt:

    watchdog               24576  2 iTCO_wdt
    ipmi_watchdog          32768  0

    watchdog               24576  1 iTCO_wdt
    ipmi_watchdog          32768  1

Interestingly, the BMC still accounts for this, so there seems to be some
integration going on (this is from a machine with sp5100_tco):

    Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x10
    Initial Countdown:      60 sec
    Present Countdown:      47 sec

So I guess we slipped here. My intention was that we always use the IPMI
watchdog, and it appears that this has drifted without anyone noticing. Maybe
this also explains the silence … ?

I’ll get this straightened out and combine that with the SOFTLOCKUP and
HARDLOCKUP panic settings; maybe this will improve visibility into why things
are crashing … -_-

Christian

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
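
P.S. To catch this kind of drift across the fleet, something along these lines might do. This is only a rough sketch, not what we run today: it reads lsmod-style output like the snippets above and reports whether ipmi_watchdog is actually in use and which other watchdog drivers show up. The function names are made up for illustration, the list of "other" drivers is just the two seen above, and treating a use count above zero as "something has the watchdog device open" is my assumption.

```python
# Hypothetical helpers for illustration; driver names are the two from this mail.
OTHER_WATCHDOG_DRIVERS = {"sp5100_tco", "iTCO_wdt"}
IPMI_WATCHDOG = "ipmi_watchdog"


def parse_lsmod(text):
    """Map module name -> (use count, list of users) from lsmod-style lines."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header line.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[2]), [u for u in users if u])
    return modules


def watchdog_report(text):
    """Summarise which watchdog drivers appear in lsmod-style output."""
    modules = parse_lsmod(text)
    # A driver counts as present if it is listed as a module or as a user
    # of another module (the snippets above only show it as a user).
    seen = set(modules)
    for _count, users in modules.values():
        seen.update(users)
    ipmi_count = modules.get(IPMI_WATCHDOG, (0, []))[0]
    return {
        # Assumption: a use count > 0 means the IPMI watchdog is held open.
        "ipmi_watchdog_in_use": ipmi_count > 0,
        "other_watchdog_drivers": sorted(seen & OTHER_WATCHDOG_DRIVERS),
    }
```

Feeding it the output of `lsmod` from each host and flagging any machine where `other_watchdog_drivers` is non-empty or `ipmi_watchdog_in_use` is false should be enough to spot the hosts that deviate.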