> On 9. Nov 2023, at 19:00, Christian Theune <c...@flyingcircus.io> wrote:
> 
> And so, looking around I find:
> 
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
> # CONFIG_SOFTLOCKUP_DETECTOR is not set
> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
> # CONFIG_HARDLOCKUP_DETECTOR is not set
> CONFIG_TEST_LOCKUP=m
> 
> Reading the kernel docs about this seems like this might be an oversight from 
> our side and we might be experiencing lockups that do not result in panics 
> (which in turn thus won’t show up in the SEL).
> 
> I guess we’ve found some more homework we can do on our side to get better 
> visibility.
> 
> AFAICT I can easily trigger an NMI from ipmi and then verify that this causes 
> proper SEL entries … 
> 
> (In any case, the NMI lockup isn’t 100% convincing as it completely stopped 
> happening since we attached the SOLs …)

So, I’ve been going through our kernel config to enable those options, and 
through the NMI handling on the BMC that you mentioned, and found something 
awkward: our fleet does not load the same watchdog driver everywhere.

Newer machines seem to load sp5100_tco:

watchdog               24576  1 sp5100_tco
ipmi_watchdog          32768  1

Older machines, on the other hand, seem to load iTCO_wdt:

watchdog               24576  2 iTCO_wdt
ipmi_watchdog          32768  0

watchdog               24576  1 iTCO_wdt
ipmi_watchdog          32768  1
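
To audit this across a fleet, the lsmod rows above can be reduced to two 
facts per host: which driver is listed as a user of the watchdog core module, 
and the use count of ipmi_watchdog. This is only a minimal sketch; the 
function names are made up for illustration, and it assumes the input is 
lsmod-style text (name, size, use count, optional user list).

```python
def summarize_watchdogs(lsmod_text):
    """Summarize watchdog-related rows from lsmod-style text.

    Returns the drivers listed as users of the 'watchdog' core module and
    the use count of ipmi_watchdog (None if the module is not listed).
    """
    summary = {"core_users": [], "ipmi_watchdog_refs": None}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # lsmod columns: module name, size, use count, optional user list
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        if fields[0] == "watchdog" and len(fields) > 3:
            summary["core_users"] = fields[3].split(",")
        elif fields[0] == "ipmi_watchdog":
            summary["ipmi_watchdog_refs"] = int(fields[2])
    return summary


def uses_ipmi_watchdog_only(summary):
    """True if ipmi_watchdog has a nonzero use count and no other driver
    is listed as a user of the watchdog core module."""
    return bool(summary["ipmi_watchdog_refs"]) and not summary["core_users"]


if __name__ == "__main__":
    newer = (
        "watchdog               24576  1 sp5100_tco\n"
        "ipmi_watchdog          32768  1\n"
    )
    print(summarize_watchdogs(newer))
```

A host where uses_ipmi_watchdog_only() returns False would be one to look at 
more closely.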

Interestingly, the BMC still reports a running watchdog timer, so there seems 
to be some integration going on (this is from a machine using sp5100_tco):

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      60 sec
Present Countdown:      47 sec
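
For comparing hosts, the status block above can be turned into a few 
checkable values. The sketch below assumes the text has the "Key: value" 
shape shown here (it looks like the output of `ipmitool mc watchdog get`, but 
that is an assumption, the command is not named above); the function names 
are hypothetical.

```python
import re


def parse_watchdog_status(text):
    """Parse 'Key:   value' lines into a dict, splitting on the first colon."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def describe(status):
    """Extract the fields that matter when comparing hosts."""
    pretimeout = re.match(r"(\d+)", status.get("Pre-timeout interval", ""))
    return {
        "running": status.get("Watchdog Timer Is") == "Started/Running",
        # Drop the trailing hex code, e.g. "Hard Reset (0x01)" -> "Hard Reset"
        "action": status.get("Watchdog Timer Actions", "").split(" (")[0],
        "pretimeout_s": int(pretimeout.group(1)) if pretimeout else None,
    }


if __name__ == "__main__":
    sample = """\
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
"""
    print(describe(parse_watchdog_status(sample)))
```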

So I guess we slipped here. My intention was for us to always use the IPMI 
watchdog; it appears the fleet has drifted from that without anyone noticing. 
Maybe this also explains the silence … ?

I’ll get this straightened out and combine it with the SOFTLOCKUP and 
HARDLOCKUP panic settings; maybe this will improve visibility into why things 
are crashing … -_-
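
As a rough way to verify the config side of this on each host, the kernel 
config text can be checked for the two detector options quoted at the top. 
This is a sketch only: it assumes the config is available as plain text 
(where it comes from on a given machine is not covered here), and the helper 
names are made up.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")

# The two options that were found disabled in the quoted config
WANTED = ("CONFIG_SOFTLOCKUP_DETECTOR", "CONFIG_HARDLOCKUP_DETECTOR")


def parse_kernel_config(text):
    """Map option name -> value, with None for '# ... is not set' lines."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = None
    return options


def missing_detectors(options, wanted=WANTED):
    """Return the wanted options that are not built in (value other than 'y')."""
    return [name for name in wanted if options.get(name) != "y"]


if __name__ == "__main__":
    sample = """\
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
# CONFIG_SOFTLOCKUP_DETECTOR is not set
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
CONFIG_TEST_LOCKUP=m
"""
    print(missing_detectors(parse_kernel_config(sample)))
```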

Christian

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


