Hi everyone,
Hi Corey,

(I hope everyone had a good holiday season and made it into 2024 in good health!)

You might remember that I was chasing mysterious watchdog reboots without any
specific issues being shown on the serial console or in the SEL.

In December we stumbled over an insight that has given us a valuable clue. We
had been running with a reduced watchdog timeout (60 seconds, with systemd
signalling every 20 seconds).

I *think* this may have led to either false positives *OR* simply masked
lockups/stalls that the kernel's detectors would have found and reported if
they had been given more time before the watchdog reset the machine.

We have increased our watchdog timeouts to 5 minutes now and have even decided
to remove watchdogs from KVM hosts (keeping them enabled on routers, backup
servers and Ceph servers, as those will not cause service interruptions when a
watchdog reset triggers).

We have not seen an actual lockup or stall in the last 3 weeks, but I think we
did solve a significant part of the mystery. Maybe reporting it here helps
record it for posterity and helps someone else in the future.

Thanks for the help so far!
Christian

-- 
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick



_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer
