Hi everyone, hi Corey,

(I hope everyone had a good holiday season and made it into 2024 in good health!)

You might remember that I was chasing mysterious watchdog reboots that showed no specific issues on the serial console or in the SEL. In December we stumbled over an insight that has given us a valuable clue.

We had been running the watchdog with a reduced timeout (60 seconds, with systemd signalling every 20 seconds). I *think* this may have led to either false positives *or* simply masked lockups/stalls that the kernel would have reported, had its detectors been given more time to find and report them.

We have now increased our watchdog timeouts to 5 minutes and have even decided to remove watchdogs from KVM hosts (keeping them enabled on routers, backup servers and Ceph servers, as those will not cause service interruptions when a watchdog fires).

We have not yet seen an actual lockup/stall in the last 3 weeks, but I think we have solved a significant part of the mystery. Maybe reporting it here helps record it for posterity and might help someone else in the future.

Thanks for the help so far!

Christian

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer