[ClusterLabs] Got pacemaker into a hung state

Madison Kelly Sun, 15 Sep 2024 16:47:48 -0700

Hi all,

I was working on our OCF RA and hit a bug where the RA hung. (A DNS query returned a fake IP, probably from a search engine after I entered an invalid domain, and the RA hung checking whether the target was in ~/.ssh/known_hosts.) I was trying to do a migration at the time, which of course timed out and went into a FAILED state.
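
To illustrate the failure mode only (this is not our actual RA code), here is a minimal Python sketch of bounding an external check with a timeout, so that a hung lookup fails fast rather than running until the operation timeout expires. The helper name `run_bounded`, the 5-second limit, and the use of `ssh-keygen -F` for the known_hosts lookup are all assumptions made for the example.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s=5.0):
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    subprocess.run() kills and reaps the child when the timeout expires,
    so a hung helper cannot outlive this call.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


if __name__ == "__main__":
    # Hypothetical usage: the target host name is passed as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    # 'ssh-keygen -F <host>' searches known_hosts for the given host.
    rc = run_bounded(["ssh-keygen", "-F", target])
    if rc is None:
        print("known_hosts check timed out; treat as a failure")
    else:
        print(f"known_hosts check exited with {rc}")
```

In an RA, a `None` result would presumably map to a generic error return so the operation fails quickly instead of hanging.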

I expected the FAILED state, but after that, both nodes were repeatedly showing:


====

Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]: warning: Delaying join-33 finalization while transition in progress
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]: warning: Delaying join-33 finalization while transition in progress

====

I could not do a 'pcs resource cleanup', I could not withdraw the node I triggered the migration from, and even after I fenced the node that I had run the migration from, the peer remained stuck. In the end, I had to reboot both nodes in the pacemaker cluster.

This was a dev system, so no harm, but now I am worried something could leave a production system hung. How would you recover from a situation like this, without rebooting?


Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
