Hi all,

  I was working on our OCF RA, and had a bug where the RA hung. (specifically, a DNS query returned a fake IP, probably a search engine after entering an invalid domain, and the RA hung checking if the target was in ~/.ssh/known_hosts). Specifically, I was trying to do a migration, which of course timed out and went into a FAILED state.

  I expected the FAILED state, but after that, both nodes were repeatedly showing:

====
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  warning: Delaying join-33 finalization while transition in progress Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  warning: Delaying join-33 finalization while transition in progress
====

  I could not do a 'pcs resource cleanup', I could not withdraw the node I triggered the migration from, and even after I fenced the node that I had run the migration from, the peer remained stuck. In the end, I had to reboot both nodes in the pacemaker cluster.

  This was a dev system, so no harm, but now I am worried something could leave a production system hung. How would you recover from a situation like this, without rebooting?

Madi

--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Reply via email to