Hi all,

I've run into an issue a couple of times now, and I'm not really sure what's causing it. I've got a RHEL 8 cluster that, after a while, will show one or more resources as 'FAILED'. When I try to do a cleanup, it marks the resources as stopped, despite them still running. After that, all attempts to manage the resources cause no change. The pcs command seems to have no effect, and in some cases never returns.
The logs from the nodes (filtered for 'pcs' and 'pacem' since boot) are here (resources running on node 2):

- https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt
- https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt

For example, it took 20 minutes for the 'pcs cluster stop' to complete. (Note that I tried restarting the pcsd daemon while waiting.)

BTW, I see the errors about fence_delay metadata, that will be fixed and I don't believe it's related.

Any advice on what happened, how to avoid it, and how to clean up without a full cluster restart, should it happen again?

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
