On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
> Hi all,
>
> I've run into an issue a couple of times now, and I'm not really sure
> what's causing it. I've got a RHEL 8 cluster that, after a while, will
> show one or more resources as 'FAILED'. When I try to do a cleanup, it
> marks the resources as stopped, despite them still running. After that,
> all attempts to manage the resources cause no change. The pcs command
> seems to have no effect, and in some cases refuses to return.
>
> The logs from the nodes (filtered for 'pcs' and 'pacem' since boot) are
> here (resources running on node 2):
>
> - https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt

The SNMP fence agent fails to start:

May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning: fence_apc_snmp[12842] stderr: [ ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning: fence_apc_snmp[12842] stderr: [ ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning: fence_apc_snmp[12842] stderr: [ 2021-05-12 23:29:25,955 ERROR: Please use '-h' for usage ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: warning: fence_apc_snmp[12842] stderr: [ ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]: notice: Operation 'monitor' [12842] for device 'apc_snmp_node2_an-pdu02' returned: -201 (Generic Pacemaker error)
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-controld[5951]: notice: Result of start operation for apc_snmp_node2_an-pdu02 on an-a02n01: error

which is fatal (because start-failure-is-fatal=true):

May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY
May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566

That happens for both devices on both nodes, so they get stopped
(successfully), which effectively disables them from being used, though
I don't see them needed in these logs so it wouldn't matter.

It looks like you did a cleanup here:

May 14 14:19:30 an-a02n01.alteeve.com pacemaker-controld[5951]: notice: Forcing the status of all resources to be redetected

It's hard to tell what happened after that without the detail log
(/var/log/pacemaker/pacemaker.log). The resource history should have
been wiped from the CIB, and probes of everything should have been
scheduled and executed. But I don't see any scheduler output, which is
odd.
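
The fail-count notices quoted above can be pulled out of a filtered log mechanically, which helps when checking that every fence device on every node is affected. A minimal sketch, assuming the pacemaker-attrd notice format seen in those sample lines; the function name `infinite_fail_counts` and the regular expression are illustrative and not part of Pacemaker:

```python
import re

# Matches the pacemaker-attrd notices quoted above, e.g.
#   Setting fail-count-<resource>#<operation>[<node>]: (unset) -> INFINITY
# The pattern is an assumption derived from those sample lines only.
FAIL_COUNT_RE = re.compile(
    r"Setting fail-count-(?P<resource>[^#\s]+)#(?P<operation>[^\[\s]+)"
    r"\[(?P<node>[^\]]+)\]: .* -> (?P<value>\S+)"
)


def infinite_fail_counts(lines):
    """Return (resource, operation, node) for each fail count set to INFINITY."""
    hits = []
    for line in lines:
        match = FAIL_COUNT_RE.search(line)
        if match and match.group("value") == "INFINITY":
            hits.append(
                (match.group("resource"), match.group("operation"), match.group("node"))
            )
    return hits


# Two lines copied from the log excerpt; only the first is a fail-count notice.
sample = [
    "May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: "
    "Setting fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY",
    "May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: "
    "Setting last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566",
]

for resource, operation, node in infinite_fail_counts(sample):
    print(f"{resource} {operation} failed on {node}")
```

Feeding it the full filtered log from each node would list every device and node pair that is banned from starting until its fail count is cleared.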

Then we get a shutdown request, but the node has already left without
getting the OK to do so:

May 14 14:22:58 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Setting shutdown[an-a02n02]: (unset) -> 1621016578
May 14 14:42:58 an-a02n01.alteeve.com pacemaker-controld[5951]: warning: Stonith/shutdown of node an-a02n02 was not expected
May 14 14:42:58 an-a02n01.alteeve.com pacemaker-attrd[5949]: notice: Node an-a02n02 state is now lost

The log ends there so I'm not sure what happens after that. I'd expect
this node to want to fence the other one. Since the fence devices are
failed, that can't happen, so that could be why the node is unable to
shut down itself.

> - https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt
>
> For example, it took 20 minutes for the 'pcs cluster stop' to
> complete. (Note that I tried restarting the pcsd daemon while waiting)
>
> BTW, I see the errors about fence_delay metadata, that will be fixed
> and I don't believe it's related.
>
> Any advice on what happened, how to avoid it, and how to clean up
> without a full cluster restart, should it happen again?
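
As a cross-check on the timeline, the value of the `shutdown` attribute in the log lines above is a Unix timestamp, so it can be compared with the time the peer was reported lost. A minimal sketch, assuming the nodes log in the same -0400 offset as the quoted message's date header; the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the nodes log in the same -0400 offset as the quoted mail header.
LOG_TZ = timezone(timedelta(hours=-4))

# Value of the shutdown[an-a02n02] attribute from the pacemaker-attrd notice.
shutdown_requested = datetime.fromtimestamp(1621016578, tz=LOG_TZ)

# Syslog timestamp of the "state is now lost" notice. Syslog omits the year,
# so 2021 is taken from the date in the log file name.
node_lost = datetime(2021, 5, 14, 14, 42, 58, tzinfo=LOG_TZ)

gap = node_lost - shutdown_requested
print(shutdown_requested.isoformat())
print(gap)
```

Under that timezone assumption, the shutdown request lines up with the 14:22:58 log entry, and the gap to the node being lost is 20 minutes, which matches the delay reported for 'pcs cluster stop'.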