Hi,

On Wed, Jan 28, 2015 at 01:53:17PM -0500, Jérôme Charaoui wrote:
> Hi,
>
> I'm testing a 2-node Corosync (1.4.6) and Pacemaker
> (1.1.10+git20130802) cluster on Debian 8.0 and having some problems
> with the stonith resources.
>
> I've set up two external/ipmi resources on each node and wanted to
> test how they would react by physically unplugging the IPMI device
> network interfaces.
>
> On the DC, no problem, the resource monitor fails, stop op succeeds
> and due to location constraints, as expected the resource enters the
> stop state and stays there. After replugging the network cable and
> cleaning up the resource, it gets restored to normal state.
>
> On the slave node, different scenario: after monitor op fails, stop
> op also fails for an unknown reason. The cluster then retries the

The stop operation for stonith devices does not involve the device at
all; it is just a stonithd operation, something like "disable
resource".

From the "slave" logs, after an abort,

Jan 28 12:04:22 [31422] scatlas01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it

stonithd exits:

Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: st_child_term: Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: stonith_shutdown: Terminating with 2 clients

Apparently, a number of stop operations were started for the same
resource, and all of them exited (or were cancelled) around 12:29:09.
There was probably some confusion in lrmd after stonithd left.

In short, you ran into a bug, but I guess it has been fixed in the
meantime. Beekhof and David Vossel should know.

Thanks,

Dejan

> stop operation unsuccessfully until I have the node enter/exit
> standby mode. Replugging the network cable on the IPMI device has no
> effect.
>
> At least, that's what I figure is happening from these logs:
>
> DC: http://pastebin.com/raw.php?i=QpwG6nea
> Slave: http://pastebin.com/raw.php?i=3nesX8yJ
> Config: http://pastebin.com/raw.php?i=3FrJuwWz
>
> Any help tracking down the issue would be much appreciated.
>
> Thanks!
>
> --
> Jérôme Charaoui
> Technicien informatique
> Collège de Maisonneuve
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org