Hello guys,

It looks like I'm missing something obvious, but I just don't understand what happened.

I've got a number of stonith-enabled clusters running on my big POWER boxes. My stonith devices are two HMCs (Hardware Management Consoles), separate servers from IBM that can reboot individual LPARs (logical partitions) within the POWER boxes. There is one HMC per datacenter.

So my definition for stonith devices was pretty straightforward:

primitive st_dc2_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9
primitive st_dc1_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8
clone cl_st_dc2_hmc st_dc2_hmc
clone cl_st_dc1_hmc st_dc1_hmc

Everything was OK when we tested failover. But today, during a power outage, we lost one DC completely. Shortly after that, the cluster simply hung while trying to reboot the nonexistent node. No failover occurred. The nonexistent node was marked OFFLINE UNCLEAN, and its resources were marked "Started UNCLEAN" on that node.

UNCLEAN seems to indicate a problem with the stonith configuration. So my question is: how can I avoid this behaviour?

Thank you!

--
Regards,
Alexander

