Hello guys,
it looks like I am missing something obvious, but I just don't understand
what happened.
I've got a number of stonith-enabled clusters within my big POWER boxes.
My stonith devices are two HMCs (Hardware Management Consoles), one in
each datacenter. These are separate servers from IBM that can reboot
individual LPARs (logical partitions) within the POWER boxes.
So my definition of the stonith devices was pretty straightforward:
primitive st_dc2_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9
primitive st_dc1_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8
clone cl_st_dc2_hmc st_dc2_hmc
clone cl_st_dc1_hmc st_dc1_hmc
Everything was OK when we tested failover. But today, during a power
outage, we lost one DC completely. Shortly after that the cluster
literally hung while trying to reboot the node that no longer existed.
No failover occurred. The lost node was marked OFFLINE UNCLEAN and its
resources were marked "Started UNCLEAN" on that node.
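
To make the symptom concrete, here is my current guess at the logic as a
toy model in Python. It is not Pacemaker code, and every name in it is
hypothetical. It assumes that a lost node must be fenced successfully
before its resources may be recovered, and that each HMC can only reboot
LPARs in its own datacenter (I have not verified the latter for my setup).

```python
# Toy model only: not Pacemaker code; all names here are hypothetical.
from dataclasses import dataclass


@dataclass
class FenceDevice:
    name: str
    datacenter: str          # datacenter the HMC itself lives in
    reachable: bool = True   # False once that datacenter loses power

    def can_fence(self, node_dc: str) -> bool:
        # Assumption: an HMC only manages LPARs in its own datacenter.
        return self.reachable and self.datacenter == node_dc


def recover(node_dc: str, devices: list) -> str:
    """Return 'failover' if any device can fence the node, else 'blocked'."""
    if any(dev.can_fence(node_dc) for dev in devices):
        return "failover"
    # Fencing cannot succeed: the node stays UNCLEAN, nothing is moved.
    return "blocked"


devices = [FenceDevice("st_dc1_hmc", "dc1"), FenceDevice("st_dc2_hmc", "dc2")]

# Failover test: one node in dc1 dies, but the dc1 HMC is still up.
test_case = recover("dc1", devices)

# Power outage: the whole of dc1 is gone, including its HMC.
devices[0].reachable = False
outage_case = recover("dc1", devices)

print(test_case, outage_case)
```

If that guess is right, our failover tests passed only because the HMC
in the affected datacenter was still reachable during them.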
UNCLEAN seems to flag a problem with the stonith configuration. So my
question is: how can I avoid such behaviour?
Thank you!
--
Regards,
Alexander