I have had this small 2-node cluster running since February. This morning one of the servers (Node2) stopped responding on its external network interface. To remedy this, the server was rebooted at the console (not by me). When Node2 came back up, it showed the other node as offline and tried to take over all the services. The node that was online the whole time (Node1) had taken over Node2's services when Node2 was rebooted (the internal network on Node2 was still active and responding). Now Node1 shows Node2 offline, and Node2 shows Node1 offline. I have put Node2 in standby using crm so that it stops trying to take back the services, since it is not coordinating with the other node.
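To make the symptom concrete, here is a minimal sketch in Python of the condition I am describing: each node reports the other as offline. This is only an illustration, not something run against the cluster. The function name `views_disagree` and the `views` mapping are hypothetical; the membership sets would have to be filled in by hand from whatever each node reports.

```python
# Hypothetical illustration only: `views` maps each node name to the set of
# nodes that node currently considers online (filled in by hand).
def views_disagree(views):
    """Return True if some pair of nodes each consider the other offline."""
    nodes = sorted(views)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            # Mutual disagreement: a does not see b, and b does not see a.
            if b not in views[a] and a not in views[b]:
                return True
    return False


# The situation described above: each node only sees itself as online.
current = {"Node1": {"Node1"}, "Node2": {"Node2"}}
print(views_disagree(current))
```

In a healthy 2-node cluster both sets would contain both names and the check would return False.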
How do I get the node properly rejoined to the cluster? In all my previous experience it simply rejoined, and the services failed back over as expected. This is the first time the expected behavior has not occurred.

I read another mailing list post about something similar that had to do with nodeid changes. That is not the case here: I verified that the nodeid in the previous logs matches the nodeid the node currently has registered. That same post recommended deleting Node2 with crm on Node1, deleting everything in /var/lib/heartbeat/* on Node2 to flush the CIB, and then restarting Node2. My assumption is that Node2 will then sync with the cluster and update automatically. That does not sound like advice I would want to follow blindly, and I hate assuming.

Does anyone have any input that will point me in the right direction? Any input would be helpful.

Thank you.

James Mackie
EZProvider Networks, Inc.
http://ezp.net
1.888.397.7853 x202
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais