I have had this small two-node cluster running since February. This morning
one of the servers (Node2) stopped responding on the external network
interface. To remedy this, the server was rebooted at the console (not by
me). When Node2 came back up, it showed the other node as offline and tried
to take over all the services. The node that stayed online the whole time
(Node1) had already taken over Node2's services when Node2 was rebooted.
(The internal network on Node2 was still active and responding.) Now Node1
shows Node2 offline, and Node2 shows Node1 offline. I've put Node2 in
standby using crm so that it stops trying to take back the services, since
it is not coordinating with the other node.

How do I get the node properly rejoined to the cluster? In all my previous
experience, a rebooted node simply rejoined and the services failed back
over as expected. This is the first time that has not happened.

I read another mailing list post about something similar that had to do
with nodeid changes. That is not the case here: I verified that the nodeid
in the previous logs matches the nodeid the node currently has registered.

That same post recommended deleting Node2 with crm on Node1 and restarting
Node2, along with deleting all of /var/lib/heartbeat/* on Node2 to flush the
CIB. My assumption is that Node2 would then sync with the cluster and update
automatically. That doesn't sound like advice I'd want to take blindly,
though; I hate assuming.

Can anyone point me in the right direction? Any input would be helpful.
Thank you.

James Mackie
EZProvider Networks, Inc.
http://ezp.net
1.888.397.7853 x202

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
