Re: [Linux-cluster] totem token & post_fail_delay question

Christine Caulfield Tue, 26 Aug 2014 01:31:33 -0700

On 26/08/14 07:56, Vasil Valchev wrote:

Hello,


I have a cluster that sometimes has intermittent network issues on the
heartbeat network.
Unfortunately improving the network is not an option, so I am looking
for a way to tolerate longer interruptions.

Previously it seemed to me the post_fail_delay option is suitable, but
after some research it might not be what I am looking for.

If I am correct, when a member leaves (due to token timeout) the cluster
will wait the post_fail_delay before fencing. If the member rejoins
before that, it will still be fenced, because it has previous state?
From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?

It's an attempt by cman to sort out which node to kill in the situation
where a node rejoins too quickly. If both nodes try to send a 'kill'
message then both nodes would leave the cluster, leaving you with no
active nodes. So cman (and fencing) prioritise the node with the lowest
nodeID in an attempt at a tie-break. You should see a corresponding
message on the other node:

"Killing node %s because it has rejoined the cluster with existing state
and has higher node ID"
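
As a rough illustration of that tie-break (a sketch only, not cman's
actual code, which is written in C): assume each node compares its own
node ID with the ID of the node that rejoined with existing state, and
only the node with the lower ID sends the kill. The function names
below are hypothetical; the two message strings are the ones quoted in
this thread.

```python
def should_kill(local_node_id: int, rejoining_node_id: int) -> bool:
    """Return True if the local node should kill the rejoining node.

    Assumption for illustration: the node with the lower node ID wins
    the tie-break, so we only kill a peer whose ID is higher than ours.
    """
    return rejoining_node_id > local_node_id


def tie_break_message(local_node_id: int, rejoining_node_id: int,
                      rejoining_name: str) -> str:
    """Build the log line the local node would emit for this decision."""
    if should_kill(local_node_id, rejoining_node_id):
        return (f"Killing node {rejoining_name} because it has rejoined "
                "the cluster with existing state and has higher node ID")
    return (f"Not killing node {rejoining_name} despite it rejoining "
            "the cluster with existing state, it has a lower node ID")


# Two-node example: node IDs 1 and 2 are assumed for illustration.
# Seen from node 2, the rejoining peer (ID 1) has the lower ID and is spared;
# seen from node 1, the peer (ID 2) has the higher ID and is killed.
print(tie_break_message(2, 1, "node1cl"))
print(tie_break_message(1, 2, "node2cl"))
```

Because the comparison is strict, exactly one of the two nodes decides
to kill the other, so both cannot be removed at once.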

And lastly is increasing the totem token timeout the way to go?

If there is no option for improving the network situation then, yes,
increasing the token timeout is probably your best option.


Chrissie

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
