On 26/08/14 07:56, Vasil Valchev wrote:
Hello,
I have a cluster that sometimes has intermittent network issues on the
heartbeat network.
Unfortunately improving the network is not an option, so I am looking
for a way to tolerate longer interruptions.
Previously it seemed to me the post_fail_delay option is suitable, but
after some research it might not be what I am looking for.
If I am correct, when a member leaves (due to token timeout) the cluster
will wait the post_fail_delay before fencing. If the member rejoins
before that, it will still be fenced, because it has previous state?
From a recent fencing on this cluster there is a strange message:
Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
despite it rejoining the cluster with existing state, it has a lower node ID
What does this mean?
It's an attempt by cman to sort out which node to kill in the situation
where a node rejoins too quickly. If both nodes tried to send a 'kill'
message, then both nodes would leave the cluster, leaving you with no
active nodes. So cman (and fencing) prioritise the node with the lowest
nodeID as a tie-break. You should see a corresponding
message on the other node:
"Killing node %s because it has rejoined the cluster with existing state
and has higher node ID"
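
As a rough illustration of the tie-break described above (a sketch only, not cman's actual code; the function names and the example node IDs are hypothetical), each node compares its own node ID with that of the rejoining peer, so only one side ever sends the kill:

```python
# Illustrative sketch of the lowest-node-ID tie-break. This is NOT cman source
# code; function names and node IDs are made up for the example.

def should_kill_rejoining_node(local_node_id: int, rejoining_node_id: int) -> bool:
    """Return True if the local node should kill the peer that rejoined with state.

    The node with the lower ID wins the tie-break, so the local node only
    kills a peer whose node ID is higher than its own.
    """
    return rejoining_node_id > local_node_id


def log_message(local_node_id: int, rejoining_node_id: int, rejoining_name: str) -> str:
    """Build the log line a node would emit, using the message texts quoted in this thread."""
    if should_kill_rejoining_node(local_node_id, rejoining_node_id):
        return ("Killing node %s because it has rejoined the cluster with "
                "existing state and has higher node ID" % rejoining_name)
    return ("Not killing node %s despite it rejoining the cluster with "
            "existing state, it has a lower node ID" % rejoining_name)


if __name__ == "__main__":
    # Assumed IDs: node1cl = 1, node2cl = 2 (hypothetical values).
    print(log_message(2, 1, "node1cl"))  # what the higher-ID node logs
    print(log_message(1, 2, "node2cl"))  # what the lower-ID node logs
```

With two nodes, exactly one of the two comparisons is true, which is what prevents both nodes from killing each other.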
And lastly is increasing the totem token timeout the way to go?
If there is no option for improving the network situation then, yes,
increasing token timeout is probably your best option.
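
For reference, on a cman-based cluster the token timeout is normally set in /etc/cluster/cluster.conf. The fragment below is only a sketch: the cluster name, config_version and the 30000 ms value are placeholders, not recommendations, and the right timeout depends on how long your network interruptions actually last.

```xml
<!-- /etc/cluster/cluster.conf (fragment; all values are illustrative) -->
<cluster name="mycluster" config_version="42">
  <!-- token is in milliseconds; 30000 is a placeholder value -->
  <totem token="30000"/>
  <!-- post_fail_delay is in seconds and only delays fencing after a node has failed -->
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <!-- existing clusternodes, fencedevices and rm sections stay unchanged -->
</cluster>
```

Remember to increment config_version whenever you edit the file, and to propagate the new configuration to all nodes.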
Chrissie
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster