On 3/18/2011 9:20 PM, berg...@merctech.com wrote:
> The pithy ruminations from "Fabio M. Di Nitto" <fdini...@redhat.com> on "Re: 
> [Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were:
> 
> 
> => 
> => Wouldn't it be possible for the agent to:
> => 
> => 1) issue OFF command
> => 2) either poll for OFF status or wait > $known_random_max_delay
> => 3) issue ON command
> => 4) profit?
> 
> 
> Yes, but here's the problem:
> 
>       0) there's a condition whereby cluster communication is lost between nodeA and nodeB
>       1) the agent on nodeA sends OFF command to PDU to shut down nodeB
>       2) the agent on nodeA polls for OFF status while waiting > $known_random_max_delay
>       3) the agent on nodeB sends OFF command to PDU to shut down nodeA
>       4) nodeB shuts down
>       5) nodeA shuts down
> 
> The PDU responds quickly to network connections (i.e., telnet & commands to 
> shut down a power outlet). The PDU accepts multiple network sessions (i.e., 
> from nodeA and nodeB). The PDU delays executing the commands, potentially 
> leaving enough time for multiple nodes to send commands each to shut down the 
> "other" node.

This is true of virtually all two-node clusters, and it's a very
well-known fencing race condition.

There are several mechanisms to avoid it:

1) Fence delay option: one node sleeps N seconds before it is allowed
to fence.
2) Put both the cluster heartbeat traffic and the fence devices on the
same network (if node A can't reach the network, it also can't reach
the fence device).
3) qdiskd + heuristics.
4) Use a fence device that allows only one connection at a time (while
one node is connected, the other is refused).

Note that the race is independent of how long the device takes to
fence the node.
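
To make option 1 a bit more concrete, here is a rough toy model in
Python. It is only a sketch, not code from any real fence agent: the
function name fence_race, its parameters, and the timing model (each
node sends OFF for its peer after its own delay, and the PDU cuts power
pdu_latency seconds after receiving the command) are all assumptions
made up for illustration.

```python
def fence_race(delay_a, delay_b, pdu_latency):
    """Toy model: return the set of nodes still powered after the race.

    Hypothetical parameters (not from any real agent):
      delay_a, delay_b -- seconds each node waits before sending OFF
      pdu_latency      -- seconds the PDU takes to execute an OFF command
    """
    send_at = {"A": delay_a, "B": delay_b}
    peer = {"A": "B", "B": "A"}
    off_at = {}  # node -> time at which the PDU cuts its power

    # Process the nodes in the order in which they send their commands.
    for node in sorted(send_at, key=send_at.get):
        # A node that lost power before its send time never sends anything.
        if node in off_at and off_at[node] < send_at[node]:
            continue
        off_at[peer[node]] = send_at[node] + pdu_latency

    return {node for node in send_at if node not in off_at}


if __name__ == "__main__":
    # No delay on either side: both nodes fence each other.
    print(fence_race(0, 0, 5))
    # Node B waits longer than the PDU needs: only node A survives.
    print(fence_race(0, 10, 5))
```

In this model both nodes die when neither is delayed, even with
pdu_latency set to 0, while giving one node a delay longer than the
PDU needs leaves exactly one survivor.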

Fabio

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster