On 3/18/2011 9:20 PM, berg...@merctech.com wrote:
> The pithy ruminations from "Fabio M. Di Nitto" <fdini...@redhat.com> on "Re:
> [Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were:
>
> => Wouldn't it be possible for the agent to:
> =>
> => 1) issue OFF command
> => 2) either poll for OFF status or wait > $known_random_max_delay
> => 3) issue ON command
> => 4) profit?
>
> Yes, but here's the problem:
>
> 0) there's a condition whereby cluster communication is lost between
>    nodeA and nodeB
> 1) the agent on nodeA sends OFF command to PDU to shut down nodeB
> 2) the agent on nodeA polls for OFF status while waiting
>    > $known_random_max_delay
> 3) the agent on nodeB sends OFF command to PDU to shut down nodeA
> 4) nodeB shuts down
> 5) nodeA shuts down
>
> The PDU responds quickly to network connections (ie., telnet & commands to
> shut down a power outlet). The PDU accepts multiple network sessions (ie.,
> from nodeA and nodeB). The PDU delays executing the commands, potentially
> leaving enough time for multiple nodes to send commands each to shut down the
> "other" node.
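To make the sequence quoted above concrete, here is a minimal toy simulation of it. Everything in it is an assumption for illustration: the names simulate, FenceRequest, pdu_latency and fence_delay are hypothetical and not part of any fence agent, and the PDU is modelled only as "accepts every OFF command at once, cuts power pdu_latency seconds later". The per-node fence_delay is my addition, so that staggering the two nodes can be compared with the simultaneous case.

```python
from dataclasses import dataclass


@dataclass
class FenceRequest:
    sent_at: float  # time at which the agent issues the OFF command
    source: str     # node that issues the command
    target: str     # node whose outlet should be switched off


def simulate(pdu_latency, fence_delay):
    """Return the set of nodes still powered once the PDU has run all commands.

    pdu_latency: seconds between the PDU accepting OFF and cutting power
                 (hypothetical model of the delayed execution described above).
    fence_delay: dict mapping node name -> seconds that node waits after
                 losing cluster communication before it tries to fence its peer.
    """
    peers = {"nodeA": "nodeB", "nodeB": "nodeA"}
    # Communication is lost at t=0; each node sends OFF after its own delay.
    requests = sorted(
        (FenceRequest(fence_delay[node], node, peers[node]) for node in peers),
        key=lambda req: req.sent_at,
    )
    powered_off_at = {}  # node -> time at which its outlet is cut
    for req in requests:
        # A node that has already lost power by its send time never sends OFF.
        if req.source in powered_off_at and powered_off_at[req.source] <= req.sent_at:
            continue
        powered_off_at.setdefault(req.target, req.sent_at + pdu_latency)
    return set(peers) - set(powered_off_at)


# Both nodes fence immediately: the PDU accepts both commands, both go down.
print(simulate(5, {"nodeA": 0, "nodeB": 0}))   # set()

# nodeB waits longer than the PDU latency: it is off before it can fence.
print(simulate(5, {"nodeA": 0, "nodeB": 10}))  # {'nodeA'}
```

The second call assumes the wait on nodeB is longer than the PDU's execution latency; the numbers are made up and only show the shape of the race, not the behaviour of any particular PDU.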
This is true for virtually all two-node clusters, and it's a very well-known
fencing race condition. There are several mechanisms to avoid it:

1) fence delay option: one node basically sleeps N seconds before it can
   fence
2) put both cluster heartbeat traffic and fence devices on the same network
   (if node A can't access the net, it also can't access the fence device)
3) qdiskd + heuristics
4) use a fence device that allows only one connection at a time (one node
   gets access, the other is refused), and note that this is independent of
   how long the device takes to fence the node.

Fabio

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster