Hi,

On Fri, Mar 25, 2011 at 10:58:51AM +0100, Andrew Beekhof wrote:
> On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin <pa...@levshin.spb.ru> wrote:
> > Hi.
> >
> > Today, we had a network outage. Quite a few problems suddenly arose in our
> > setup, including crashed corosync, the known notify bug in the DRBD RA, and a
> > problem with a VirtualDomain RA timeout on stop.
> >
> > But particularly strange was the fencing behaviour.
> >
> > Initially, one node (wapgw1-1) parted from the cluster. When the connection
> > was restored, corosync died on that node. The node was considered "offline
> > unclean" and was scheduled to be fenced. Fencing by HP iLO did not work
> > (currently, I do not know why). The second-priority fencing method is
> > meatware, and it did take time.
> >
> > The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> > resources. It was "online unclean" and was also scheduled to be fenced. HP
> > iLO was available for this node, but it was not STONITHed until I
> > manually confirmed STONITH for wapgw1-1.
> >
> > When I confirmed the first node's restart, the second node was fenced
> > automatically.
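[For readers unfamiliar with the meatware step mentioned above: the meatware plugin waits for a human operator to acknowledge the reset before the cluster proceeds. A hedged sketch of that confirmation, using the node name from the report (check your cluster-glue version's man page for the exact invocation):

```shell
# After physically power-cycling the failed node yourself, tell
# stonithd the "meatware" operation is complete so recovery continues:
meatclient -c wapgw1-1
```

Until this confirmation is given, the fencing operation stays pending, which is why the second node's fencing appeared to wait on it.]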
> > This is a very unusual case.
> >
> > Is this ordering intended behaviour or a bug?
>
> A little of both.
>
> The ordering (in the PE) was added because stonithd wasn't able to
> cope with parallel fencing operations.

The only issue stonithd may have is if there are stonith resource
clones and multiple instances try to reset the same node at the same
time and, finally, the device does not support more than one
simultaneous session. Otherwise, stonithd has no problem with multiple
parallel fencing operations.

> I don't know if this is still the case for stonithd in 1.0. Perhaps
> Dejan can comment.
>
> Unfortunately, as you saw, this means that we fence nodes one by one -
> and that if op N fails, we never try op > N.
>
> Ideally the ordering would be removed, let's see what Dejan has to say.

Yes, this kind of ordering is not necessary. Multiple nodes may be
fenced in parallel.

Thanks,

Dejan

> > It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
> >
> > --
> > Pavel Levshin

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
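[The cloned-stonith scenario Dejan describes would look roughly like this in crm shell syntax. This is a hedged sketch with hypothetical resource names, not the poster's actual configuration:

```
# One stonith primitive, cloned so every node runs an instance.
# Each clone instance may try to reset the same failed node at the
# same time - which is only a problem if the fencing device rejects
# more than one simultaneous session.
primitive st-meat stonith:meatware \
    params hostlist="wapgw1-1 wapgw1-2 wapgw1-3"
clone fencing-clone st-meat
```

With a device that tolerates concurrent sessions, this setup and parallel fencing of multiple nodes should coexist fine, per Dejan's comment above.]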