On Tue, 2020-03-31 at 08:56 +0300, Andrei Borzenkov wrote:
> 31.03.2020 05:56, Ken Gaillot writes:
> > On Sat, 2020-02-22 at 03:50 +0200, Strahil Nikolov wrote:
> > > Hello community,
> > > 
> > > Recently I have started playing with fence_mpath and I have
> > > noticed that when the node is fenced, the node is kicked out of
> > > the cluster (corosync & pacemaker are shut down).
> > > 
> > > Fencing works correctly, but the IP address cannot be brought up
> > > on the designated 'replacement' host, because it was left on the
> > > old node.
> > > 
> > > I believe that this is a timing issue - the fenced node doesn't
> > > have time to shut down all its resources before pacemaker dies
> > > locally.
> > > 
> > > Can someone confirm this behaviour on another distro, as I'm
> > > currently testing it on RHEL7? If it is only for Red Hat, I can
> > > open a bug in the bugzilla.
> > > 
> > > Note: There is a workaround in order to reboot the node (using
> > > a symbolic link to /etc/watchdog.d) with the help of the
> > > fence_scsi or the fence_mpath scripts in /usr/share/cluster.
> > > 
> > > Best Regards,
> > > Strahil Nikolov
> > 
> > I'm no expert with fabric fencing, but from what I understand, this
> > is an inherent limitation. Cutting off the disk obviously has no
> > effect on resources (like an IP) that don't require that disk.
> > 
> > Pacemaker 2.0.3 added a new cluster property, "fence-reaction",
> > that controls what a node does when notified of its own fencing.
> > That's intended for cases like this (though it is only useful if
> > the node is still functioning well enough to process the
> > notification). The default of "stop" is pacemaker's traditional
> > response -- immediately stop pacemaker itself, which can leave
> > resources running. Using "panic" will make pacemaker halt the node
> > instead.
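[Editor's note: for readers following along, the "fence-reaction" property
mentioned above can be set cluster-wide. A minimal sketch, assuming
Pacemaker >= 2.0.3 and the pcs shell (the crm_attribute form is
equivalent); this is illustrative, not a recommendation for any
particular setup:

```shell
# Have a node halt itself when notified of its own fencing,
# instead of only stopping pacemaker (the default, "stop"):
pcs property set fence-reaction=panic

# Equivalent using pacemaker's own tool:
crm_attribute --type crm_config --name fence-reaction --update panic
```

With "panic", a node that survives fabric fencing well enough to receive
the notification will halt rather than leave resources such as IPs
running.]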
> > In theory, the ideal solution would be to use a fencing topology
> > to combine disk fencing with network access fencing via a smart
> > switch. However, there is a bug with that setup.
> 
> Could you elaborate or point to a bug report?
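[Editor's note: a fencing topology of the kind described would combine
both devices at one level, so that a node is only considered fenced when
both the disk and the network cutoff succeed. A sketch using pcs; the
device names, node names, and agent parameters below are hypothetical
and must be adapted to the actual hardware:

```shell
# Register the two devices (all names and parameters are illustrative):
pcs stonith create fence-disk fence_mpath key=1 \
    devices=/dev/mapper/mpatha pcmk_host_list="node1 node2"
pcs stonith create fence-net fence_ifmib ipaddr=switch1 \
    port=node1-port pcmk_host_list="node1 node2"

# Level 1 for node1: both devices must succeed for the fencing
# operation to be considered complete.
pcs stonith level add 1 node1 fence-disk,fence-net
```

As noted in the thread, a node fenced off the network this way has to be
manually unfenced before it can rejoin, since it cannot reach the
cluster to trigger unfencing itself.]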
I knew someone would ask that ;) I've searched through BZs and my notes
and can't find it, and I can't remember the details. I know a
network-fenced node needs to be manually unfenced before it can rejoin
the cluster, because obviously it can't send a join request to the
cluster to trigger unfencing. But there was some issue unique to the
combination of disk and network fencing that I can't recall at the
moment.

> > I'm not sure what people have traditionally done about the problem.
> 
> In the cases I am aware of, either there are no additional resources
> (like a SAP HANA scale-out multi-node database, where there is no IP
> failover - clients are aware of the topology and connect to each
> individual node), or the node is completely cut off (consider clients
> with LAN access only - if you cut off the network, it does not matter
> whether the node is still alive).
> 
> But yes, it is very unfortunate that "stonith" and "fencing" are
> mixed in pacemaker documentation, because they are really very
> different things and cannot in general be used interchangeably.
-- 
Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/