On Wed, Nov 9, 2022 at 2:58 PM Robert Hayden <robert.h.hay...@oracle.com> wrote:
>
> > -----Original Message-----
> > From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei Borzenkov
> > Sent: Wednesday, November 9, 2022 2:59 AM
> > To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
> > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> >
> > On Mon, Nov 7, 2022 at 5:07 PM Robert Hayden <robert.h.hay...@oracle.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Users <users-boun...@clusterlabs.org> On Behalf Of Valentin Vidic via Users
> > > > Sent: Sunday, November 6, 2022 5:20 PM
> > > > To: users@clusterlabs.org
> > > > Cc: Valentin Vidić <vvi...@valentin-vidic.from.hr>
> > > > Subject: Re: [ClusterLabs] [External] : Re: Fence Agent tests
> > > >
> > > > On Sun, Nov 06, 2022 at 09:08:19PM +0000, Robert Hayden wrote:
> > > > > When SBD_PACEMAKER was set to "yes", the lack of network connectivity
> > > > > to the node would be seen and acted upon by the remote nodes (evicts
> > > > > and takes over ownership of the resources). But the impacted node
> > > > > would just sit logging IO errors. Pacemaker would keep updating the
> > > > > /dev/watchdog device so SBD would not self evict. Once I re-enabled
> > > > > the network, then the
> > > >
> > > > Interesting, not sure if this is the expected behaviour based on:
> > > >
> > > > https://lists.clusterlabs.org/pipermail/users/2017-August/022699.html

Which versions of pacemaker/corosync/sbd are you using?
IIRC, one result of the discussion linked above was sbd checking
watchdog-timeout against sync-timeout when qdevice is in use. The default
sync-timeout is 30s and your watchdog-timeout is 20s, so I would expect a
reasonably current sbd to refuse to start up. But IIRC, in the discussion
linked, the pacemaker node did finally become non-quorate; there was just a
possible split-brain gap when sync-timeout > watchdog-timeout.
So if your pacemaker instance stays quorate, it has to be something else.

> > > >
> > > > Does SBD log "Majority of devices lost - surviving on pacemaker" or
> > > > some other messages related to Pacemaker?
> > >
> > > Yes.
> > >
> > > > Also what is the status of Pacemaker when the network is down? Does it
> > > > report no quorum or something else?
> > >
> > > Pacemaker on the failing node shows quorum even though it has lost
> > > communication to the Quorum Device and to the other node in the cluster.
> > > The non-failing node of the cluster can see the Quorum Device system and
> > > thus correctly determines to fence the failing node and take over its
> > > resources.

Hmm ... maybe some problem with the qdevice setup and/or the quorum
strategy (LMS, for instance). If quorum doesn't work properly, your cluster
won't work properly, regardless of whether sbd kills the node properly.

> > >
> > > Only after I run firewall-cmd --panic-off will the failing node start
> > > to log messages about loss of TOTEM and getting a new consensus with
> > > the now-visible members.
> >
> > Where exactly do you use firewalld panic mode? You have hosts, you
> > have VMs, you have a qnode ...
> >
> > Have you verified that the network is blocked bidirectionally? I had
> > rather mixed experience with asymmetrical firewalls, which resembles
> > your description.
>
> In my testing harness, I send a script to the remote node which contains
> firewall-cmd --panic-on, a sleep command, and then turns panic mode back
> off. That way I can adjust the length of time the network is unavailable
> on a single node. I used to log into a network switch to turn ports off,
> but that is not possible in a Cloud environment.
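A harness like the one described above can be sketched roughly as follows.
The function name and the FIREWALL_CMD override (handy for dry runs) are my
own, not from the thread; the script must run detached on the target node
(e.g. via nohup, at, or systemd-run), since any ssh session freezes the
moment panic mode is enabled:

```shell
# Rough sketch of the isolation harness described above.  Hypothetical
# helper name; FIREWALL_CMD can be overridden for a dry run.
isolate_node() {
    secs="${1:-60}"                               # length of the simulated outage
    ${FIREWALL_CMD:-firewall-cmd} --panic-on      # drop ALL traffic, both directions
    sleep "$secs"                                 # node stays isolated (iSCSI/sbd disk too)
    ${FIREWALL_CMD:-firewall-cmd} --panic-off     # restore networking
}
```

Run e.g. as `nohup sh -c '. ./harness.sh; isolate_node 120' &` so the
outage window survives the loss of the controlling session.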
> I have also played with manually creating iptables rules, but panic mode
> is simply easier and accomplishes the task.
>
> I have verified that when panic mode is on, no inbound or outbound
> network traffic is allowed. This includes iSCSI packets as well. You had
> better have access to the console or the ability to reset the system.
>
> > > > Also it may depend on the corosync driver in use.
> > >
> > > I think all of that explains the lack of self-fencing when the sbd
> > > setting SBD_PACEMAKER=yes is used.

Are you aware that when setting SBD_PACEMAKER=no with just a single disk,
that disk becomes a SPOF?

Klaus

> >
> > Correct. This means that at least under some conditions
> > pacemaker/corosync fail to detect isolation.
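For reference, the settings being discussed live in /etc/sysconfig/sbd (or
/etc/default/sbd, depending on the distribution). The values below are
illustrative only, not taken from the poster's setup; the device path is a
placeholder:

```shell
# /etc/sysconfig/sbd -- illustrative values, not from the poster's cluster
SBD_DEVICE=/dev/disk/by-id/<shared-disk>   # a single disk here with
                                           # SBD_PACEMAKER=no is a SPOF
SBD_PACEMAKER=yes                          # also consult pacemaker quorum/health
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=20                    # the 20s watchdog-timeout mentioned above
```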
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/