Just an update: I added another level of fencing using watchdog fencing. Combined with the quorum device, this setup now survives a power failure of both the server and its IPMI interface. An important note is that stonith-watchdog-timeout must be configured for this to work. After reading the following great post: http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit , I chose the softdog watchdog, since I don't think the IPMI watchdog would do any good when the IPMI interface itself is down (when it is up, it is already used as a fencing method).
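For reference, here is roughly how the watchdog itself was set up on both nodes. This is only a sketch of my own setup (CentOS 7 style paths, sbd package already installed); the module name, file paths and timeout value are specific to my environment and may need adjusting:

    # load the softdog module now and on every boot
    modprobe softdog
    echo softdog > /etc/modules-load.d/softdog.conf

    # /etc/sysconfig/sbd - point sbd at the watchdog device (no shared disk configured)
    # SBD_WATCHDOG_DEV=/dev/watchdog
    # SBD_WATCHDOG_TIMEOUT=5

As far as I understand, stonith-watchdog-timeout should be comfortably larger than SBD_WATCHDOG_TIMEOUT, which is why I went with 15 against the (default) watchdog timeout of 5.
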
Just to document the solution (in case someone else needs it), the configuration I added is:

    systemctl enable sbd
    pcs property set no-quorum-policy=suicide
    pcs property set stonith-watchdog-timeout=15
    pcs quorum device add model net host=qdevice algorithm=lms

I just can't decide whether the qdevice algorithm should be lms or ffsplit. I couldn't determine the difference between them, and I'm not sure which one is best for a two-node cluster with qdevice and watchdog fencing. Can anyone advise on that?

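For anyone reproducing this, a quick sanity check that the qdevice is actually contributing a vote (these are the standard status commands, nothing specific to my setup):

    pcs quorum status          # membership and vote summary, should show the Qdevice vote
    corosync-quorumtool -s     # same information directly from corosync
    corosync-qdevice-tool -s   # state of the qdevice daemon on the local node
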
-----Original Message-----
From: Jan Friesse [mailto:jfrie...@redhat.com]
Sent: Tuesday, July 25, 2017 11:59 AM
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; kwenn...@redhat.com; Prasad, Shashank <sspra...@vanu.com>
Subject: Re: [ClusterLabs] Two nodes cluster issue

> Tomer Azran napsal(a):
>> I tend to agree with Klaus – I don't think that having a hook that bypass stonith is the right way. It is better to not use stonith at all.
>> I think I will try to use an iScsi target on my qdevice and set SBD to use it.
>> I still don't understand why qdevice can't take the place SBD with shared storage; correct me if I'm wrong, but it looks like both of them are there for the same reason.
>
> Qdevice is there to be third side arbiter who decides which partition is quorate. It can also be seen as a quorum only node. So for two node cluster it can be viewed as a third node (eventho it is quite special because it cannot run resources). It is not doing fencing.
>
> SBD is fencing device. It is using disk as a third side arbiter.

I've talked with Klaus and he told me that 7.3 is not using disk as a third side arbiter so sorry for confusion. You should however still be able to use sbd for checking if pacemaker is alive and if the partition has quorum - otherwise the watchdog kills the node. So qdevice will give you "3rd" node and sbd fences unquorate partition. Or (as mentioned previously) you can use fabric fencing.

Regards,
  Honza

>>
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Monday, July 24, 2017 9:01 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; Prasad, Shashank <sspra...@vanu.com>
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>
>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>> Sometimes IPMI fence devices use shared power of the node, and it cannot be avoided.
>> In such scenarios the HA cluster is NOT able to handle the power failure of a node, since the power is shared with its own fence device.
>> The failure of IPMI based fencing can also exist due to other reasons also.
>>
>> A failure to fence the failed node will cause cluster to be marked UNCLEAN.
>> To get over it, the following command needs to be invoked on the surviving node.
>>
>> pcs stonith confirm <failed_node_name> --force
>>
>> This can be automated by hooking a recovery script, when the the Stonith resource 'Timed Out' event.
>> To be more specific, the Pacemaker Alerts can be used for watch for Stonith timeouts and failures.
>> In that script, all that's essentially to be executed is the aforementioned command.
>>
>> If I get you right here you can disable fencing then in the first place.
>> Actually quorum-based-watchdog-fencing is the way to do this in a safe manner. This of course assumes you have a proper source for quorum in your 2-node-setup with e.g. qdevice or using a shared disk with sbd (not directly pacemaker quorum here but similar thing handled inside sbd).
>>
>> Since the alerts are issued from 'hacluster' login, sudo permissions for 'hacluster' needs to be configured.
>>
>> Thanx.
>>
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Monday, July 24, 2017 9:24 PM
>> To: Kristián Feldsam; Cluster Labs - All topics related to open-source clustering welcomed
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>
>> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>> I personally think that power off node by switched pdu is more safe, or not?
>>
>> True if that is working in you environment. If you can't do a physical setup where you aren't simultaneously loosing connection to both your node and the switch-device (or you just want to cover cases where that happens) you have to come up with something else.
>>
>> Best regards, Kristián Feldsam
>> Tel.: +420 773 303 353, +421 944 137 535
>> E-mail: supp...@feldhost.cz
>> www.feldhost.cz - FeldHost™ – professional hosting and server services at fair prices.
>>
>> FELDSAM s.r.o.
>> V rohu 434/3
>> Praha 4 – Libuš, PSČ 142 00
>> IČ: 290 60 958, DIČ: CZ290 60 958
>> C 200350, registered with the Municipal Court in Prague
>>
>> Bank: Fio banka a.s.
>> Account number: 2400330446/2010
>> BIC: FIOBCZPPXX
>> IBAN: CZ82 2010 0000 0024 0033 0446
>>
>> On 24 Jul 2017, at 17:27, Klaus Wenninger <kwenn...@redhat.com> wrote:
>>
>> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>> I still don't understand why the qdevice concept doesn't help on this situation. Since the master node is down, I would expect the quorum to declare it as dead.
>> Why doesn't it happens?
>>
>> That is not how quorum works. It just limits the decision-making to the quorate subset of the cluster.
>> Still the unknown nodes are not sure to be down.
>> That is why I suggested to have quorum-based watchdog-fencing with sbd.
>> That would assure that within a certain time all nodes of the non-quorate part of the cluster are down.
>>
>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" <dmitri.maz...@gmail.com> wrote:
>>
>> On 2017-07-24 07:51, Tomer Azran wrote:
>>> We don't have the ability to use it.
>>> Is that the only solution?
>>
>> No, but I'd recommend thinking about it first. Are you sure you will care about your cluster working when your server room is on fire? 'Cause unless you have halon suppression, your server room is a complete write-off anyway. (Think water from sprinklers hitting rich chunky volts in the servers.)
>> Dima
>>
>> --
>> Klaus Wenninger
>> Senior Software Engineer, EMEA ENG Openstack Infrastructure
>> Red Hat
>> kwenn...@redhat.com

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org