On 8/18/20 9:07 PM, Andrei Borzenkov wrote:
> 18.08.2020 17:02, Ken Gaillot wrote:
>> On Tue, 2020-08-18 at 08:21 +0200, Klaus Wenninger wrote:
>>> On 8/18/20 7:49 AM, Andrei Borzenkov wrote:
>>>> 17.08.2020 23:39, Jehan-Guillaume de Rorthais wrote:
>>>>> On Mon, 17 Aug 2020 10:19:45 -0500
>>>>> Ken Gaillot <kgail...@redhat.com> wrote:
>>>>>
>>>>>> On Fri, 2020-08-14 at 15:09 +0200, Gabriele Bulfon wrote:
>>>>>>> Thanks to all your suggestions, I now have the systems with
>>>>>>> stonith configured on ipmi.
>>>>>> A word of caution: if the IPMI is on-board -- i.e. it shares
>>>>>> the same power supply as the computer -- power becomes a single
>>>>>> point of failure. If the node loses power, the other node can't
>>>>>> fence because the IPMI is also down, and the cluster can't
>>>>>> recover.
>>>>>>
>>>>>> Some on-board IPMI controllers can share an Ethernet port with
>>>>>> the main computer, which would be a similar situation.
>>>>>>
>>>>>> It's best to have a backup fencing method when using IPMI as
>>>>>> the primary fencing method. An example would be an intelligent
>>>>>> power switch or sbd.
>>>>> How would SBD be useful in this scenario? The poison pill will
>>>>> not be swallowed by the dead node... Is it just to wait for the
>>>>> watchdog timeout?
>>>>>
>>>> A node is expected to commit suicide if SBD loses access to the
>>>> shared block device. So either the node swallowed the poison pill
>>>> and died, or the node died because it realized it was impossible
>>>> to see the poison pill, or the node was dead already. After the
>>>> watchdog timeout (twice the watchdog timeout for safety) we
>>>> assume the node is dead.
>>> Yes, like this a suicide via watchdog will be triggered if there
>>> are issues with the disk. This is why it is important to have a
>>> reliable watchdog with SBD even when using poison pill. As this
>>> alone would make a single shared disk a SPOF, when running with
>>> pacemaker integration (the default) a node with SBD will survive
>>> despite losing the disk as long as it has quorum and pacemaker
>>> looks healthy. As corosync quorum in 2-node mode obviously won't
>>> be fit for this purpose, SBD will switch to checking for the
>>> presence of both nodes if the 2-node flag is set.
>>>
>>> Sorry for the lengthy explanation, but the full picture is
>>> required to understand why it is sufficiently reliable and useful
>>> if configured correctly.
>>>
>>> Klaus
>> What I'm not sure about is how watchdog-only sbd would behave as a
>> fail-back method for a regular fence device. Will the cluster wait
>> for the sbd timeout no matter what, or only if the regular fencing
>> fails, or ...?
>>
> Diskless SBD implicitly creates a fencing device ("watchdog"); the
> timeout starts only when this device is selected for fencing. This
> device appears to be completely invisible to normal stonith_admin
> operation, and I do not know how to query for it. In my testing the
> explicit stonith resource was always called first, and only if it
> failed was "watchdog" self-fencing attempted. I tried to set a
> negative priority for the CIB stonith resource but it did not change
> anything.
>

This matches what I remember from going through the code: the
watchdog device is treated as if it had the lowest priority, but it
is not used at all if a fencing topology is defined, which probably
should be overhauled.

If you are interested, there is a branch on my pacemaker clone about
having just certain nodes watchdog-fenced; it also makes the watchdog
device visible.

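To summarize the behaviour discussed in this thread, here is a small Python sketch. It is only a toy model of what was described above, not code from sbd or pacemaker: the names `NodeView`, `keeps_feeding_watchdog`, `assumed_dead_after` and `fencing_order` are made up for illustration, and treating "disk is reachable" as sufficient for survival is a simplifying assumption (poison-pill handling is left out).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeView:
    """What the local node can observe (hypothetical model, not sbd's real state)."""
    disk_ok: bool            # shared block device is accessible
    pacemaker_healthy: bool  # local pacemaker looks healthy
    has_quorum: bool         # corosync reports quorum
    two_node: bool           # corosync 2-node flag is set
    nodes_seen: int          # cluster nodes currently present, including this one


def keeps_feeding_watchdog(v: NodeView) -> bool:
    """True if the node stays alive, False if it lets the watchdog fire."""
    if v.disk_ok:
        # Simplifying assumption: a reachable disk is enough to survive.
        return True
    if not v.pacemaker_healthy:
        return False
    if v.two_node:
        # Quorum is not usable in 2-node mode, so require both nodes present.
        return v.nodes_seen == 2
    return v.has_quorum


def assumed_dead_after(watchdog_timeout_s: float) -> float:
    """Seconds to wait before assuming the node is dead (twice the timeout for safety)."""
    return 2 * watchdog_timeout_s


def fencing_order(explicit_devices: List[str], topology_defined: bool) -> List[str]:
    """Order in which devices are tried, per the observations in this thread."""
    order = list(explicit_devices)
    if not topology_defined:
        # The implicit "watchdog" device acts as lowest priority, tried last.
        order.append("watchdog")
    return order


if __name__ == "__main__":
    lost_disk = NodeView(disk_ok=False, pacemaker_healthy=True,
                         has_quorum=True, two_node=False, nodes_seen=3)
    print(keeps_feeding_watchdog(lost_disk))
    print(fencing_order(["ipmi"], topology_defined=False))
```

The point of the last function is the one Andrei observed in testing: the explicit stonith resource is tried first, and the watchdog timeout only starts counting once "watchdog" is actually selected.
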
Klaus