On 12/04/2017 07:55 PM, Andrei Borzenkov wrote:
On 04.12.2017 14:48, Gao,Yan wrote:
On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
30.11.2017 13:48, Gao,Yan пишет:
On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; a two-node cluster of VMs
on vSphere using a shared VMDK as the SBD device. During basic tests
(killing corosync and forcing STONITH), pacemaker was not started after
reboot. During boot I see in the logs:

Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually the host is not
declared OFFLINE until 120s have passed). But the VM reboots lightning
fast and is up and running long before the timeout expires.
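
For reference, the timeouts actually written to the SBD device can be
double-checked with sbd's dump command; the device path below is a
placeholder, and the timeout values shown are the ones from this report:

ha1:~ # sbd -d /dev/<sbd-device> dump
==Dumping header on disk /dev/<sbd-device>
...
Timeout (watchdog) : 60
Timeout (msgwait)  : 120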

I think I have seen a similar report already. Is this something that can
be fixed by SBD/pacemaker tuning?
SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
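
For reference, that is a single line in the sysconfig file:

# /etc/sysconfig/sbd (excerpt)
SBD_DELAY_START=yes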


I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has
SBD 1.3.0), and with SBD_DELAY_START=yes sbd does not appear to watch the
disk at all.
It simply waits that long on startup before starting the rest of the
cluster stack, to make sure any fencing that targeted it has completed. It
intentionally doesn't watch anything during this period of time.


Unfortunately it waits too long.

ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
  Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
  Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
 Main PID: 1792 (code=exited, status=0/SUCCESS)

Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.

But the real problem is: despite SBD failing to start, the whole cluster
stack continues to run; and because SBD blindly trusts that nodes behave
well, fencing appears to succeed after the timeout ... without anyone
taking any action on the poison pill ...
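
As an illustration, the message slots on the device can be inspected
directly; assuming a placeholder device path, an undelivered pill would
show up as a pending "reset" in the target's slot:

ha1:~ # sbd -d /dev/<sbd-device> list
0       ha1     clear
1       ha2     reset   ha1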
The start of sbd hits systemd's timeout for starting units, so systemd gives up on it and proceeds...

TimeoutStartSec in sbd.service should be configured accordingly, so that it is longer than msgwait.
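
A minimal sketch of such an override as a systemd drop-in; the 180s value
is an assumption, chosen only because it exceeds the 120s msgwait from
this report:

ha1:~ # mkdir -p /etc/systemd/system/sbd.service.d
ha1:~ # cat > /etc/systemd/system/sbd.service.d/timeout.conf <<'EOF'
[Service]
# Must be longer than SBD's msgwait (120s in this setup)
TimeoutStartSec=180
EOF
ha1:~ # systemctl daemon-reload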

Regards,
  Yan



ha1:~ # systemctl show sbd.service -p RequiredBy
RequiredBy=corosync.service

but

ha1:~ # systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
  Process: 2059 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 2073 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─2073 corosync

and

ha1:~ # crm_mon -1r
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Mon Dec  4 21:53:24 2017
Last change: Mon Dec  4 21:47:25 2017 by hacluster via crmd on ha1

2 nodes configured
1 resource configured

Online: [ ha1 ha2 ]

Full list of resources:

  stonith-sbd   (stonith:external/sbd): Started ha1

and if I now sever the connection between the two nodes, I will get two
single-node clusters, each believing it won ...

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

