On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
> 04.12.2017 14:48, Gao,Yan пишет:
> > On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
> >> 30.11.2017 13:48, Gao,Yan wrote:
> >>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
> >>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; a two-node cluster of
> >>>> VMs on vSphere using a shared VMDK as the SBD device. During basic tests
> >>>> (killing corosync and forcing STONITH), pacemaker was not started after
> >>>> reboot. In the logs I see during boot:
> >>>>
> >>>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
> >>>> just fenced by sapprod01p for sapprod01p
> >>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
> >>>> process (3151) can no longer be respawned,
> >>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
> >>>> Pacemaker
> >>>>
> >>>> SBD timeouts are 60s for the watchdog and 120s for msgwait. It seems
> >>>> that stonith with SBD always takes msgwait (at least, visually the host
> >>>> is not declared OFFLINE until 120s have passed). But the VM reboots
> >>>> lightning fast and is up and running long before the timeout expires.
> >>>>
> >>>> I think I have seen a similar report already. Is this something that
> >>>> can be fixed by SBD/pacemaker tuning?
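
As an aside, the timeouts sbd actually uses can be read back from the
device header; a rough sketch of that check, with the device path below
only a placeholder for the shared VMDK:

    # print the SBD device header, including the watchdog, allocate,
    # loop and msgwait timeouts
    sbd -d /dev/disk/by-id/<shared-disk> dump
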
> >>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
> >>>
> >>
> >> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it has
> >> SBD 1.3.0), and with SBD_DELAY_START=yes sbd does not appear to watch the
> >> disk at all.
> > It simply waits that long on startup, before starting the rest of the
> > cluster stack, to make sure that any fencing that targeted it has
> > completed. It intentionally doesn't watch anything during this period.
> > 
> 
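
For reference, this is what the knob looks like in /etc/sysconfig/sbd; a
minimal sketch, where the device path is only a placeholder:

    # /etc/sysconfig/sbd (excerpt)
    SBD_DEVICE="/dev/disk/by-id/<shared-disk>"
    SBD_WATCHDOG_DEV="/dev/watchdog"
    # wait on startup before the rest of the cluster stack is started
    SBD_DELAY_START="yes"
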
> Unfortunately it waits too long.
> 
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
> 4min 16s ago
>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
> status=0/SUCCESS)
>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=killed, signa
>  Main PID: 1792 (code=exited, status=0/SUCCESS)
> 
> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
> 
> But the real problem is that even though SBD failed to start, the whole
> cluster stack continues to run; and because SBD blindly trusts nodes to
> behave well, fencing appears to succeed after the timeout ... without
> anyone acting on the poison pill ...
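
On the start timeout itself: it looks like systemd's default start limit
(90s here, judging by the timestamps) is simply shorter than the delay sbd
adds, so an untested workaround sketch would be a drop-in that raises the
limit above msgwait; the file name and the 180s value are just examples:

    # /etc/systemd/system/sbd.service.d/timeout.conf
    [Service]
    TimeoutStartSec=180

followed by "systemctl daemon-reload". That would only paper over the
start failure, of course, not the bigger problem of the stack running on
without sbd.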

That's something I have always wondered about: if a node is capable of
reading a poison pill, then before shutting down it could also write an
"I'm leaving" message into its slot. Wouldn't that make sbd more
reliable? Is there any reason not to implement that?
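
The slot messaging I have in mind is the same mechanism sbd already
exposes on the command line, so an "I'm leaving" note would just be one
more message type next to the existing ones; a sketch of the current
interface, with device path and node name as placeholders:

    # show the slots on the shared device and any pending messages
    sbd -d /dev/disk/by-id/<shared-disk> list

    # write a harmless test message into a node's slot
    sbd -d /dev/disk/by-id/<shared-disk> message <node> test

    # clear a node's slot again
    sbd -d /dev/disk/by-id/<shared-disk> message <node> clear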

Thanks,

Dejan

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
