Re: [ClusterLabs] Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

Gao,Yan Mon, 04 Dec 2017 04:09:43 -0800

On 12/02/2017 08:30 AM, Andrei Borzenkov wrote:

01.12.2017 22:36, Gao,Yan wrote:

On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:

30.11.2017 16:11, Klaus Wenninger wrote:

On 11/30/2017 01:41 PM, Ulrich Windl wrote:

"Gao,Yan" <y...@suse.com> wrote on 30.11.2017 at 11:48 in message
<e71afccc-06e3-97dd-c66a-1b4bac550...@suse.com>:

On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:

SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
VM on VSphere using shared VMDK as SBD. During basic tests by killing
corosync and forcing STONITH pacemaker was not started after reboot.
In logs I see during boot

Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually host is not
declared as OFFLINE until 120s passed). But VM reboots lightning fast
and is up and running long before timeout expires.

As msgwait was intended for the message to arrive, and not for the
reboot time (I guess), this just shows a fundamental problem in SBD
design: Receipt of the fencing command is not confirmed (other than
by seeing the consequences of its execution).


The 2 x msgwait is not for confirmations but for writing the poison-pill
and for having it read by the target-side.


Yes, of course, but that's not what Ulrich likely intended to say.
msgwait must account for worst case storage path latency, while in
normal cases it happens much faster. If fenced node could acknowledge
having been killed after reboot, stonith agent could return success much
earlier.

How could an alive man be sure he died before? ;)


It does not need to. It simply needs to write something on startup to
indicate it is back.

It does that. The thing is the sender cannot just assume that the target
is ever gone based on that.

And it doesn't make sense for fencing to return success when the target
appears to be alive. If the sender kept watching the slot, it would
probably make more sense to let fencing return failure and try it again.


Regards,
  Yan


Actually, fenced side already does it - it clears pending message when
sbd is started. It is fencing side that simply unconditionally sleeps
for msgwait:

         if (mbox_write_verify(st, mbox, s_mbox) < -1) {
                 rc = -1; goto out;
         }
         if (strcasecmp(cmd, "exit") != 0) {
                 cl_log(LOG_INFO, "Messaging delay: %d",
                                 (int)timeout_msgwait);
                 sleep(timeout_msgwait);
         }

What if we do not sleep but rather periodically check slot for
acknowledgement for msgwait timeout? Then we could return earlier.
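
The polling idea above can be sketched roughly as follows. This is only an
illustration, not sbd code: wait_for_slot_clear(), slot_read_fn, the
fake_slot test double and the poll interval are hypothetical names, and the
sketch assumes that the restarted node's slot reads back as "clear". As
noted in the reply above, a cleared slot alone does not prove the target
ever went down, so a caller might prefer to treat that result as a reason
to retry rather than as success.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical callback: returns the command currently stored in the
 * target's slot, or NULL if the slot could not be read. */
typedef const char *(*slot_read_fn)(void *ctx);

enum wait_result { WAIT_ERROR = -1, WAIT_CLEARED = 0, WAIT_TIMEOUT = 1 };

/* Poll the slot every poll_interval seconds instead of sleeping for the
 * whole timeout_msgwait. Returns as soon as the slot reads "clear"
 * (assumption: that is what a restarted sbd leaves there), and never
 * waits longer than timeout_msgwait seconds in total. Early wake-ups of
 * sleep() caused by signals are ignored in this sketch. */
enum wait_result
wait_for_slot_clear(slot_read_fn read_slot, void *ctx,
                    unsigned int timeout_msgwait, unsigned int poll_interval)
{
    unsigned int waited = 0;

    if (poll_interval == 0)
        poll_interval = 1;      /* avoid a busy loop */

    for (;;) {
        const char *cmd = read_slot(ctx);

        if (cmd == NULL)
            return WAIT_ERROR;
        if (strcmp(cmd, "clear") == 0)
            return WAIT_CLEARED;
        if (waited >= timeout_msgwait)
            return WAIT_TIMEOUT;   /* same worst case as the plain sleep */

        unsigned int step = timeout_msgwait - waited;
        if (step > poll_interval)
            step = poll_interval;
        sleep(step);
        waited += step;
    }
}

/* Test double for illustration only: reports "reset" for the first
 * reads_before_clear reads and "clear" afterwards; a negative value
 * simulates a read error. */
struct fake_slot {
    int reads_before_clear;
    int reads;
};

const char *fake_read(void *ctx)
{
    struct fake_slot *s = ctx;

    if (s->reads_before_clear < 0)
        return NULL;
    return (s->reads++ >= s->reads_before_clear) ? "clear" : "reset";
}
```

With timeout_msgwait = 120 and a poll interval of one second, a target that
clears its slot ten seconds after the poison pill was written would be
noticed after about ten seconds, while a target that never clears it costs
the same 120 seconds as the current code.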

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


