On 08/14/2017 03:12 PM, Edwin Török wrote:
> On 14/08/17 13:46, Klaus Wenninger wrote:
> > What does your /etc/sysconfig/sbd look like?
> > With just that pcs-command you get some default-config with
> > watchdog-only-support.
>
> It currently looks like this:
>
> SBD_DELAY_START=no
> SBD_OPTS="-n cluster1"
> SBD_PACEMAKER=yes
> SBD_STARTMODE=always
> SBD_WATCHDOG_DEV=/dev/watchdog
> SBD_WATCHDOG_TIMEOUT=5
Ok, no surprises there.

> > Without cluster-property stonith-watchdog-timeout set to a
> > value matching (twice is a good choice) the watchdog-timeout
> > configured in /etc/sysconfig/sbd (default = 5s) a node will never
> > assume the unseen partner as fenced.
> > Anyway watchdog-only-sbd is of very limited use in 2-node
> > scenarios. It kind of limits the availability to that of the node
> > that would win the tie-breaker game. But it might still be useful
> > in certain scenarios of course (like load-sharing ...).
>
> Good point.

Still, the question remains why you didn't set stonith-watchdog-timeout ...

>
>> On 08/14/2017 12:20 PM, Ulrich Windl wrote:
>>> Hi!
>>>
>>> Have you tried studying the logs? Usually you get useful information
>>> from there (to share!).
>
> Here is journalctl and pacemaker.log output:
>
> Aug 14 08:57:26 cluster1 crmd[2221]: notice: Result of start operation for dlm on cluster1: 0 (ok)
> Aug 14 08:57:26 cluster1 sbd[2202]: pcmk: info: set_servant_health: Node state: online
> Aug 14 08:57:26 cluster1 sbd[2202]: pcmk: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:26 cluster1 sbd[2199]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> Aug 14 08:57:26 cluster1 sbd[2199]: notice: inquisitor_child: Active cluster detected
> Aug 14 08:57:26 cluster1 crmd[2221]: notice: Initiating monitor operation dlm:0_monitor_30000 locally on cluster1
> Aug 14 08:57:26 cluster1 crmd[2221]: notice: Transition 0 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-44.bz2): Complete
> Aug 14 08:57:26 cluster1 crmd[2221]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Aug 14 08:57:27 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:27 cluster1 sbd[2202]: pcmk: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2202]: pcmk: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2202]: pcmk: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:29 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:29 cluster1 sbd[2202]: pcmk: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:30 cluster1 corosync[2208]: [CFG ] Config reload requested by node 1
> Aug 14 08:57:30 cluster1 corosync[2208]: [TOTEM ] adding new UDPU member {10.71.77.147}
> Aug 14 08:57:30 cluster1 corosync[2208]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Aug 14 08:57:30 cluster1 corosync[2208]: [QUORUM] Members[1]: 1
> Aug 14 08:57:30 cluster1 crmd[2221]: warning: Quorum lost
> Aug 14 08:57:30 cluster1 pacemakerd[2215]: warning: Quorum lost

^^^^^^^^^ Looks unexpected. I'm not that familiar with how corosync handles
dynamic config-changes. Maybe you are on the losing side of the tie-breaker,
or wait_for_all is kicking in if it is configured. It would be interesting to
see how the two_node setting would handle that, but two_node would of course
break quorum-based fencing. If you have a disk you could use as a shared
device for sbd, you could achieve quorum-disk-like behavior.
(Your package-versions look as if you are using RHEL-7.4.)
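To make the watchdog-only setup actually declare the unseen peer as fenced,
roughly something like the following should do - just a sketch, untested
here; the 10s assumes twice your SBD_WATCHDOG_TIMEOUT=5, and /dev/sdb1 is
only a placeholder for whatever shared disk you might have:

  # let pacemaker consider a watchdog-fenced peer dead after 2x the watchdog-timeout
  pcs property set stonith-watchdog-timeout=10

  # optionally, if you do add a shared disk for sbd:
  sbd -d /dev/sdb1 create    # initialize the device once
  # and then reference it on both nodes in /etc/sysconfig/sbd:
  # SBD_DEVICE="/dev/sdb1"

With stonith-watchdog-timeout set, the surviving node would assume the lost
peer has self-fenced after that delay instead of waiting forever.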
> Aug 14 08:57:30 cluster1 sbd[2202]: pcmk: info: set_servant_health: Quorum lost: Ignore
> Aug 14 08:57:30 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:30 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:30 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:31 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:31 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:32 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:32 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:32 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:33 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:33 cluster1 sbd[2199]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> Aug 14 08:57:33 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:34 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:34 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:35 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:35 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:36 cluster1 sbd[2203]: cluster: info: notify_parent: Notifying parent: healthy
> Aug 14 08:57:36 cluster1 sbd[2199]: warning: inquisitor_child: Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> Aug 14 08:57:36 cluster1 sbd[2202]: pcmk: info: notify_parent: Not notifying parent: state transient (2)

From sbd's point of view this is the expected behavior. sbd handles ignore,
stop & freeze exactly the same, categorizing the problem as something
transient that might be overcome within the watchdog-timeout. In the case of
suicide it would self-fence immediately. Of course one might argue whether it
would make sense not to handle all three configurations the same way in sbd -
but that is how it is implemented at the moment.

Regards,
Klaus

>
> Thanks,
> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org