On 12/14/2016 01:26 PM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 8 Dec 2016 11:47:20 +0100
> Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:
>
>> Hello,
>>
>> While setting these various parameters, I couldn't find documentation and
>> details about them. Below are some questions.
>>
>> Considering the watchdog module used on a server is set up with a 30s timer
>> (let's call it the wdt, the "watchdog timer"), how should
>> "SBD_WATCHDOG_TIMEOUT", "stonith-timeout" and "stonith-watchdog-timeout" be
>> set?
>>
>> Here is my thinking so far:
>>
>> "SBD_WATCHDOG_TIMEOUT < wdt". The sbd daemon should reset the timer before
>> the wdt expires so the server stays alive. Online resources and default
>> values are usually "SBD_WATCHDOG_TIMEOUT=5s" and "wdt=30s". But what if sbd
>> fails to reset the timer multiple times (e.g. because of excessive load,
>> swap storm, etc.)? The server will not reset before
>> random*SBD_WATCHDOG_TIMEOUT or wdt, right?

SBD_WATCHDOG_TIMEOUT (e.g. in /etc/sysconfig/sbd) is already the timeout that
sbd-daemon configures the hardware watchdog with. sbd-daemon itself triggers
the watchdog more often than that: timeout_loop defaults to 1s but is
configurable.

SBD_WATCHDOG_TIMEOUT (and maybe the loop timeout as well, though keeping it
significantly shorter should be sufficient) has to be configured so that
failing to trigger in time either means a failure with high enough certainty,
or means that a machine showing comparable response times would violate the
timing requirements of the services running on it and in the cluster anyway.

Keep in mind that sbd-daemon defaults to running realtime-scheduled and is
thus going to be more responsive than the usual services on the system. You
do of course have to consider that the watchers (child processes of sbd that
observe e.g. the block-device(s), corosync, pacemaker_remoted or pacemaker
node-health) might be significantly less responsive because of their
communication partners.

>>
>> "stonith-watchdog-timeout > SBD_WATCHDOG_TIMEOUT". I'm not quite sure what
>> stonith-watchdog-timeout is. Is it the maximum time to wait from stonithd
>> after it asked for a node fencing before it considers the watchdog was
>> actually triggered and the node reset, even with no confirmation? I suppose
>> "stonith-watchdog-timeout" is mostly useful to stonithd, right?

Yes, it is the time after which we can assume a node has been killed by the
hardware watchdog. Double the hardware-watchdog timeout is a good choice.

>>
>> "stonith-watchdog-timeout < stonith-timeout". I understand the stonith
>> action timeout should be at least greater than the wdt so stonithd will not
>> raise a timeout before the wdt had a chance to expire and reset the node.
>> Is it right?

stonith-timeout is the cluster-wide default for how long to wait for
stonith devices to carry out their duty. In the sbd case without a
block-device (sbd used for pacemaker to be observed by a hardware watchdog)
it shouldn't play a role.
When a block-device is being used, stonith-timeout guards the communication
with the fence-agent that talks to the block-device.

> Anyone on these questions? I am currently writing some more doc/cookbook for
> the PAF project[1], I would prefer being sure of what is written there :)
>
> [1] http://dalibo.github.io/PAF/documentation.html
>
> Regards,
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
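
For illustration only, the rules of thumb from the answers above can be
written down as a small sanity check. This is a rough sketch, not anything
that sbd or pacemaker ships: the function names and the idea of returning a
list of warnings are made up for this example, and the only inputs it relies
on are the values discussed in this thread (SBD_WATCHDOG_TIMEOUT, the sbd
loop timeout with its 1s default, and stonith-watchdog-timeout).

```python
def suggested_stonith_watchdog_timeout(sbd_watchdog_timeout_s):
    """Rule of thumb from this thread: double the hardware-watchdog timeout.

    Hypothetical helper, not part of sbd or pacemaker.
    """
    if sbd_watchdog_timeout_s <= 0:
        raise ValueError("SBD_WATCHDOG_TIMEOUT must be positive")
    return 2 * sbd_watchdog_timeout_s


def check_timeouts(sbd_watchdog_timeout_s, stonith_watchdog_timeout_s,
                   loop_timeout_s=1):
    """Return a list of warnings; an empty list means the rules of thumb hold.

    Hypothetical helper. loop_timeout_s defaults to 1, matching the default
    loop timeout mentioned in this thread.
    """
    warnings = []

    # sbd has to trigger the watchdog more often than the watchdog expires.
    if loop_timeout_s >= sbd_watchdog_timeout_s:
        warnings.append(
            "loop timeout (%ss) should be shorter than SBD_WATCHDOG_TIMEOUT (%ss)"
            % (loop_timeout_s, sbd_watchdog_timeout_s))

    # The cluster should wait long enough to assume the watchdog has fired.
    suggested = suggested_stonith_watchdog_timeout(sbd_watchdog_timeout_s)
    if stonith_watchdog_timeout_s < suggested:
        warnings.append(
            "stonith-watchdog-timeout (%ss) is below the rule of thumb (%ss)"
            % (stonith_watchdog_timeout_s, suggested))

    return warnings


if __name__ == "__main__":
    # Example values: SBD_WATCHDOG_TIMEOUT=5s, stonith-watchdog-timeout=10s.
    for line in check_timeouts(5, 10) or ["no warnings"]:
        print(line)
```

With SBD_WATCHDOG_TIMEOUT=5 and stonith-watchdog-timeout=10 the check
returns no warnings. stonith-timeout is deliberately left out, since in the
watchdog-only case it should not play a role.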