Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
05.12.2017 13:34, Gao,Yan wrote:
> On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote:
>> On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
>>> 04.12.2017 14:48, Gao,Yan wrote:
>>>> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>>>>> 30.11.2017 13:48, Gao,Yan wrote:
>>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster
>>>>>>> with VMs on VSphere using a shared VMDK as SBD. During basic
>>>>>>> tests by killing corosync and forcing STONITH, pacemaker was not
>>>>>>> started after reboot. In the logs I see during boot:
>>>>>>>
>>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>>>>> process (3151) can no longer be respawned,
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
>>>>>>> Pacemaker
>>>>>>>
>>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems
>>>>>>> that stonith with SBD always takes msgwait (at least, visually
>>>>>>> the host is not declared as OFFLINE until 120s have passed). But
>>>>>>> the VM reboots lightning fast and is up and running long before
>>>>>>> the timeout expires.
>>>>>>>
>>>>>>> I think I have seen a similar report already. Is it something
>>>>>>> that can be fixed by SBD/pacemaker tuning?
>>>>>>
>>>>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>>>
>>>>> I tried it (on openSUSE Tumbleweed, which is what I have at hand;
>>>>> it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear
>>>>> to watch the disk at all.
>>>>
>>>> It simply waits that long on startup before starting the rest of the
>>>> cluster stack, to make sure the fencing that targeted it has
>>>> returned. It intentionally doesn't watch anything during this period
>>>> of time.
>>>
>>> Unfortunately it waits too long.
>>>
>>> ha1:~ # systemctl status sbd.service
>>> ● sbd.service - Shared-storage based fencing daemon
>>>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
>>>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
>>>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
>>>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
>>>  Main PID: 1792 (code=exited, status=0/SUCCESS)
>>>
>>> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
>>> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
>>> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>>>
>>> But the real problem is: in spite of SBD failing to start, the whole
>>> cluster stack continues to run; and because SBD blindly trusts in
>>> well-behaving nodes, fencing appears to succeed after the timeout ...
>>> without anyone taking any action on the poison pill ...
>>
>> That's something I always wondered about: if a node is capable of
>> reading a poison pill, then it could also write an "I'm leaving"
>> message into its slot before shutdown. Wouldn't that make sbd more
>> reliable? Any reason not to implement that?
>
> Probably it's not considered necessary :) SBD is a fencing mechanism
> which only needs to ensure fencing works.

I'm sorry, but SBD has zero chance to ensure fencing works. Recently I
did a storage vMotion of a VM with the shared VMDK for SBD - it silently
created a copy of the VMDK which was indistinguishable from the original
one. As a result, both VMs ran with their own copy. Of course fencing
did not work - but each VM *assumed* it worked, because it posted the
message and waited for the timeout ...
I would expect the "monitor" action of the SBD fencing agent to actually
test whether messages are seen by the remote node(s) ...

> SBD on the fencing target is either there eating the pill or getting
> reset by the watchdog; otherwise it's not there, which is supposed to
> imply the whole cluster stack is not running, so that it doesn't need
> to actually eat the pill.
>
> How systemd should handle the service dependencies is another topic...

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
05.12.2017 12:59, Gao,Yan wrote:
> On 12/04/2017 07:55 PM, Andrei Borzenkov wrote:
>> 04.12.2017 14:48, Gao,Yan wrote:
>>> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
>>>> 30.11.2017 13:48, Gao,Yan wrote:
>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster
>>>>>> with VMs on VSphere using a shared VMDK as SBD. During basic tests
>>>>>> by killing corosync and forcing STONITH, pacemaker was not started
>>>>>> after reboot. In the logs I see during boot:
>>>>>>
>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>>>> process (3151) can no longer be respawned,
>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
>>>>>> Pacemaker
>>>>>>
>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems
>>>>>> that stonith with SBD always takes msgwait (at least, visually the
>>>>>> host is not declared as OFFLINE until 120s have passed). But the
>>>>>> VM reboots lightning fast and is up and running long before the
>>>>>> timeout expires.
>>>>>>
>>>>>> I think I have seen a similar report already. Is it something that
>>>>>> can be fixed by SBD/pacemaker tuning?
>>>>>
>>>>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>>>
>>>> I tried it (on openSUSE Tumbleweed, which is what I have at hand; it
>>>> has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to
>>>> watch the disk at all.
>>>
>>> It simply waits that long on startup before starting the rest of the
>>> cluster stack to make sure the fencing that targeted it has returned.
>>> It intentionally doesn't watch anything during this period of time.
>>
>> Unfortunately it waits too long.
>>
>> ha1:~ # systemctl status sbd.service
>> ● sbd.service - Shared-storage based fencing daemon
>>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago
>>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS)
>>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa
>>  Main PID: 1792 (code=exited, status=0/SUCCESS)
>>
>> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon...
>> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating.
>> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon.
>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>>
>> But the real problem is: in spite of SBD failing to start, the whole
>> cluster stack continues to run; and because SBD blindly trusts in
>> well-behaving nodes, fencing appears to succeed after the timeout ...
>> without anyone taking any action on the poison pill ...
>
> Start of sbd reaches systemd's timeout for starting units and systemd
> proceeds...

You consider it normal and intended behavior? Again - currently it is
possible that the cluster stack starts without having working STONITH,
and because there is no confirmation whether stonith via SBD worked at
all, we get into split brain.

> TimeoutStartSec should be configured in sbd.service accordingly to be
> longer than msgwait.

And where is it documented? You did not say it earlier,
/etc/sysconfig/sbd does not say it, "man sbd" does not say it. How
should users be aware of this?
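A sketch of the two settings discussed in this exchange, assuming the
SUSE-style /etc/sysconfig/sbd path and the thread's msgwait of 120s; the
drop-in file name and the 180s value are illustrative choices, not
anything the thread prescribes:

```shell
# Assumed values from this thread: watchdog timeout 60s, msgwait 120s.

# 1) Delay sbd startup so a just-fenced node does not rejoin the
#    cluster before the fencing that targeted it has completed:
echo 'SBD_DELAY_START=yes' >> /etc/sysconfig/sbd   # SUSE-style path

# 2) Raise systemd's start timeout above msgwait via a drop-in, so
#    sbd.service does not fail with "Start operation timed out"
#    (systemd's default start timeout is shorter than 120s):
mkdir -p /etc/systemd/system/sbd.service.d
cat > /etc/systemd/system/sbd.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=180
EOF
systemctl daemon-reload
```

The point of the drop-in (rather than editing the unit file) is that it
survives package updates of sbd.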
Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
On Tue, 2017-12-05 at 17:43 +0100, Jehan-Guillaume de Rorthais wrote: > On Tue, 05 Dec 2017 08:59:55 -0600 > Ken Gaillotwrote: > > > On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote: > > > > > > Tomas Jelinek schrieb am 04.12.2017 > > > > > > um > > > > > > 16:50 in Nachricht > > > > > > <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>: > > > > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a): > > > > > On Mon, 4 Dec 2017 12:31:06 +0100 > > > > > Tomas Jelinek wrote: > > > > > > > > > > > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais > > > > > > napsal(a): > > > > > > > On Fri, 01 Dec 2017 16:34:08 -0600 > > > > > > > Ken Gaillot wrote: > > > > > > > > > > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Kristoffer Gronlund wrote: > > > > > > > > > > > Adam Spiers writes: > > > > > > > > > > > > > > > > > > > > > > > - The whole cluster is shut down cleanly. > > > > > > > > > > > > > > > > > > > > > > > > - The whole cluster is then started up > > > > > > > > > > > > again. (Side question: > > > > > > > > > > > > what > > > > > > > > > > > > happens if the last node to shut down is > > > > > > > > > > > > not > > > > > > > > > > > > the first to > > > > > > > > > > > > start up? > > > > > > > > > > > > How will the cluster ensure it has the most > > > > > > > > > > > > recent version of > > > > > > > > > > > > the > > > > > > > > > > > > CIB? Without that, how would it know > > > > > > > > > > > > whether > > > > > > > > > > > > the last man > > > > > > > > > > > > standing > > > > > > > > > > > > was shut down cleanly or not?) > > > > > > > > > > > > > > > > > > > > > > This is my opinion, I don't really know what the > > > > > > > > > > > "official" > > > > > > > > > > > pacemaker > > > > > > > > > > > stance is: There is no such thing as shutting > > > > > > > > > > > down a > > > > > > > > > > > cluster > > > > > > > > > > > cleanly. 
A > > > > > > > > > > > cluster is a process stretching over multiple > > > > > > > > > > > nodes - > > > > > > > > > > > if they all > > > > > > > > > > > shut > > > > > > > > > > > down, the process is gone. When you start up > > > > > > > > > > > again, > > > > > > > > > > > you > > > > > > > > > > > effectively have > > > > > > > > > > > a completely new cluster. > > > > > > > > > > > > > > > > > > > > Sorry, I don't follow you at all here. When you > > > > > > > > > > start > > > > > > > > > > the cluster > > > > > > > > > > up > > > > > > > > > > again, the cluster config from before the shutdown > > > > > > > > > > is > > > > > > > > > > still there. > > > > > > > > > > That's very far from being a completely new cluster > > > > > > > > > > :-) > > > > > > > > > > > > > > > > > > The problem is you cannot "start the cluster" in > > > > > > > > > pacemaker; you can > > > > > > > > > only "start nodes". The nodes will come up one by > > > > > > > > > one. As > > > > > > > > > opposed (as > > > > > > > > > I had said) to HP Sertvice Guard, where there is a > > > > > > > > > "cluster formation > > > > > > > > > timeout". That is, the nodes wait for the specified > > > > > > > > > time > > > > > > > > > for the > > > > > > > > > cluster to "form". Then the cluster starts as a > > > > > > > > > whole. Of > > > > > > > > > course that > > > > > > > > > only applies if the whole cluster was down, not if a > > > > > > > > > single node was > > > > > > > > > down. > > > > > > > > > > > > > > > > I'm not sure what that would specifically entail, but > > > > > > > > I'm > > > > > > > > guessing we > > > > > > > > have some of the pieces already: > > > > > > > > > > > > > > > > - Corosync has a wait_for_all option if you want the > > > > > > > > cluster to be > > > > > > > > unable to have quorum at start-up until every node has > > > > > > > > joined. I don't > > > > > > > > think you can set a timeout that cancels it, though. 
> > > > > > > > > > > > > > > > - Pacemaker will wait dc-deadtime for the first DC > > > > > > > > election > > > > > > > > to > > > > > > > > complete. (if I understand it correctly ...) > > > > > > > > > > > > > > > > - Higher-level tools can start or stop all nodes > > > > > > > > together > > > > > > > > (e.g. pcs has > > > > > > > > pcs cluster start/stop --all). > > > > > > > > > > > > > > Based on this discussion, I have some questions about > > > > > > > pcs: > > > > > > > > > > > > > > * how is it shutting down the cluster when issuing "pcs > > > > > > > cluster stop > > > > > > > --all"? > > > > > > > > > > > > First, it sends a request to each node to stop pacemaker. > > > > > > The > > > > > > requests > > > > > > are sent in parallel which prevents resources from being > > > > > > moved > > > > > > from node > > > > > > to node. Once pacemaker stops on all nodes, corosync
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 03:11 PM, Ulrich Windl wrote:
> >>> "Gao,Yan" wrote on 05.12.2017 at 15:04 in message:
>> On 12/05/2017 12:41 PM, Ulrich Windl wrote:
>>> >>> "Gao,Yan" wrote on 01.12.2017 at 20:36 in message:
>>> [...]
>>>
>>> I meant: There are three delays:
>>> 1) The delay until data is on the disk
>>
>> It takes several IOs for the sender to do this -- read the device
>> header, look up the slot, write the message and verify the message is
>> written (timeout_io defaults to 3s).
>>
>> As mentioned, the msgwait timer of the sender starts only after the
>> message has been verified to be written. We just need to make sure
>> stonith-timeout is configured long enough, longer than the sum.
>>
>>> 2) Delay until data is read from the disk
>>
>> It's already taken into account with msgwait. Considering the
>> recipient keeps reading in a loop, we don't know when exactly it
>> starts to read for this specific message. But once it starts a read,
>> it has to be done within timeout_watchdog, otherwise the watchdog
>> triggers. So even for a bad case, the message should be read within
>> 2 * timeout_watchdog. That's the reason why the sender has to wait
>> msgwait, which is 2 * timeout_watchdog.
>>
>>> 3) Delay until host was killed
>>
>> Kill is basically immediately triggered once the poison pill is read.
>
> Considering that the response time of a SAN disk system with cache is
> typically a very few microseconds, writing to disk may be even "more
> immediate" than killing the node via watchdog reset ;-)

Well, it's possible :) Timeout matters for "bad cases" though. Compared
with a disk IO facing difficulties like path failure and so on,
triggering the watchdog is trivial.

> So you can't easily say one is immediate, while the other has to be
> waited for IMHO.

Of course an even longer msgwait, with all the factors that you can
think of taken into account, will be even safer.

Regards,
  Yan

> Regards,
> Ulrich

>>> A confirmation before 3) could shorten the total wait that includes
>>> 2) and 3), right?
>>
>> As mentioned in another email, an alive node, even if indeed coming
>> back from death, cannot actually confirm itself, or even give a
>> confirmation about whether it was ever dead. And a successful fencing
>> means the node being dead.
>>
>> Regards,
>>   Yan
>
> Regards,
> Ulrich

>>> Regards,
>>> Yan
>>> [...]
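The timing relations described above can be sketched with the values
from the original report (watchdog 60s, timeout_io 3s); the numbers are
only illustrative, not read from any SBD configuration:

```shell
# Timing relations from this thread, with illustrative values.
timeout_watchdog=60   # seconds; the report's "60s for watchdog"
timeout_io=3          # seconds; the default timeout_io mentioned above

# msgwait = 2 * timeout_watchdog: in a bad case the reader needs up to
# two watchdog periods before it has picked up the poison pill (or been
# reset by its own watchdog).
msgwait=$(( 2 * timeout_watchdog ))

# Pacemaker's stonith-timeout must exceed the time to write and verify
# the message (timeout_io) plus msgwait; extra margin is sensible.
stonith_timeout_min=$(( timeout_io + msgwait ))

echo "msgwait=$msgwait"                          # msgwait=120
echo "stonith-timeout > ${stonith_timeout_min}s" # stonith-timeout > 123s
```

This also shows why TimeoutStartSec for sbd.service must exceed msgwait
when SBD_DELAY_START is enabled: the delayed start alone takes 120s here.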
Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
On Tue, 05 Dec 2017 08:59:55 -0600 Ken Gaillotwrote: > On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote: > > > > > Tomas Jelinek schrieb am 04.12.2017 um > > > > > 16:50 in Nachricht > > > > <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>: > > > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a): > > > > On Mon, 4 Dec 2017 12:31:06 +0100 > > > > Tomas Jelinek wrote: > > > > > > > > > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a): > > > > > > On Fri, 01 Dec 2017 16:34:08 -0600 > > > > > > Ken Gaillot wrote: > > > > > > > > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Kristoffer Gronlund wrote: > > > > > > > > > > Adam Spiers writes: > > > > > > > > > > > > > > > > > > > > > - The whole cluster is shut down cleanly. > > > > > > > > > > > > > > > > > > > > > > - The whole cluster is then started up > > > > > > > > > > > again. (Side question: > > > > > > > > > > > what > > > > > > > > > > > happens if the last node to shut down is not > > > > > > > > > > > the first to > > > > > > > > > > > start up? > > > > > > > > > > > How will the cluster ensure it has the most > > > > > > > > > > > recent version of > > > > > > > > > > > the > > > > > > > > > > > CIB? Without that, how would it know whether > > > > > > > > > > > the last man > > > > > > > > > > > standing > > > > > > > > > > > was shut down cleanly or not?) > > > > > > > > > > > > > > > > > > > > This is my opinion, I don't really know what the > > > > > > > > > > "official" > > > > > > > > > > pacemaker > > > > > > > > > > stance is: There is no such thing as shutting down a > > > > > > > > > > cluster > > > > > > > > > > cleanly. A > > > > > > > > > > cluster is a process stretching over multiple nodes - > > > > > > > > > > if they all > > > > > > > > > > shut > > > > > > > > > > down, the process is gone. 
When you start up again, > > > > > > > > > > you > > > > > > > > > > effectively have > > > > > > > > > > a completely new cluster. > > > > > > > > > > > > > > > > > > Sorry, I don't follow you at all here. When you start > > > > > > > > > the cluster > > > > > > > > > up > > > > > > > > > again, the cluster config from before the shutdown is > > > > > > > > > still there. > > > > > > > > > That's very far from being a completely new cluster :-) > > > > > > > > > > > > > > > > The problem is you cannot "start the cluster" in > > > > > > > > pacemaker; you can > > > > > > > > only "start nodes". The nodes will come up one by one. As > > > > > > > > opposed (as > > > > > > > > I had said) to HP Sertvice Guard, where there is a > > > > > > > > "cluster formation > > > > > > > > timeout". That is, the nodes wait for the specified time > > > > > > > > for the > > > > > > > > cluster to "form". Then the cluster starts as a whole. Of > > > > > > > > course that > > > > > > > > only applies if the whole cluster was down, not if a > > > > > > > > single node was > > > > > > > > down. > > > > > > > > > > > > > > I'm not sure what that would specifically entail, but I'm > > > > > > > guessing we > > > > > > > have some of the pieces already: > > > > > > > > > > > > > > - Corosync has a wait_for_all option if you want the > > > > > > > cluster to be > > > > > > > unable to have quorum at start-up until every node has > > > > > > > joined. I don't > > > > > > > think you can set a timeout that cancels it, though. > > > > > > > > > > > > > > - Pacemaker will wait dc-deadtime for the first DC election > > > > > > > to > > > > > > > complete. (if I understand it correctly ...) > > > > > > > > > > > > > > - Higher-level tools can start or stop all nodes together > > > > > > > (e.g. pcs has > > > > > > > pcs cluster start/stop --all). 
> > > > > > > > > > > > Based on this discussion, I have some questions about pcs: > > > > > > > > > > > > * how is it shutting down the cluster when issuing "pcs > > > > > > cluster stop > > > > > > --all"? > > > > > > > > > > First, it sends a request to each node to stop pacemaker. The > > > > > requests > > > > > are sent in parallel which prevents resources from being moved > > > > > from node > > > > > to node. Once pacemaker stops on all nodes, corosync is stopped > > > > > on all > > > > > nodes in the same manner. > > > > > > > > What if for some external reasons one node is slower (load, > > > > network, > > > > > > whatever) > > > > than the others and start reacting ? Sending queries in parallel > > > > doesn't > > > > feels safe enough in regard with all the race conditions that can > > > > occurs in > > > > > > the > > > > same time. > > > > > > > > Am I missing something ? > > > > > > > > > > If a node gets the request later than others, some resources may > > > be > > > moved to it before it starts shutting down pacemaker as well. Pcs >
Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> >>> Tomas Jelinek wrote on 04.12.2017 at 16:50 in message
> >>> <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>:
>> On 4.12.2017 at 14:21, Jehan-Guillaume de Rorthais wrote:
>>> On Mon, 4 Dec 2017 12:31:06 +0100
>>> Tomas Jelinek wrote:
>>>> On 4.12.2017 at 10:36, Jehan-Guillaume de Rorthais wrote:
>>>>> On Fri, 01 Dec 2017 16:34:08 -0600
>>>>> Ken Gaillot wrote:
>>>>>> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
>>>>>>>>> Kristoffer Gronlund wrote:
>>>>>>>>>> Adam Spiers writes:
>>>>>>>>>>> - The whole cluster is shut down cleanly.
>>>>>>>>>>>
>>>>>>>>>>> - The whole cluster is then started up again. (Side question:
>>>>>>>>>>> what happens if the last node to shut down is not the first
>>>>>>>>>>> to start up? How will the cluster ensure it has the most
>>>>>>>>>>> recent version of the CIB? Without that, how would it know
>>>>>>>>>>> whether the last man standing was shut down cleanly or not?)
>>>>>>>>>>
>>>>>>>>>> This is my opinion, I don't really know what the "official"
>>>>>>>>>> pacemaker stance is: There is no such thing as shutting down a
>>>>>>>>>> cluster cleanly. A cluster is a process stretching over
>>>>>>>>>> multiple nodes - if they all shut down, the process is gone.
>>>>>>>>>> When you start up again, you effectively have a completely new
>>>>>>>>>> cluster.
>>>>>>>>>
>>>>>>>>> Sorry, I don't follow you at all here. When you start the
>>>>>>>>> cluster up again, the cluster config from before the shutdown
>>>>>>>>> is still there. That's very far from being a completely new
>>>>>>>>> cluster :-)
>>>>>>>>
>>>>>>>> The problem is you cannot "start the cluster" in pacemaker; you
>>>>>>>> can only "start nodes". The nodes will come up one by one. As
>>>>>>>> opposed (as I had said) to HP Service Guard, where there is a
>>>>>>>> "cluster formation timeout". That is, the nodes wait for the
>>>>>>>> specified time for the cluster to "form". Then the cluster
>>>>>>>> starts as a whole. Of course that only applies if the whole
>>>>>>>> cluster was down, not if a single node was down.
>>>>>>>
>>>>>>> I'm not sure what that would specifically entail, but I'm
>>>>>>> guessing we have some of the pieces already:
>>>>>>>
>>>>>>> - Corosync has a wait_for_all option if you want the cluster to
>>>>>>> be unable to have quorum at start-up until every node has joined.
>>>>>>> I don't think you can set a timeout that cancels it, though.
>>>>>>>
>>>>>>> - Pacemaker will wait dc-deadtime for the first DC election to
>>>>>>> complete. (if I understand it correctly ...)
>>>>>>>
>>>>>>> - Higher-level tools can start or stop all nodes together
>>>>>>> (e.g. pcs has pcs cluster start/stop --all).
>>>>>>
>>>>>> Based on this discussion, I have some questions about pcs:
>>>>>>
>>>>>> * how is it shutting down the cluster when issuing "pcs cluster
>>>>>> stop --all"?
>>>>>
>>>>> First, it sends a request to each node to stop pacemaker. The
>>>>> requests are sent in parallel, which prevents resources from being
>>>>> moved from node to node. Once pacemaker stops on all nodes,
>>>>> corosync is stopped on all nodes in the same manner.
>>>>
>>>> What if for some external reason one node is slower (load, network,
>>>> whatever) than the others and starts reacting? Sending queries in
>>>> parallel doesn't feel safe enough with regard to all the race
>>>> conditions that can occur at the same time.
>>>>
>>>> Am I missing something?
>>>
>>> If a node gets the request later than others, some resources may be
>>> moved to it before it starts shutting down pacemaker as well. Pcs
>>> waits
>
> I think that's impossible due to the ordering of corosync: If a
> standby is issued, and a resource migration is the consequence, every
> node will see the standby before it sees any other config change.
> Right?

pcs doesn't issue a standby, just a shutdown. When a node needs to shut
down, it sends
[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
>>> "Gao,Yan" wrote on 05.12.2017 at 15:04 in message:
> On 12/05/2017 12:41 PM, Ulrich Windl wrote:
>> >>> "Gao,Yan" wrote on 01.12.2017 at 20:36 in message:
>> [...]
>>
>> I meant: There are three delays:
>> 1) The delay until data is on the disk
>
> It takes several IOs for the sender to do this -- read the device
> header, look up the slot, write the message and verify the message is
> written (timeout_io defaults to 3s).
>
> As mentioned, the msgwait timer of the sender starts only after the
> message has been verified to be written. We just need to make sure
> stonith-timeout is configured long enough, longer than the sum.
>
>> 2) Delay until data is read from the disk
>
> It's already taken into account with msgwait. Considering the recipient
> keeps reading in a loop, we don't know when exactly it starts to read
> for this specific message. But once it starts a read, it has to be done
> within timeout_watchdog, otherwise the watchdog triggers. So even for a
> bad case, the message should be read within 2 * timeout_watchdog.
> That's the reason why the sender has to wait msgwait, which is
> 2 * timeout_watchdog.
>
>> 3) Delay until host was killed
>
> Kill is basically immediately triggered once the poison pill is read.

Considering that the response time of a SAN disk system with cache is
typically a very few microseconds, writing to disk may be even "more
immediate" than killing the node via watchdog reset ;-) So you can't
easily say one is immediate, while the other has to be waited for IMHO.

Regards,
Ulrich

>> A confirmation before 3) could shorten the total wait that includes
>> 2) and 3), right?
>
> As mentioned in another email, an alive node, even if indeed coming
> back from death, cannot actually confirm itself, or even give a
> confirmation about whether it was ever dead. And a successful fencing
> means the node being dead.
>
> Regards,
>   Yan
>
>> Regards,
>> Ulrich
>
>>> Regards,
>>> Yan
>>> [...]
Re: [ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 12:41 PM, Ulrich Windl wrote:
> >>> "Gao,Yan" wrote on 01.12.2017 at 20:36 in message:
>> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>>> 30.11.2017 16:11, Klaus Wenninger wrote:
>>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:
>>>>> >>> "Gao,Yan" wrote on 30.11.2017 at 11:48 in message:
>>>>>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>>>>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two-node cluster
>>>>>>> with VMs on VSphere using a shared VMDK as SBD. During basic
>>>>>>> tests by killing corosync and forcing STONITH, pacemaker was not
>>>>>>> started after reboot. In the logs I see during boot:
>>>>>>>
>>>>>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>>>>>> just fenced by sapprod01p for sapprod01p
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
>>>>>>> process (3151) can no longer be respawned,
>>>>>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting
>>>>>>> down Pacemaker
>>>>>>>
>>>>>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems
>>>>>>> that stonith with SBD always takes msgwait (at least, visually
>>>>>>> the host is not declared as OFFLINE until 120s have passed). But
>>>>>>> the VM reboots lightning fast and is up and running long before
>>>>>>> the timeout expires.
>>>>>
>>>>> As msgwait was intended for the message to arrive, and not for the
>>>>> reboot time (I guess), this just shows a fundamental problem in SBD
>>>>> design: Receipt of the fencing command is not confirmed (other than
>>>>> by seeing the consequences of its execution).
>>>>
>>>> The 2 x msgwait is not for confirmations but for writing the
>>>> poison-pill and for having it read by the target side.
>>>
>>> Yes, of course, but that's not what Ulrich likely intended to say.
>>> msgwait must account for worst-case storage path latency, while in
>>> normal cases it happens much faster. If the fenced node could
>>> acknowledge having been killed after reboot, the stonith agent could
>>> return success much earlier.
>>
>> How could an alive man be sure he died before? ;)
>
> I meant: There are three delays:
> 1) The delay until data is on the disk

It takes several IOs for the sender to do this -- read the device
header, look up the slot, write the message and verify the message is
written (timeout_io defaults to 3s).

As mentioned, the msgwait timer of the sender starts only after the
message has been verified to be written. We just need to make sure
stonith-timeout is configured long enough, longer than the sum.

> 2) Delay until data is read from the disk

It's already taken into account with msgwait. Considering the recipient
keeps reading in a loop, we don't know when exactly it starts to read
for this specific message. But once it starts a read, it has to be done
within timeout_watchdog, otherwise the watchdog triggers. So even for a
bad case, the message should be read within 2 * timeout_watchdog. That's
the reason why the sender has to wait msgwait, which is
2 * timeout_watchdog.

> 3) Delay until host was killed

Kill is basically immediately triggered once the poison pill is read.

> A confirmation before 3) could shorten the total wait that includes 2)
> and 3), right?

As mentioned in another email, an alive node, even if indeed coming back
from death, cannot actually confirm itself, or even give a confirmation
about whether it was ever dead. And a successful fencing means the node
being dead.

Regards,
  Yan

> Regards,
> Ulrich

>>> Regards,
>>> Yan
[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
>>> Tomas Jelinekschrieb am 04.12.2017 um 16:50 in >>> Nachricht <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>: > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a): >> On Mon, 4 Dec 2017 12:31:06 +0100 >> Tomas Jelinek wrote: >> >>> Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a): On Fri, 01 Dec 2017 16:34:08 -0600 Ken Gaillot wrote: > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote: >> >> >>> Kristoffer Gronlund wrote: Adam Spiers writes: > - The whole cluster is shut down cleanly. > > - The whole cluster is then started up again. (Side question: > what > happens if the last node to shut down is not the first to > start up? > How will the cluster ensure it has the most recent version of > the > CIB? Without that, how would it know whether the last man > standing > was shut down cleanly or not?) This is my opinion, I don't really know what the "official" pacemaker stance is: There is no such thing as shutting down a cluster cleanly. A cluster is a process stretching over multiple nodes - if they all shut down, the process is gone. When you start up again, you effectively have a completely new cluster. >>> >>> Sorry, I don't follow you at all here. When you start the cluster >>> up >>> again, the cluster config from before the shutdown is still there. >>> That's very far from being a completely new cluster :-) >> >> The problem is you cannot "start the cluster" in pacemaker; you can >> only "start nodes". The nodes will come up one by one. As opposed (as >> I had said) to HP Sertvice Guard, where there is a "cluster formation >> timeout". That is, the nodes wait for the specified time for the >> cluster to "form". Then the cluster starts as a whole. Of course that >> only applies if the whole cluster was down, not if a single node was >> down. 
> > I'm not sure what that would specifically entail, but I'm guessing we > have some of the pieces already: > > - Corosync has a wait_for_all option if you want the cluster to be > unable to have quorum at start-up until every node has joined. I don't > think you can set a timeout that cancels it, though. > > - Pacemaker will wait dc-deadtime for the first DC election to > complete. (if I understand it correctly ...) > > - Higher-level tools can start or stop all nodes together (e.g. pcs has > pcs cluster start/stop --all). Based on this discussion, I have some questions about pcs: * how is it shutting down the cluster when issuing "pcs cluster stop --all"? >>> >>> First, it sends a request to each node to stop pacemaker. The requests >>> are sent in parallel which prevents resources from being moved from node >>> to node. Once pacemaker stops on all nodes, corosync is stopped on all >>> nodes in the same manner. >> >> What if for some external reasons one node is slower (load, network, > whatever) >> than the others and start reacting ? Sending queries in parallel doesn't >> feels safe enough in regard with all the race conditions that can occurs in > the >> same time. >> >> Am I missing something ? >> > > If a node gets the request later than others, some resources may be > moved to it before it starts shutting down pacemaker as well. Pcs waits I think that's impossible due to the ordering of corosync: If a standby is issued, and a resource migration is the consequence, every node will see the standby before it sees any other config change. Right? > for all nodes to shutdown pacemaker before it moves to shutting down > corosync. This way, quorum is maintained the whole time pacemaker is > shutting down and therefore no services are blocked from stopping due to > lack of quorum. 
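The two-phase order Tomas describes -- stop pacemaker on every node in parallel, and only once it is down everywhere stop corosync, so quorum holds for every resource stop -- can be sketched roughly as follows. This is an illustration, not pcs source; pcs actually talks to pcsd over HTTPS, so the `run` stub below (hypothetical node names included) just prints what would be executed:

```shell
# Illustrative two-phase cluster stop, mirroring the order described above.
nodes="node1 node2 node3"                 # example node names
run() { echo "$1: systemctl stop $2"; }   # stand-in for the pcs/pcsd transport

# Phase 1: stop pacemaker on all nodes in parallel, so no resources can
# migrate to a node that is already shutting down.
for n in $nodes; do run "$n" pacemaker & done
wait
# Phase 2: only now stop corosync -- quorum was intact while every
# resource was being stopped.
for n in $nodes; do run "$n" corosync & done
wait
```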
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
>>> Klaus Wenningerschrieb am 04.12.2017 um 16:20 in Nachricht <24ed4710-322d-0560-b0ff-792cbf53d...@redhat.com>: > On 12/04/2017 04:02 PM, Kristoffer Grönlund wrote: >> Tomas Jelinek writes: >> * how is it shutting down the cluster when issuing "pcs cluster stop --all"? >>> First, it sends a request to each node to stop pacemaker. The requests >>> are sent in parallel which prevents resources from being moved from node >>> to node. Once pacemaker stops on all nodes, corosync is stopped on all >>> nodes in the same manner. >>> * any race condition possible where the cib will record only one node up > before the last one shut down? * will the cluster start safely? >> That definitely sounds racy to me. The best idea I can think of would be >> to set all nodes except one in standby, and then shutdown pacemaker >> everywhere... > > Really mean standby or rather maintenance to keep resources > from switching to the still alive nodes during shutdown? If all nodes are set to standby in one transaction, there is no switching, just a stop of all resources. > > Regards, > Klaus > > ___ > Users mailing list: Users@clusterlabs.org > http://lists.clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
>>> Kristoffer Grönlundschrieb am 04.12.2017 um 16:02 in Nachricht <87fu8qedu0@suse.com>: > Tomas Jelinek writes: > >>> >>> * how is it shutting down the cluster when issuing "pcs cluster stop --all"? >> >> First, it sends a request to each node to stop pacemaker. The requests >> are sent in parallel which prevents resources from being moved from node >> to node. Once pacemaker stops on all nodes, corosync is stopped on all >> nodes in the same manner. >> >>> * any race condition possible where the cib will record only one node up > before >>>the last one shut down? >>> * will the cluster start safely? > > That definitely sounds racy to me. The best idea I can think of would be > to set all nodes except one in standby, and then shutdown pacemaker > everywhere... Why not all nodes? > > -- > // Kristoffer Grönlund > // kgronl...@suse.com > > ___ > Users mailing list: Users@clusterlabs.org > http://lists.clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing
>>> Jehan-Guillaume de Rorthaisschrieb am 04.12.2017 um >>> 14:21 in Nachricht <20171204142148.446ec356@firost>: > On Mon, 4 Dec 2017 12:31:06 +0100 > Tomas Jelinek wrote: > >> Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a): >> > On Fri, 01 Dec 2017 16:34:08 -0600 >> > Ken Gaillot wrote: >> > >> >> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote: >> >>> >> >>> >> Kristoffer Gronlund wrote: >> > Adam Spiers writes: >> > >> >> - The whole cluster is shut down cleanly. >> >> >> >> - The whole cluster is then started up again. (Side question: >> >> what >> >>happens if the last node to shut down is not the first to >> >> start up? >> >>How will the cluster ensure it has the most recent version of >> >> the >> >>CIB? Without that, how would it know whether the last man >> >> standing >> >>was shut down cleanly or not?) >> > >> > This is my opinion, I don't really know what the "official" >> > pacemaker >> > stance is: There is no such thing as shutting down a cluster >> > cleanly. A >> > cluster is a process stretching over multiple nodes - if they all >> > shut >> > down, the process is gone. When you start up again, you >> > effectively have >> > a completely new cluster. >> >> Sorry, I don't follow you at all here. When you start the cluster >> up >> again, the cluster config from before the shutdown is still there. >> That's very far from being a completely new cluster :-) >> >>> >> >>> The problem is you cannot "start the cluster" in pacemaker; you can >> >>> only "start nodes". The nodes will come up one by one. As opposed (as >> >>> I had said) to HP Sertvice Guard, where there is a "cluster formation >> >>> timeout". That is, the nodes wait for the specified time for the >> >>> cluster to "form". Then the cluster starts as a whole. Of course that >> >>> only applies if the whole cluster was down, not if a single node was >> >>> down. 
>> >> >> >> I'm not sure what that would specifically entail, but I'm guessing we >> >> have some of the pieces already: >> >> >> >> - Corosync has a wait_for_all option if you want the cluster to be >> >> unable to have quorum at start-up until every node has joined. I don't >> >> think you can set a timeout that cancels it, though. >> >> >> >> - Pacemaker will wait dc-deadtime for the first DC election to >> >> complete. (if I understand it correctly ...) >> >> >> >> - Higher-level tools can start or stop all nodes together (e.g. pcs has >> >> pcs cluster start/stop --all). >> > >> > Based on this discussion, I have some questions about pcs: >> > >> > * how is it shutting down the cluster when issuing "pcs cluster stop >> > --all"? >> >> First, it sends a request to each node to stop pacemaker. The requests >> are sent in parallel which prevents resources from being moved from node >> to node. Once pacemaker stops on all nodes, corosync is stopped on all >> nodes in the same manner. > > What if for some external reason one node is slower (load, network, > whatever) > than the others and starts reacting later? Sending queries in parallel doesn't > feel safe enough with regard to all the race conditions that can occur at > the > same time. > > Am I missing something? I can only agree that this type of "cluster shutdown" is unclean, most likely leaving each node with a different CIB (and many aborted transitions). ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.
>>> "Gao,Yan"schrieb am 01.12.2017 um 20:36 in Nachricht : > On 11/30/2017 06:48 PM, Andrei Borzenkov wrote: >> 30.11.2017 16:11, Klaus Wenninger пишет: >>> On 11/30/2017 01:41 PM, Ulrich Windl wrote: >>> "Gao,Yan" schrieb am 30.11.2017 um 11:48 in Nachricht : > On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: >> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with >> VM on VSphere using shared VMDK as SBD. During basic tests by killing >> corosync and forcing STONITH pacemaker was not started after reboot. >> In logs I see during boot >> >> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly >> just fenced by sapprod01p for sapprod01p >> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd >> process (3151) can no longer be respawned, >> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down > Pacemaker >> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that >> stonith with SBD always takes msgwait (at least, visually host is not >> declared as OFFLINE until 120s passed). But VM rebots lightning fast >> and is up and running long before timeout expires. As msgwait was intended for the message to arrive, and not for the reboot > time (I guess), this just shows a fundamental problem in SBD design: Receipt > of the fencing command is not confirmed (other than by seeing the > consequences of ist execution). >>> >>> The 2 x msgwait is not for confirmations but for writing the poison-pill >>> and for >>> having it read by the target-side. >> >> Yes, of course, but that's not what Urlich likely intended to say. >> msgwait must account for worst case storage path latency, while in >> normal cases it happens much faster. If fenced node could acknowledge >> having been killed after reboot, stonith agent could return success much >> earlier. > How could an alive man be sure he died before? 
;) I meant: There are three delays: 1) The delay until the data is on the disk 2) The delay until the data is read from the disk 3) The delay until the host is killed A confirmation before 3) could shorten the total wait that includes 2) and 3), right? Regards, Ulrich > > Regards, > Yan ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote: On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote: 04.12.2017 14:48, Gao,Yan пишет: On 12/02/2017 07:19 PM, Andrei Borzenkov wrote: 30.11.2017 13:48, Gao,Yan пишет: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with VM on VSphere using shared VMDK as SBD. During basic tests by killing corosync and forcing STONITH pacemaker was not started after reboot. In logs I see during boot Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned, Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually host is not declared as OFFLINE until 120s passed). But VM rebots lightning fast and is up and running long before timeout expires. I think I have seen similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. I tried it (on openSUSE Tumbleweed which is what I have at hand, it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch disk at all. It simply waits that long on startup before starting the rest of the cluster stack to make sure the fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Unfortunately it waits too long. 
ha1:~ # systemctl status sbd.service ● sbd.service - Shared-storage based fencing daemon Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled) Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS) Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa Main PID: 1792 (code=exited, status=0/SUCCESS) дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon... дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating. дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon. дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state. дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'. But the real problem is - in spite of SBD failed to start, the whole cluster stack continues to run; and because SBD blindly trusts in well behaving nodes, fencing appears to succeed after timeout ... without anyone taking any action on poison pill ... That's something I always wondered about: if a node is capable of reading a poison pill then it could before shutdown also write an "I'm leaving" message into its slot. Wouldn't that make sbd more reliable? Any reason not to implement that? Probably it's not considered necessary :) SBD is a fencing mechanism which only needs to ensure fencing works. SBD on the fencing target is either there eating the pill or getting reset by watchdog, otherwise it's not there which is supposed to imply the whole cluster stack is not running so that it doesn't need to actually eat the pill. How systemd should handle the service dependencies is another topic... 
Regards, Yan Thanks, Dejan ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
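For reference, the SBD_DELAY_START behaviour discussed in this thread is controlled by a single sysconfig setting; sysconfig files use shell syntax, so a minimal fragment looks like this (the path is the SUSE one -- Debian-based builds read /etc/default/sbd instead):

```shell
# Fragment of /etc/sysconfig/sbd.
# With "yes", sbd delays cluster start-up after boot long enough that any
# fencing which targeted this node is guaranteed to have expired (see the
# discussion above); sbd watches nothing during that window by design.
SBD_DELAY_START=yes
```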
Re: [ClusterLabs] Antw: Re: questions about startup fencing
On 05/12/17 10:01 +0100, Tomas Jelinek wrote: > The first attempt to fix the issue was to put nodes into standby mode with > --lifetime=reboot: > https://github.com/ClusterLabs/pcs/commit/ea6f37983191776fd46d90f22dc1432e0bfc0b91 > > This didn't work for several reasons. One of them was that back then there was no > reliable way to set standby mode with --lifetime=reboot for more than one > node in a single step. (This may have been fixed in the meantime.) There > were however other serious reasons for not putting the nodes into standby, as > was explained by Andrew: > - it [putting the nodes into standby first] means shutdown takes longer (no > node stops until all the resources stop) > - it makes shutdown more complex (== more fragile), e.g... > - it results in pcs waiting forever for resources to stop > - if a stop fails and the cluster is configured to start at boot, then the > node will get fenced and happily run resources when it returns > (because all the nodes are up so we still have quorum) Isn't one-off stopping of a cluster without actually disabling the cluster software from running on boot rather self-contradictory? And besides, isn't this resurrection scenario also possible with the current parallel (hence subject to race conditions) stop in such a case, anyway? > - it only potentially benefits resources that have no (or very few) dependants > and can stop quicker than it takes pcs to get through its "initiate parallel > shutdown" loop (which should be rather fast since there is no ssh connection > setup overhead) > > So we ended up with just stopping pacemaker in parallel: > https://github.com/ClusterLabs/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d -- Jan (Poki) ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On 12/04/2017 07:55 PM, Andrei Borzenkov wrote: 04.12.2017 14:48, Gao,Yan пишет: On 12/02/2017 07:19 PM, Andrei Borzenkov wrote: 30.11.2017 13:48, Gao,Yan пишет: On 11/22/2017 08:01 PM, Andrei Borzenkov wrote: SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with VM on VSphere using shared VMDK as SBD. During basic tests by killing corosync and forcing STONITH pacemaker was not started after reboot. In logs I see during boot Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned, Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that stonith with SBD always takes msgwait (at least, visually host is not declared as OFFLINE until 120s passed). But VM rebots lightning fast and is up and running long before timeout expires. I think I have seen similar report already. Is it something that can be fixed by SBD/pacemaker tuning? SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution. I tried it (on openSUSE Tumbleweed which is what I have at hand, it has SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch disk at all. It simply waits that long on startup before starting the rest of the cluster stack to make sure the fencing that targeted it has returned. It intentionally doesn't watch anything during this period of time. Unfortunately it waits too long. 
ha1:~ # systemctl status sbd.service ● sbd.service - Shared-storage based fencing daemon Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled) Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK; 4min 16s ago Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited, status=0/SUCCESS) Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=killed, signa Main PID: 1792 (code=exited, status=0/SUCCESS) дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing daemon... дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out. Terminating. дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based fencing daemon. дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state. дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'. But the real problem is - in spite of SBD failed to start, the whole cluster stack continues to run; and because SBD blindly trusts in well behaving nodes, fencing appears to succeed after timeout ... without anyone taking any action on poison pill ... Start of sbd reaches systemd's timeout for starting units and systemd proceeds... TimeoutStartSec should be configured in sbd.service accordingly to be longer than msgwait. 
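Yan's advice -- raising systemd's start timeout for sbd.service above msgwait -- can be applied with a drop-in rather than by editing the unit file. A sketch, assuming msgwait=120s and choosing 180s as an arbitrary margin; DROPIN_DIR defaults to a local path only so the sketch can run without root, the real location being /etc/systemd/system/sbd.service.d:

```shell
# Create a drop-in raising TimeoutStartSec above SBD's msgwait (120s here,
# so 180s gives headroom). On a real node point DROPIN_DIR at
# /etc/systemd/system/sbd.service.d and run "systemctl daemon-reload" after.
DROPIN_DIR="${DROPIN_DIR:-./sbd.service.d}"
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/timeout.conf" <<'EOF'
[Service]
TimeoutStartSec=180
EOF
```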
Regards, Yan ha1:~ # systemctl show sbd.service -p RequiredBy RequiredBy=corosync.service but ha1:~ # systemctl status corosync.service ● corosync.service - Corosync Cluster Engine Loaded: loaded (/usr/lib/systemd/system/corosync.service; static; vendor preset: disabled) Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago Docs: man:corosync man:corosync.conf man:corosync_overview Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS) Process: 2059 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS) Main PID: 2073 (corosync) Tasks: 2 (limit: 4915) CGroup: /system.slice/corosync.service └─2073 corosync and ha1:~ # crm_mon -1r Stack: corosync Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum Last updated: Mon Dec 4 21:53:24 2017 Last change: Mon Dec 4 21:47:25 2017 by hacluster via crmd on ha1 2 nodes configured 1 resource configured Online: [ ha1 ha2 ] Full list of resources: stonith-sbd (stonith:external/sbd): Started ha1 and if I now sever connection between two nodes I will get two single node clusters each believing it won ... ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] pcs create master/slave resource doesn't work (Ken Gaillot)
Thank you very much Ken!! You nailed it, now it's working :-) On Tue, Dec 5, 2017 at 5:29 AM, Ken Gaillotwrote: > On Mon, 2017-12-04 at 23:15 +0800, Hui Xiang wrote: > > Thanks Ken very much for the helpful information. It indeed help a > > lot for debbuging. > > > > " Each time the DC decides what to do, there will be a line like > > "... > > saving inputs in ..." with a file name. The log messages just before > > that may give some useful information." > > - I am unable to find such information in the logs, it only prints > > some like /var/lib/pacemaker/pengine/pe-input-xx > > If the cluster had nothing to do, it won't show anything, but if > actions were needed, it should show them, like > "Start myrsc ( node1 )". > > Are there any messages with "error" or "warning" in the log? > > > When I am comparing the cib.xml file of good with bad one, it > > diffetiates from the order of "name" and "id" as below shown, does it > > matter for cib to function normally? > > No, the XML attributes can be any order. > > I just noticed that your cluster has symmetric-cluster=false. That > means that resources can't run anywhere by default; in order for a > resource to run, there must be a location constraint allowing it to run > on a node. Have you added such constraints? > > > > > > > > name="monitor" timeout="30"/> > > > timeout="60"/> > > > timeout="60"/> > > > name="promote" timeout="60"/> > > > name="demote" timeout="60"/> > > > > > > > > > timeout="30" id="ovndb-servers-monitor-20"/> > > > > > > > > > > > > Thanks. > > Hui. 
> > > > > > On Sat, Dec 2, 2017 at 5:07 AM, Ken Gaillot > > wrote: > > > On Fri, 2017-12-01 at 09:36 +0800, Hui Xiang wrote: > > > > Hi all, > > > > > > > > I am using the ovndb-servers ocf agent[1] which is a kind of > > > multi- > > > > state resource,when I am creating it(please see my previous > > > email), > > > > the monitor is called only once, and the start operation is never > > > > called, according to below description, the once called monitor > > > > operation returned OCF_NOT_RUNNING, should the pacemaker will > > > decide > > > > to execute start action based this return code? is there any way > > > to > > > > > > Before Pacemaker does anything with a resource, it first calls a > > > one- > > > time monitor (called a "probe") to find out the current status of > > > the > > > resource across the cluster. This allows it to discover if the > > > service > > > is already running somewhere. > > > > > > So, you will see those probes for every resource when the cluster > > > starts, or when the resource is added to the configuration, or when > > > the > > > resource is cleaned up. > > > > > > > check out what is the next action? Currently in my environment > > > > nothing happened and I am almost tried all I known ways to debug, > > > > however, no lucky, could anyone help it out? thank you very much. > > > > > > > > Monitor Return Code Description > > > > OCF_NOT_RUNNING Stopped > > > > OCF_SUCCESS Running (Slave) > > > > OCF_RUNNING_MASTERRunning (Master) > > > > OCF_FAILED_MASTER Failed (Master) > > > > Other Failed (Slave) > > > > > > > > > > > > [1] https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ > > > ovnd > > > > b-servers.ocf > > > > Hui. > > > > > > > > > > > > > > > > On Thu, Nov 30, 2017 at 6:39 PM, Hui Xiang > > > > wrote: > > > > > The really weired thing is that the monitor is only called once > > > > > other than expected repeatedly, where should I check for it? 
> > > > > > > > > > On Thu, Nov 30, 2017 at 4:14 PM, Hui Xiang > > > > > > > > wrote: > > > > > > Thanks Ken very much for your helpful infomation. > > > > > > > > > > > > I am now blocking on I can't see the pacemaker DC do any > > > further > > > > > > start/promote etc action on my resource agents, no helpful > > > logs > > > > > > founded. > > > > > > Each time the DC decides what to do, there will be a line like "... > > > saving inputs in ..." with a file name. The log messages just > > > before > > > that may give some useful information. > > > > > > Otherwise, you can take that file, and simulate what the cluster > > > decided at that point: > > > > > > crm_simulate -Sx $FILENAME > > > > > > It will first show the status of the cluster at the start of the > > > decision-making, then a "Transition Summary" with the actions that > > > are > > > required, then a simulated execution of those actions, and then > > > what > > > the resulting status would be if those actions succeeded. > > > > > > That may give you some more information. You can make it more > > > verbose > > > by using "-Ssx", or by adding "-", but it's not very user- > > > friendly > > > output. > > > > > > > > > > > > > > > So my first question is
Re: [ClusterLabs] Antw: Re: questions about startup fencing
On Tue, 5 Dec 2017 10:05:03 +0100 Tomas Jelinekwrote: > Dne 4.12.2017 v 17:21 Jehan-Guillaume de Rorthais napsal(a): > > On Mon, 4 Dec 2017 16:50:47 +0100 > > Tomas Jelinek wrote: > > > >> Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a): > >>> On Mon, 4 Dec 2017 12:31:06 +0100 > >>> Tomas Jelinek wrote: > >>> > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a): > > On Fri, 01 Dec 2017 16:34:08 -0600 > > Ken Gaillot wrote: > > > >> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote: > >>> > >>> > Kristoffer Gronlund wrote: > > Adam Spiers writes: > > > >> - The whole cluster is shut down cleanly. > >> > >> - The whole cluster is then started up again. (Side question: > >> what > >> happens if the last node to shut down is not the first to > >> start up? > >> How will the cluster ensure it has the most recent version of > >> the > >> CIB? Without that, how would it know whether the last man > >> standing > >> was shut down cleanly or not?) > > > > This is my opinion, I don't really know what the "official" > > pacemaker > > stance is: There is no such thing as shutting down a cluster > > cleanly. A > > cluster is a process stretching over multiple nodes - if they all > > shut > > down, the process is gone. When you start up again, you > > effectively have > > a completely new cluster. > > Sorry, I don't follow you at all here. When you start the cluster > up > again, the cluster config from before the shutdown is still there. > That's very far from being a completely new cluster :-) > >>> > >>> The problem is you cannot "start the cluster" in pacemaker; you can > >>> only "start nodes". The nodes will come up one by one. As opposed (as > >>> I had said) to HP Sertvice Guard, where there is a "cluster formation > >>> timeout". That is, the nodes wait for the specified time for the > >>> cluster to "form". Then the cluster starts as a whole. Of course that > >>> only applies if the whole cluster was down, not if a single node was > >>> down. 
> >> > >> I'm not sure what that would specifically entail, but I'm guessing we > >> have some of the pieces already: > >> > >> - Corosync has a wait_for_all option if you want the cluster to be > >> unable to have quorum at start-up until every node has joined. I don't > >> think you can set a timeout that cancels it, though. > >> > >> - Pacemaker will wait dc-deadtime for the first DC election to > >> complete. (if I understand it correctly ...) > >> > >> - Higher-level tools can start or stop all nodes together (e.g. pcs has > >> pcs cluster start/stop --all). > > > > Based on this discussion, I have some questions about pcs: > > > > * how is it shutting down the cluster when issuing "pcs cluster stop > > --all"? > > First, it sends a request to each node to stop pacemaker. The requests > are sent in parallel which prevents resources from being moved from node > to node. Once pacemaker stops on all nodes, corosync is stopped on all > nodes in the same manner. > >>> > >>> What if for some external reasons one node is slower (load, network, > >>> whatever) than the others and start reacting ? Sending queries in parallel > >>> doesn't feels safe enough in regard with all the race conditions that can > >>> occurs in the same time. > >>> > >>> Am I missing something ? > >>> > >> > >> If a node gets the request later than others, some resources may be > >> moved to it before it starts shutting down pacemaker as well. Pcs waits > >> for all nodes to shutdown pacemaker before it moves to shutting down > >> corosync. This way, quorum is maintained the whole time pacemaker is > >> shutting down and therefore no services are blocked from stopping due to > >> lack of quorum. > > > > OK, so if admins or RA expect to start in, the same conditions the cluster > > was shut downed, we have to take care of the shutdown ourselves by hands. 
> > Considering disabling the resource before shutting down might be the
> > best option in the situation as the CRM will take care of switching off
> > things correctly in a proper transition.
>
> My understanding is that pacemaker takes care of switching off things
> correctly in a proper transition on its shutdown. So there should be no
> extra care needed. Pacemaker developers, however, need to confirm that.

Sure, but then, the resource would move away from the node if some other node(s) (with appropriate
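The two-step approach discussed above (stop the resources in an ordered transition first, then stop the stack) can be sketched as follows. This is illustrative only: the resource name "my_vip" is a hypothetical example, and `--wait` assumes a reasonably recent pcs.

```shell
# Hypothetical controlled shutdown: disable resources first so the CRM
# stops them in a proper transition, then stop pacemaker and corosync
# on all nodes. "my_vip" is an example resource name.
TARGET_RESOURCE="my_vip"

if command -v pcs >/dev/null 2>&1; then
    # Step 1: the CRM stops the resource in an ordered transition;
    # --wait blocks until the stop completes (or 120s elapse)
    pcs resource disable "$TARGET_RESOURCE" --wait=120
    # Step 2: stop pacemaker on all nodes in parallel, then corosync,
    # preserving quorum throughout (as Tomas describes above)
    pcs cluster stop --all
else
    echo "pcs not installed here; commands are illustrative only"
fi
```

On the next start-up, re-enabling the resource (`pcs resource enable`) restores it under cluster control, so placement decisions happen only after the whole membership has formed.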
Re: [ClusterLabs] Adding Tomcat as a resource to a Cluster on CentOS 7
On 04/12/17 16:29 +, Sean Beeson wrote:

Thank you for the reply, Oyvind. I gave it plenty of time to start up.
Using tomcat_name="tomcat" it starts what I can only call a lifeless PID,
but it never seems to actually start up. The catalina.out file is never
touched, so it never has anything in it to indicate a problem. Pacemaker
does seem to be managing it though because, although this PID shows, it
will report it as not running and then move everything to the other node.
It will do that a couple of times, but that eventually stops as well.

Try "pcs resource disable <resource>" and then "pcs resource debug-start
<resource> --full". The last command will start the resource and show you
which commands are run, so you can troubleshoot why it's failing.

Kind regards,
Sean

On Thu, Nov 30, 2017 at 10:40 PM Oyvind Albrigtsen wrote:

Tomcat can be very slow at startup depending on the modules you use, so
you can either disable modules you aren't using to make it start faster
or set a higher start timeout via "pcs resource op start interval=".

On 30/11/17 13:26 +, Sean Beeson wrote:
>Hi, list.
>
>This is a pretty basic question. I have gone through what I could find on
>setting up a Tomcat service as a resource in a cluster, but did not find
>exactly the issue I am having. Sorry if it has been covered before.
>
>I am attempting this on centos-release-7-4.1708.el7.centos.x86_64.
>The pcs I have installed is pcs-0.9.158-6.el7.centos.x86_64.
>The resource-agents installed is resource-agents-3.9.5-105.el7_4.2.x86_64.
>
>I have DRBD, MySql, and a virtual IP running spectacularly well and they
>fail over perfectly and do exactly what I want. I can add Tomcat as a
>resource just fine, but it never starts and I cannot find anything in any
>log file that indicates why. Pcs does at some point know to check on it,
>but simply says Tomcat is not running. If I run everything manually on a
>cluster node I can get Tomcat to start with systemctl. Here is how I am
>trying to configure it.
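The debug-start advice above can be sketched as a short sequence; `tomcat_OuterWeb` is the resource name from this thread, and the commands assume pcs is installed:

```shell
# Run the resource's start action in the foreground, outside the cluster's
# normal scheduling, so the agent's commands and output are visible.
RESOURCE="tomcat_OuterWeb"

if command -v pcs >/dev/null 2>&1; then
    # Take the resource out of cluster control first so pacemaker does
    # not race with the manual start
    pcs resource disable "$RESOURCE"
    # --full prints every command the resource agent executes
    pcs resource debug-start "$RESOURCE" --full
else
    echo "pcs not installed here; commands are illustrative only"
fi
```

Remember to `pcs resource enable` (and, if needed, `pcs resource debug-stop`) afterwards so the cluster resumes managing the resource.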
>[root@centos7-ha-lab-01 ~]# pcs status
>Cluster name: ha-cluster
>Stack: corosync
>Current DC: centos7-ha-lab-02-cr (version 1.1.16-12.el7_4.4-94ff4df) -
>partition with quorum
>Last updated: Thu Nov 30 21:03:36 2017
>Last change: Thu Nov 30 20:53:37 2017 by root via cibadmin on
>centos7-ha-lab-01-cr
>
>2 nodes configured
>6 resources configured
>
>Online: [ centos7-ha-lab-01-cr centos7-ha-lab-02-cr ]
>
>Full list of resources:
>
> Master/Slave Set: DRBD_data_clone [DRBD_data]
>     Masters: [ centos7-ha-lab-01-cr ]
>     Slaves: [ centos7-ha-lab-02-cr ]
> fsDRBD_data (ocf::heartbeat:Filesystem): Started centos7-ha-lab-01-cr
> OuterDB_Service (systemd:mysqld): Started centos7-ha-lab-01-cr
> OuterDB_VIP (ocf::heartbeat:IPaddr2): Started centos7-ha-lab-01-cr
> tomcat_OuterWeb (ocf::heartbeat:tomcat): Stopped
>
>Failed Actions:
>* tomcat_OuterWeb_start_0 on centos7-ha-lab-01-cr 'unknown error' (1):
>    call=67, status=Timed Out, exitreason='none',
>    last-rc-change='Thu Nov 30 20:56:22 2017', queued=0ms, exec=180003ms
>* tomcat_OuterWeb_start_0 on centos7-ha-lab-02-cr 'unknown error' (1):
>    call=57, status=Timed Out, exitreason='none',
>    last-rc-change='Thu Nov 30 20:53:23 2017', queued=0ms, exec=180003ms
>
>Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
>
>I have tried with and without tomcat_name=tomcat_OuterWeb and tomcat and
>root for tomcat_user=. Neither works.
>
>Here is the command I am using to add it.
>
>pcs resource create tomcat_OuterWeb ocf:heartbeat:tomcat
>java_home="/opt/java/jre1.7.0_80" catalina_home="/opt/tomcat7"
>catalina_opts="-Dbuild.compiler.emacs=true -Dfile.encoding=UTF-8
>-Djava.util.logging.config.file=/opt/tomcat7/conf/log4j.properties
>-Dlog4j.configuration=file:/opt/tomcat7/conf/log4j.properties -Xms1024m
>-Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=512m" tomcat_user="root" op
>monitor interval="15s" op start timeout="180s"
>
>I have tried also the most basic.
>pcs resource create tomcat_OuterWeb ocf:heartbeat:tomcat
>java_home="/opt/java/jre1.7.0_80" catalina_home="/opt/tomcat7"
>tomcat_name="tomcat_OuterWeb" tomcat_user="root" op monitor interval="15s"
>op start timeout="180s"
>
>In other examples I have seen they usually use "params" before the options
>in these commands to add Tomcat as a resource, but when I use that it tells
>me that is an unrecognized option, and it then accepts the options without
>it just fine. I was led to think this was a difference in the version of
>the resource-agents perhaps.
>
>Any idea why I cannot get Tomcat to start, or some lead on the logging I
>could look at to understand why it is failing, would be great. Nothing
>shows in messages, catalina.out, pcsd.log, nor the resource
>log--tomcat_OuterWeb.log. However, it does make the resource log, but it
>only has this in it, which seems to be false:
>
>2017/11/30 20:50:22: start ===
>2017/11/30 20:53:22: stop
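Another way to see why the start times out is to run the OCF agent by hand, outside pacemaker, with the same parameters the pcs command above passes. This is a hedged sketch: the `OCF_RESKEY_*` variable names follow the usual convention for heartbeat agents, and the paths come from the thread.

```shell
# Invoke the heartbeat:tomcat agent directly so its stdout/stderr are
# visible on the terminal instead of disappearing into the cluster logs.
AGENT=/usr/lib/ocf/resource.d/heartbeat/tomcat

if [ -x "$AGENT" ]; then
    # Parameters are passed to OCF agents as OCF_RESKEY_<name> variables
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_java_home=/opt/java/jre1.7.0_80 \
    OCF_RESKEY_catalina_home=/opt/tomcat7 \
    OCF_RESKEY_tomcat_user=root \
    "$AGENT" start
    echo "agent exit code: $?"   # 0 = OCF_SUCCESS
else
    echo "tomcat agent not installed at $AGENT; shown for illustration"
fi
```

If the agent hangs here too, the problem is in the agent's view of the Tomcat startup (e.g. the PID/status check), not in pacemaker's scheduling.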
Re: [ClusterLabs] Antw: Re: questions about startup fencing
Dne 4.12.2017 v 17:21 Jehan-Guillaume de Rorthais napsal(a):
On Mon, 4 Dec 2017 16:50:47 +0100 Tomas Jelinek wrote:
Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):
On Mon, 4 Dec 2017 12:31:06 +0100 Tomas Jelinek wrote:
Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):
On Fri, 01 Dec 2017 16:34:08 -0600 Ken Gaillot wrote:
On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
Kristoffer Gronlund wrote:
Adam Spiers writes:

- The whole cluster is shut down cleanly.

- The whole cluster is then started up again. (Side question: what happens
  if the last node to shut down is not the first to start up? How will the
  cluster ensure it has the most recent version of the CIB? Without that,
  how would it know whether the last man standing was shut down cleanly or
  not?)

This is my opinion, I don't really know what the "official" pacemaker
stance is: There is no such thing as shutting down a cluster cleanly. A
cluster is a process stretching over multiple nodes - if they all shut
down, the process is gone. When you start up again, you effectively have a
completely new cluster.

Sorry, I don't follow you at all here. When you start the cluster up
again, the cluster config from before the shutdown is still there. That's
very far from being a completely new cluster :-)

The problem is you cannot "start the cluster" in pacemaker; you can only
"start nodes". The nodes will come up one by one. As opposed (as I had
said) to HP Service Guard, where there is a "cluster formation timeout".
That is, the nodes wait for the specified time for the cluster to "form".
Then the cluster starts as a whole. Of course that only applies if the
whole cluster was down, not if a single node was down.

I'm not sure what that would specifically entail, but I'm guessing we have
some of the pieces already:

- Corosync has a wait_for_all option if you want the cluster to be unable
  to have quorum at start-up until every node has joined.
I don't think you can set a timeout that cancels it, though.

- Pacemaker will wait dc-deadtime for the first DC election to complete.
  (if I understand it correctly ...)

- Higher-level tools can start or stop all nodes together (e.g. pcs has
  pcs cluster start/stop --all).

Based on this discussion, I have some questions about pcs:

* how is it shutting down the cluster when issuing "pcs cluster stop
  --all"?

First, it sends a request to each node to stop pacemaker. The requests are
sent in parallel which prevents resources from being moved from node to
node. Once pacemaker stops on all nodes, corosync is stopped on all nodes
in the same manner.

What if for some external reasons one node is slower (load, network,
whatever) than the others and starts reacting? Sending queries in parallel
doesn't feel safe enough in regard to all the race conditions that can
occur at the same time.

Am I missing something?

If a node gets the request later than others, some resources may be moved
to it before it starts shutting down pacemaker as well. Pcs waits for all
nodes to shut down pacemaker before it moves to shutting down corosync.
This way, quorum is maintained the whole time pacemaker is shutting down
and therefore no services are blocked from stopping due to lack of quorum.

OK, so if admins or RAs expect to start in the same conditions the cluster
was shut down in, we have to take care of the shutdown ourselves by hand.
Considering disabling the resource before shutting down might be the best
option in the situation as the CRM will take care of switching off things
correctly in a proper transition.

My understanding is that pacemaker takes care of switching off things
correctly in a proper transition on its shutdown. So there should be no
extra care needed. Pacemaker developers, however, need to confirm that.

That's fine to me, as a cluster shutdown should be part of a controlled
procedure. I have to update my online docs I suppose now.

Thank you for your answers!
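The wait_for_all piece mentioned above is a single votequorum setting. An illustrative corosync.conf (corosync 2.x) quorum section, with example values:

```
quorum {
    provider: corosync_votequorum
    # No quorum at start-up until all nodes have been seen at least once,
    # so resources cannot start on a lone first-booting node
    wait_for_all: 1
}
```

Note that in two-node clusters the related two_node option already implies wait_for_all by default, so setting it explicitly mainly matters for larger clusters.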
___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: questions about startup fencing
Dne 4.12.2017 v 23:17 Ken Gaillot napsal(a):
On Mon, 2017-12-04 at 22:08 +0300, Andrei Borzenkov wrote:
04.12.2017 18:47, Tomas Jelinek пишет:
Dne 4.12.2017 v 16:02 Kristoffer Grönlund napsal(a):
Tomas Jelinek writes:

* how is it shutting down the cluster when issuing "pcs cluster stop
  --all"?

First, it sends a request to each node to stop pacemaker. The requests are
sent in parallel which prevents resources from being moved from node to
node. Once pacemaker stops on all nodes, corosync is stopped on all nodes
in the same manner.

* any race condition possible where the cib will record only one node up
  before the last one shut down?
* will the cluster start safely?

That definitely sounds racy to me. The best idea I can think of would be
to set all nodes except one in standby, and then shutdown pacemaker
everywhere...

What issues does it solve? Which node should be the one? How do you get
the nodes out of standby mode on startup?

Is --lifetime=reboot valid for cluster properties? It is accepted by
crm_attribute and actually puts the value as a transient attribute.

standby is a node attribute, so lifetime does apply normally.

Right, I forgot about this. I was dealing with 'pcs cluster stop --all'
back in January 2015, so I don't remember all the details anymore.
However, I was able to dig out the private email thread where stopping a
cluster was discussed with pacemaker developers including Andrew Beekhof
and David Vossel.

Originally, pcs was stopping nodes in parallel in such a manner that each
node stopped pacemaker and then corosync independently of other nodes.
This caused loss of quorum during stopping the cluster, as nodes hosting
resources which stopped fast disconnected from corosync sooner than nodes
hosting resources which stopped slowly. Due to quorum missing, some
resources could not be stopped and the cluster stop failed.
This is covered here:
https://bugzilla.redhat.com/show_bug.cgi?id=1180506

The first attempt to fix the issue was to put nodes into standby mode with
--lifetime=reboot:
https://github.com/ClusterLabs/pcs/commit/ea6f37983191776fd46d90f22dc1432e0bfc0b91

This didn't work for several reasons. One of them was that back then there
was no reliable way to set standby mode with --lifetime=reboot for more
than one node in a single step. (This may have been fixed in the
meantime.) There were, however, other serious reasons for not putting the
nodes into standby, as explained by Andrew:

- it [putting the nodes into standby first] means shutdown takes longer
  (no node stops until all the resources stop)
- it makes shutdown more complex (== more fragile), eg...
- it results in pcs waiting forever for resources to stop
- if a stop fails and the cluster is configured to start at boot, then the
  node will get fenced and happily run resources when it returns (because
  all the nodes are up so we still have quorum)
- only potentially benefits resources that have no (or very few)
  dependants and can stop quicker than it takes pcs to get through its
  "initiate parallel shutdown" loop (which should be rather fast since
  there is no ssh connection setup overhead)

So we ended up with just stopping pacemaker in parallel:
https://github.com/ClusterLabs/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d

I hope this sheds light on why pcs stops clusters the way it does, and
that standby was considered but rejected for good reasons.

Regards,
Tomas
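For reference, setting standby with a reboot lifetime, as discussed above, looks like this with crm_attribute. The node name "node1" is a placeholder; the transient attribute lands in the CIB status section and disappears when the node restarts.

```shell
# Put a node into standby only until its next restart.
NODE="node1"   # example node name

if command -v crm_attribute >/dev/null 2>&1; then
    # --lifetime reboot stores the attribute as a transient node
    # attribute (status section) rather than a permanent one
    crm_attribute --node "$NODE" --name standby --update on --lifetime reboot
    # Verify it took effect
    crm_attribute --node "$NODE" --name standby --query --lifetime reboot
else
    echo "crm_attribute not available here; commands are illustrative only"
fi
```

Per the thread above, doing this for several nodes is one command per node, which is exactly the "no single step for multiple nodes" limitation pcs ran into.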
Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does not work
Hi Ken,

Thank you for your comment. ("cibadmin --empty" is interesting.)
I registered it in CLBZ: https://bugs.clusterlabs.org/show_bug.cgi?id=5331

Best Regards

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Saturday, December 02, 2017 8:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does
> not work
>
> On Tue, 2017-11-28 at 09:36 +, 井上 和徳 wrote:
> > Hi,
> >
> > Sometimes a node with 'PCMK_node_start_state=standby' will start up
> > Online.
> >
> > [ reproduction scenario ]
> > * Set 'PCMK_node_start_state=standby' in /etc/sysconfig/pacemaker.
> > * Delete the cib (/var/lib/pacemaker/cib/*).
> > * Start pacemaker at the same time on 2 nodes.
> >   # for i in rhel74-1 rhel74-3 ; do ssh -f $i systemctl start pacemaker ; done
> >
> > [ actual result ]
> > * crm_mon
> >   Stack: corosync
> >   Current DC: rhel74-3 (version 1.1.18-2b07d5c) - partition with quorum
> >   Last change: Wed Nov 22 06:22:50 2017 by hacluster via crmd on rhel74-3
> >
> >   2 nodes configured
> >   0 resources configured
> >
> >   Node rhel74-3: standby
> >   Online: [ rhel74-1 ]
> >
> > * cib.xml
> >   value="on"/>
> >
> > * pacemaker.log
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_native.c:462) warning:
> >     cib_native_perform_op_delegate: Call failed: No such device or address
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_attrs.c:320) info:
> >     update_attr_delegate: Update <node id="3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_attrs.c:320) info:
> >     update_attr_delegate: Update <instance_attributes id="nodes-3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_attrs.c:320) info:
> >     update_attr_delegate: Update <nvpair id="nodes-3232261507-standby"
> >     name="standby" value="on"/>
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_attrs.c:320) info:
> >     update_attr_delegate: Update tes>
> >   Nov 22 06:22:50 [20755] rhel74-1 crmd: (cib_attrs.c:320) info:
> >     update_attr_delegate: Update
> >
> > * I attached crm_report to GitHub (too big to attach to this email),
> >   so look at it.
> >   https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Wed-22-Nov-2017.tar.bz2
> >
> > I think that the additional timing of *1 and *2 is the cause.
> > *1 '
> > *2 value="on"/>
> >
> > I expect it to be fixed, but if it's difficult, I have two questions.
> > 1) Does this only occur if there is no cib.xml (in other words, there
> >    is no <nodes> element)?
>
> I believe so. I think this is the key message:
>
> Nov 22 06:22:50 [20750] rhel74-1 cib: (callbacks.c:1101) warning:
> cib_process_request: Completed cib_modify operation for section nodes:
> No such device or address (rc=-6, origin=rhel74-1/crmd/12, version=0.3.0)
>
> PCMK_node_start_state works by setting the "standby" node attribute in
> the CIB. However, it does this via a "modify" command that assumes the
> <nodes> tag already exists.
>
> If there is no CIB, pacemaker will quickly create one -- but in this
> case, the node tries to set the attribute before that's happened.
>
> Hopefully we can come up with a fix. If you want, you can file a bug
> report at bugs.clusterlabs.org, to track the progress.
>
> > 2) Is there any workaround other than "Do not start at the same
> >    time"?
> >
> > Best Regards
>
> Before starting pacemaker, if /var/lib/pacemaker/cib is empty, you can
> create a skeleton CIB with:
>
> cibadmin --empty > /var/lib/pacemaker/cib/cib.xml
>
> That will include an empty <nodes> tag, and the modify command should
> work when pacemaker starts.
> --
> Ken Gaillot
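Ken's workaround can be wrapped in a small pre-start check so it only runs when the CIB is actually missing. A sketch; the chown to the usual pacemaker ownership is an assumption, adjust for your distribution:

```shell
# Create a skeleton CIB only if none exists yet, so that
# PCMK_node_start_state=standby has a <nodes> section to modify.
CIB_DIR=/var/lib/pacemaker/cib

if command -v cibadmin >/dev/null 2>&1; then
    if [ ! -s "$CIB_DIR/cib.xml" ]; then
        # cibadmin --empty writes an empty CIB skeleton to stdout
        cibadmin --empty > "$CIB_DIR/cib.xml"
        # Assumption: pacemaker's CIB files are owned hacluster:haclient
        chown hacluster:haclient "$CIB_DIR/cib.xml"
    fi
else
    echo "cibadmin not available here; commands are illustrative only"
fi
```

Run this (e.g. from a systemd ExecStartPre or an admin script) before starting pacemaker on a freshly wiped node.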
Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.
On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
> 04.12.2017 14:48, Gao,Yan пишет:
> > On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
> >> 30.11.2017 13:48, Gao,Yan пишет:
> >>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
> >>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
> >>>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
> >>>> corosync and forcing STONITH, pacemaker was not started after reboot.
> >>>> In logs I see during boot
> >>>>
> >>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
> >>>> just fenced by sapprod01p for sapprod01p
> >>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd
> >>>> process (3151) can no longer be respawned,
> >>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down
> >>>> Pacemaker
> >>>>
> >>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
> >>>> stonith with SBD always takes msgwait (at least, visually the host is
> >>>> not declared as OFFLINE until 120s have passed). But the VM reboots
> >>>> lightning fast and is up and running long before the timeout expires.
> >>>>
> >>>> I think I have seen a similar report already. Is it something that can
> >>>> be fixed by SBD/pacemaker tuning?
> >>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
> >>>
> >> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
> >> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
> >> the disk at all.
> > It simply waits that long on startup before starting the rest of the
> > cluster stack to make sure the fencing that targeted it has returned. It
> > intentionally doesn't watch anything during this period of time.
>
> Unfortunately it waits too long.
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>    preset: disabled)
>    Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>    4min 16s ago
>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>   status=0/SUCCESS)
>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>   watch (code=killed, signa
>  Main PID: 1792 (code=exited, status=0/SUCCESS)
>
> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
>
> But the real problem is - in spite of SBD failing to start, the whole
> cluster stack continues to run; and because SBD blindly trusts in
> well-behaving nodes, fencing appears to succeed after the timeout ...
> without anyone taking any action on the poison pill ...

That's something I always wondered about: if a node is capable of reading
a poison pill then it could, before shutdown, also write an "I'm leaving"
message into its slot. Wouldn't that make sbd more reliable? Any reason
not to implement that?

Thanks,

Dejan
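Given how silently a storage migration can leave each node with its own private copy of the "shared" device (as described earlier in the thread), it is worth verifying the SBD device from both nodes. A sketch; the device path and peer node name are placeholders for your own setup:

```shell
# Sanity-check an SBD device and prove that both nodes see the same one.
SBD_DEV=/dev/disk/by-id/my-shared-sbd   # placeholder: your shared device
PEER=ha2                                # placeholder: the other node's name

if command -v sbd >/dev/null 2>&1; then
    # Show the on-disk header: timeouts (watchdog, msgwait) and layout
    sbd -d "$SBD_DEV" dump
    # Show the node slots and any pending messages
    sbd -d "$SBD_DEV" list
    # Write a harmless test message into the peer's slot; the peer's sbd
    # daemon logs its receipt, which proves the storage really is shared
    sbd -d "$SBD_DEV" message "$PEER" test
else
    echo "sbd not installed here; commands are illustrative only"
fi
```

If the peer never logs the test message, each node is talking to a different copy of the device, and fencing via poison pill cannot work.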