Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Andrei Borzenkov
05.12.2017 13:34, Gao,Yan пишет:
> On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote:
>> On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
>>> 04.12.2017 14:48, Gao,Yan пишет:
 On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
> 30.11.2017 13:48, Gao,Yan пишет:
>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>>> VM on VSphere using shared VMDK as SBD. During basic tests by
>>> killing
>>> corosync and forcing STONITH pacemaker was not started after reboot.
>>> In logs I see during boot
>>>
>>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>>> just fenced by sapprod01p for sapprod01p
>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>>> process (3151) can no longer be respawned,
>>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>>> Pacemaker
>>>
>>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems
>>> that
>>> stonith with SBD always takes msgwait (at least, visually host is
>>> not
>>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>>> and is up and running long before timeout expires.
>>>
>>> I think I have seen similar report already. Is it something that can
>>> be fixed by SBD/pacemaker tuning?
>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>>
>
> I tried it (on openSUSE Tumbleweed which is what I have at hand, it
> has
> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
> disk at all.
 It simply waits that long on startup before starting the rest of the
 cluster stack to make sure the fencing that targeted it has
 returned. It
 intentionally doesn't watch anything during this period of time.

>>>
>>> Unfortunately it waits too long.
>>>
>>> ha1:~ # systemctl status sbd.service
>>> ● sbd.service - Shared-storage based fencing daemon
>>>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>>> preset: disabled)
>>>     Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>>> 4min 16s ago
>>>    Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>>> status=0/SUCCESS)
>>>    Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>>> watch (code=killed, signa
>>>   Main PID: 1792 (code=exited, status=0/SUCCESS)
>>>
>>> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
>>> daemon...
>>> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
>>> Terminating.
>>> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
>>> fencing daemon.
>>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result
>>> 'timeout'.
>>>
>>> But the real problem is - even though SBD failed to start, the whole
>>> cluster stack continues to run; and because SBD blindly trusts in
>>> well-behaved nodes, fencing appears to succeed after the timeout ...
>>> without anyone acting on the poison pill ...
>>
>> That's something I always wondered about: if a node is capable of
>> reading a poison pill then it could before shutdown also write an
>> "I'm leaving" message into its slot. Wouldn't that make sbd more
>> reliable? Any reason not to implement that?
> Probably it's not considered necessary :) SBD is a fencing mechanism
> which only needs to ensure fencing works.

I'm sorry, but SBD has zero chance to ensure fencing works. Recently I
did a storage vMotion of a VM with the shared VMDK used for SBD - it
silently created a copy of the VMDK that was indistinguishable from the
original one. As a result, both VMs ran with their own copy. Of course
fencing did not work - but each VM *assumed* it worked because it posted
the message and waited for the timeout ...

I would expect the "monitor" action of the SBD fencing agent to actually
test whether messages are seen by the remote node(s) ...

> SBD on the fencing target is
> either there eating the pill or getting reset by watchdog, otherwise
> it's not there, which is supposed to imply that the whole cluster stack is
> not running, so it doesn't need to actually eat the pill.
> 
> How systemd should handle the service dependencies is another topic...
> 
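
For reference, a minimal sketch of the /etc/sysconfig/sbd settings being
discussed here (the device path is a placeholder; everything else in that
file is left at its defaults):

  SBD_DEVICE="/dev/disk/by-id/<shared-sbd-device>"
  SBD_DELAY_START="yes"

With SBD_DELAY_START=yes the daemon deliberately sits out the delay on
startup before the rest of the cluster stack is started, which is the
behaviour described above.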


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Andrei Borzenkov
05.12.2017 12:59, Gao,Yan пишет:
> On 12/04/2017 07:55 PM, Andrei Borzenkov wrote:
>> 04.12.2017 14:48, Gao,Yan пишет:
>>> On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
 30.11.2017 13:48, Gao,Yan пишет:
> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>> corosync and forcing STONITH pacemaker was not started after reboot.
>> In logs I see during boot
>>
>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>> just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>> process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>> Pacemaker
>>
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes msgwait (at least, visually host is not
>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>> and is up and running long before timeout expires.
>>
>> I think I have seen similar report already. Is it something that can
>> be fixed by SBD/pacemaker tuning?
> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
>

 I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
 SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
 disk at all.
>>> It simply waits that long on startup before starting the rest of the
>>> cluster stack to make sure the fencing that targeted it has returned. It
>>> intentionally doesn't watch anything during this period of time.
>>>
>>
>> Unfortunately it waits too long.
>>
>> ha1:~ # systemctl status sbd.service
>> ● sbd.service - Shared-storage based fencing daemon
>>     Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
>> preset: disabled)
>>     Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
>> 4min 16s ago
>>    Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
>> status=0/SUCCESS)
>>    Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
>> watch (code=killed, signa
>>   Main PID: 1792 (code=exited, status=0/SUCCESS)
>>
>> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
>> daemon...
>> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
>> Terminating.
>> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
>> fencing daemon.
>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
>> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result
>> 'timeout'.
>>
>> But the real problem is - even though SBD failed to start, the whole
>> cluster stack continues to run; and because SBD blindly trusts in
>> well-behaved nodes, fencing appears to succeed after the timeout ...
>> without anyone acting on the poison pill ...
> Start of sbd reaches systemd's timeout for starting units and systemd
> proceeds...
> 

Do you consider this normal and intended behavior? Again - currently it is
possible for the cluster stack to start without working STONITH, and
because there is no confirmation whether stonith via SBD worked at all,
we end up in split brain.

> TimeoutStartSec should be configured in sbd.service accordingly, so that it
> is longer than msgwait.
> 

And where is that documented? You did not say it earlier,
/etc/sysconfig/sbd does not say it, "man sbd" does not say it. How are
users supposed to be aware of this?
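
For illustration only - with the 60s watchdog / 120s msgwait values from
this thread, a systemd drop-in created with "systemctl edit sbd.service"
along the following lines would keep systemd from giving up before sbd
finishes its delayed start (the 180s figure is just an example, not a
value taken from the thread):

  [Service]
  TimeoutStartSec=180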


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ken Gaillot
On Tue, 2017-12-05 at 17:43 +0100, Jehan-Guillaume de Rorthais wrote:
> On Tue, 05 Dec 2017 08:59:55 -0600
> Ken Gaillot  wrote:
> 
> > On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> > > > > > Tomas Jelinek  schrieb am 04.12.2017
> > > > > > um
> > > > > > 16:50 in Nachricht
> > > 
> > > <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>:
> > > > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):
> > > > > On Mon, 4 Dec 2017 12:31:06 +0100
> > > > > Tomas Jelinek  wrote:
> > > > > 
> > > > > > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais
> > > > > > napsal(a):
> > > > > > > On Fri, 01 Dec 2017 16:34:08 -0600
> > > > > > > Ken Gaillot  wrote:
> > > > > > >    
> > > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
> > > > > > > > > 
> > > > > > > > >   
> > > > > > > > > > Kristoffer Gronlund  wrote:
> > > > > > > > > > > Adam Spiers  writes:
> > > > > > > > > > >   
> > > > > > > > > > > > - The whole cluster is shut down cleanly.
> > > > > > > > > > > > 
> > > > > > > > > > > > - The whole cluster is then started up
> > > > > > > > > > > > again.  (Side question:
> > > > > > > > > > > > what
> > > > > > > > > > > > happens if the last node to shut down is
> > > > > > > > > > > > not
> > > > > > > > > > > > the first to
> > > > > > > > > > > > start up?
> > > > > > > > > > > > How will the cluster ensure it has the most
> > > > > > > > > > > > recent version of
> > > > > > > > > > > > the
> > > > > > > > > > > > CIB?  Without that, how would it know
> > > > > > > > > > > > whether
> > > > > > > > > > > > the last man
> > > > > > > > > > > > standing
> > > > > > > > > > > > was shut down cleanly or not?)
> > > > > > > > > > > 
> > > > > > > > > > > This is my opinion, I don't really know what the
> > > > > > > > > > > "official"
> > > > > > > > > > > pacemaker
> > > > > > > > > > > stance is: There is no such thing as shutting
> > > > > > > > > > > down a
> > > > > > > > > > > cluster
> > > > > > > > > > > cleanly. A
> > > > > > > > > > > cluster is a process stretching over multiple
> > > > > > > > > > > nodes -
> > > > > > > > > > > if they all
> > > > > > > > > > > shut
> > > > > > > > > > > down, the process is gone. When you start up
> > > > > > > > > > > again,
> > > > > > > > > > > you
> > > > > > > > > > > effectively have
> > > > > > > > > > > a completely new cluster.
> > > > > > > > > > 
> > > > > > > > > > Sorry, I don't follow you at all here.  When you
> > > > > > > > > > start
> > > > > > > > > > the cluster
> > > > > > > > > > up
> > > > > > > > > > again, the cluster config from before the shutdown
> > > > > > > > > > is
> > > > > > > > > > still there.
> > > > > > > > > > That's very far from being a completely new cluster
> > > > > > > > > > :-)
> > > > > > > > > 
> > > > > > > > > The problem is you cannot "start the cluster" in
> > > > > > > > > pacemaker; you can
> > > > > > > > > only "start nodes". The nodes will come up one by
> > > > > > > > > one. As
> > > > > > > > > opposed (as
> > > > > > > > > I had said) to HP Service Guard, where there is a
> > > > > > > > > "cluster formation
> > > > > > > > > timeout". That is, the nodes wait for the specified
> > > > > > > > > time
> > > > > > > > > for the
> > > > > > > > > cluster to "form". Then the cluster starts as a
> > > > > > > > > whole. Of
> > > > > > > > > course that
> > > > > > > > > only applies if the whole cluster was down, not if a
> > > > > > > > > single node was
> > > > > > > > > down.
> > > > > > > > 
> > > > > > > > I'm not sure what that would specifically entail, but
> > > > > > > > I'm
> > > > > > > > guessing we
> > > > > > > > have some of the pieces already:
> > > > > > > > 
> > > > > > > > - Corosync has a wait_for_all option if you want the
> > > > > > > > cluster to be
> > > > > > > > unable to have quorum at start-up until every node has
> > > > > > > > joined. I don't
> > > > > > > > think you can set a timeout that cancels it, though.
> > > > > > > > 
> > > > > > > > - Pacemaker will wait dc-deadtime for the first DC
> > > > > > > > election
> > > > > > > > to
> > > > > > > > complete. (if I understand it correctly ...)
> > > > > > > > 
> > > > > > > > - Higher-level tools can start or stop all nodes
> > > > > > > > together
> > > > > > > > (e.g. pcs has
> > > > > > > > pcs cluster start/stop --all).
> > > > > > > 
> > > > > > > Based on this discussion, I have some questions about
> > > > > > > pcs:
> > > > > > > 
> > > > > > > * how is it shutting down the cluster when issuing "pcs
> > > > > > > cluster stop
> > > > > > > --all"?
> > > > > > 
> > > > > > First, it sends a request to each node to stop pacemaker.
> > > > > > The
> > > > > > requests
> > > > > > are sent in parallel which prevents resources from being
> > > > > > moved
> > > > > > from node
> > > > > > to node. Once pacemaker stops on all nodes, corosync 

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Gao,Yan

On 12/05/2017 03:11 PM, Ulrich Windl wrote:




"Gao,Yan"  schrieb am 05.12.2017 um 15:04 in Nachricht

:

On 12/05/2017 12:41 PM, Ulrich Windl wrote:




"Gao,Yan"  schrieb am 01.12.2017 um 20:36 in Nachricht

:


[...]


I meant: There are three delays:
1) The delay until data is on the disk

It takes several IOs for the sender to do this -- read the device
header, look up the slot, write the message and verify the message is
written (timeout_io defaults to 3s).

As mentioned, the msgwait timer of the sender starts only after the message
has been verified to be written. We just need to make sure stonith-timeout
is configured to be sufficiently longer than the sum.


2) Delay until data is read from the disk

It's already taken into account with msgwait. Considering the recipient
keeps reading in a loop, we don't know when exactly it starts to read
for this specific message. But once it starts a read, it has to be
done within timeout_watchdog, otherwise the watchdog triggers. So even for
a bad case, the message should be read within 2 * timeout_watchdog. That's
the reason why the sender has to wait msgwait, which is 2 *
timeout_watchdog.


3) Delay until Host was killed

Kill is basically immediately triggered once poison pill is read.


Considering that the response time of a SAN disk system with cache is
typically a few microseconds, writing to disk may even be "more immediate"
than killing the node via a watchdog reset ;-)
Well, it's possible :) Timeout matters for "bad cases" though. Compared
with disk IO facing difficulties like path failure and so on,
triggering the watchdog is trivial.



So you can't easily say one is immediate, while the other has to be waited for 
IMHO.
Of course an even longer msgwait, with all the factors that you can think
of taken into account, will be even safer.


Regards,
  Yan



Regards,
Ulrich




A confirmation before 3) could shorten the total wait that includes 2) and

3),

right?

As mentioned in another email, an alive node, even one indeed coming back
from death, cannot actually confirm itself dead or even give a confirmation
about whether it was ever dead. And a successful fencing means the node is
dead.

Regards,
Yan




Regards,
Ulrich




Regards,
 Yan


[...]
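
To make the arithmetic above concrete (sketch only - the 150s figure is
illustrative, the other numbers are the timeouts from this thread):

  # watchdog timeout            : 60s
  # msgwait = 2 * watchdog      : 120s
  # timeout_io (default)        : 3s
  # stonith-timeout should comfortably exceed 120s + 3s plus some overhead,
  # e.g. with crmsh (pcs has an equivalent "pcs property set"):
  crm configure property stonith-timeout=150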


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Jehan-Guillaume de Rorthais
On Tue, 05 Dec 2017 08:59:55 -0600
Ken Gaillot  wrote:

> On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> > > > > Tomas Jelinek  schrieb am 04.12.2017 um
> > > > > 16:50 in Nachricht
> > 
> > <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>:
> > > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):
> > > > On Mon, 4 Dec 2017 12:31:06 +0100
> > > > Tomas Jelinek  wrote:
> > > > 
> > > > > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):
> > > > > > On Fri, 01 Dec 2017 16:34:08 -0600
> > > > > > Ken Gaillot  wrote:
> > > > > >    
> > > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
> > > > > > > > 
> > > > > > > >   
> > > > > > > > > Kristoffer Gronlund  wrote:
> > > > > > > > > > Adam Spiers  writes:
> > > > > > > > > >   
> > > > > > > > > > > - The whole cluster is shut down cleanly.
> > > > > > > > > > > 
> > > > > > > > > > > - The whole cluster is then started up
> > > > > > > > > > > again.  (Side question:
> > > > > > > > > > > what
> > > > > > > > > > > happens if the last node to shut down is not
> > > > > > > > > > > the first to
> > > > > > > > > > > start up?
> > > > > > > > > > > How will the cluster ensure it has the most
> > > > > > > > > > > recent version of
> > > > > > > > > > > the
> > > > > > > > > > > CIB?  Without that, how would it know whether
> > > > > > > > > > > the last man
> > > > > > > > > > > standing
> > > > > > > > > > > was shut down cleanly or not?)
> > > > > > > > > > 
> > > > > > > > > > This is my opinion, I don't really know what the
> > > > > > > > > > "official"
> > > > > > > > > > pacemaker
> > > > > > > > > > stance is: There is no such thing as shutting down a
> > > > > > > > > > cluster
> > > > > > > > > > cleanly. A
> > > > > > > > > > cluster is a process stretching over multiple nodes -
> > > > > > > > > > if they all
> > > > > > > > > > shut
> > > > > > > > > > down, the process is gone. When you start up again,
> > > > > > > > > > you
> > > > > > > > > > effectively have
> > > > > > > > > > a completely new cluster.
> > > > > > > > > 
> > > > > > > > > Sorry, I don't follow you at all here.  When you start
> > > > > > > > > the cluster
> > > > > > > > > up
> > > > > > > > > again, the cluster config from before the shutdown is
> > > > > > > > > still there.
> > > > > > > > > That's very far from being a completely new cluster :-)
> > > > > > > > 
> > > > > > > > The problem is you cannot "start the cluster" in
> > > > > > > > pacemaker; you can
> > > > > > > > only "start nodes". The nodes will come up one by one. As
> > > > > > > > opposed (as
> > > > > > > > I had said) to HP Service Guard, where there is a
> > > > > > > > "cluster formation
> > > > > > > > timeout". That is, the nodes wait for the specified time
> > > > > > > > for the
> > > > > > > > cluster to "form". Then the cluster starts as a whole. Of
> > > > > > > > course that
> > > > > > > > only applies if the whole cluster was down, not if a
> > > > > > > > single node was
> > > > > > > > down.
> > > > > > > 
> > > > > > > I'm not sure what that would specifically entail, but I'm
> > > > > > > guessing we
> > > > > > > have some of the pieces already:
> > > > > > > 
> > > > > > > - Corosync has a wait_for_all option if you want the
> > > > > > > cluster to be
> > > > > > > unable to have quorum at start-up until every node has
> > > > > > > joined. I don't
> > > > > > > think you can set a timeout that cancels it, though.
> > > > > > > 
> > > > > > > - Pacemaker will wait dc-deadtime for the first DC election
> > > > > > > to
> > > > > > > complete. (if I understand it correctly ...)
> > > > > > > 
> > > > > > > - Higher-level tools can start or stop all nodes together
> > > > > > > (e.g. pcs has
> > > > > > > pcs cluster start/stop --all).
> > > > > > 
> > > > > > Based on this discussion, I have some questions about pcs:
> > > > > > 
> > > > > > * how is it shutting down the cluster when issuing "pcs
> > > > > > cluster stop
> > > > > > --all"?
> > > > > 
> > > > > First, it sends a request to each node to stop pacemaker. The
> > > > > requests
> > > > > are sent in parallel which prevents resources from being moved
> > > > > from node
> > > > > to node. Once pacemaker stops on all nodes, corosync is stopped
> > > > > on all
> > > > > nodes in the same manner.
> > > > 
> > > > What if for some external reasons one node is slower (load, network,
> > > > whatever) than the others and starts reacting ? Sending queries in
> > > > parallel doesn't feel safe enough with regard to all the race
> > > > conditions that can occur at the same time.
> > > > 
> > > > Am I missing something ?
> > > > 
> > > 
> > > If a node gets the request later than others, some resources may
> > > be 
> > > moved to it before it starts shutting down pacemaker as well. Pcs
> 

Re: [ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ken Gaillot
On Tue, 2017-12-05 at 14:47 +0100, Ulrich Windl wrote:
> > > > Tomas Jelinek  schrieb am 04.12.2017 um
> > > > 16:50 in Nachricht
> 
> <3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>:
> > Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):
> > > On Mon, 4 Dec 2017 12:31:06 +0100
> > > Tomas Jelinek  wrote:
> > > 
> > > > Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):
> > > > > On Fri, 01 Dec 2017 16:34:08 -0600
> > > > > Ken Gaillot  wrote:
> > > > >    
> > > > > > On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
> > > > > > > 
> > > > > > >   
> > > > > > > > Kristoffer Gronlund  wrote:
> > > > > > > > > Adam Spiers  writes:
> > > > > > > > >   
> > > > > > > > > > - The whole cluster is shut down cleanly.
> > > > > > > > > > 
> > > > > > > > > > - The whole cluster is then started up
> > > > > > > > > > again.  (Side question:
> > > > > > > > > > what
> > > > > > > > > > happens if the last node to shut down is not
> > > > > > > > > > the first to
> > > > > > > > > > start up?
> > > > > > > > > > How will the cluster ensure it has the most
> > > > > > > > > > recent version of
> > > > > > > > > > the
> > > > > > > > > > CIB?  Without that, how would it know whether
> > > > > > > > > > the last man
> > > > > > > > > > standing
> > > > > > > > > > was shut down cleanly or not?)
> > > > > > > > > 
> > > > > > > > > This is my opinion, I don't really know what the
> > > > > > > > > "official"
> > > > > > > > > pacemaker
> > > > > > > > > stance is: There is no such thing as shutting down a
> > > > > > > > > cluster
> > > > > > > > > cleanly. A
> > > > > > > > > cluster is a process stretching over multiple nodes -
> > > > > > > > > if they all
> > > > > > > > > shut
> > > > > > > > > down, the process is gone. When you start up again,
> > > > > > > > > you
> > > > > > > > > effectively have
> > > > > > > > > a completely new cluster.
> > > > > > > > 
> > > > > > > > Sorry, I don't follow you at all here.  When you start
> > > > > > > > the cluster
> > > > > > > > up
> > > > > > > > again, the cluster config from before the shutdown is
> > > > > > > > still there.
> > > > > > > > That's very far from being a completely new cluster :-)
> > > > > > > 
> > > > > > > The problem is you cannot "start the cluster" in
> > > > > > > pacemaker; you can
> > > > > > > only "start nodes". The nodes will come up one by one. As
> > > > > > > opposed (as
> > > > > > > I had said) to HP Service Guard, where there is a
> > > > > > > "cluster formation
> > > > > > > timeout". That is, the nodes wait for the specified time
> > > > > > > for the
> > > > > > > cluster to "form". Then the cluster starts as a whole. Of
> > > > > > > course that
> > > > > > > only applies if the whole cluster was down, not if a
> > > > > > > single node was
> > > > > > > down.
> > > > > > 
> > > > > > I'm not sure what that would specifically entail, but I'm
> > > > > > guessing we
> > > > > > have some of the pieces already:
> > > > > > 
> > > > > > - Corosync has a wait_for_all option if you want the
> > > > > > cluster to be
> > > > > > unable to have quorum at start-up until every node has
> > > > > > joined. I don't
> > > > > > think you can set a timeout that cancels it, though.
> > > > > > 
> > > > > > - Pacemaker will wait dc-deadtime for the first DC election
> > > > > > to
> > > > > > complete. (if I understand it correctly ...)
> > > > > > 
> > > > > > - Higher-level tools can start or stop all nodes together
> > > > > > (e.g. pcs has
> > > > > > pcs cluster start/stop --all).
> > > > > 
> > > > > Based on this discussion, I have some questions about pcs:
> > > > > 
> > > > > * how is it shutting down the cluster when issuing "pcs
> > > > > cluster stop
> > > > > --all"?
> > > > 
> > > > First, it sends a request to each node to stop pacemaker. The
> > > > requests
> > > > are sent in parallel which prevents resources from being moved
> > > > from node
> > > > to node. Once pacemaker stops on all nodes, corosync is stopped
> > > > on all
> > > > nodes in the same manner.
> > > 
> > > What if for some external reasons one node is slower (load, network,
> > > whatever) than the others and starts reacting ? Sending queries in
> > > parallel doesn't feel safe enough with regard to all the race
> > > conditions that can occur at the same time.
> > > 
> > > Am I missing something ?
> > > 
> > 
> > If a node gets the request later than others, some resources may
> > be 
> > moved to it before it starts shutting down pacemaker as well. Pcs
> > waits 
> 
> I think that's impossible due to the ordering of corosync: If a
> standby is issued, and a resource migration is the consequence, every
> node will see the standby before it sees any other config change.
> Right?

pcs doesn't issue a standby, just a shutdown.

When a node needs to shut down, it sends 

[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Ulrich Windl


>>> "Gao,Yan"  schrieb am 05.12.2017 um 15:04 in Nachricht
:
> On 12/05/2017 12:41 PM, Ulrich Windl wrote:
>> 
>> 
> "Gao,Yan"  schrieb am 01.12.2017 um 20:36 in Nachricht
>> :

[...]
>> 
>> I meant: There are three delays:
>> 1) The delay until data is on the disk
> It takes several IOs for the sender to do this -- read the device
> header, look up the slot, write the message and verify the message is
> written (timeout_io defaults to 3s).
> 
> As mentioned, the msgwait timer of the sender starts only after the message
> has been verified to be written. We just need to make sure stonith-timeout
> is configured to be sufficiently longer than the sum.
> 
>> 2) Delay until data is read from the disk
> It's already taken into account with msgwait. Considering the recipient
> keeps reading in a loop, we don't know when exactly it starts to read
> for this specific message. But once it starts a read, it has to be
> done within timeout_watchdog, otherwise the watchdog triggers. So even for
> a bad case, the message should be read within 2 * timeout_watchdog. That's
> the reason why the sender has to wait msgwait, which is 2 *
> timeout_watchdog.
> 
>> 3) Delay until Host was killed
> Kill is basically immediately triggered once poison pill is read.

Considering that the response time of a SAN disk system with cache is
typically a few microseconds, writing to disk may even be "more immediate"
than killing the node via a watchdog reset ;-)
So you can't easily say one is immediate, while the other has to be waited for 
IMHO.

Regards,
Ulrich

> 
>> A confirmation before 3) could shorten the total wait that includes 2) and 
> 3),
>> right?
> As mentioned in another email, an alive node, even one indeed coming back
> from death, cannot actually confirm itself dead or even give a confirmation
> about whether it was ever dead. And a successful fencing means the node is
> dead.
> 
> Regards,
>Yan
> 
> 
>> 
>> Regards,
>> Ulrich
>> 
>> 
>>>
>>> Regards,
>>> Yan
>>>
[...]


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Gao,Yan

On 12/05/2017 12:41 PM, Ulrich Windl wrote:




"Gao,Yan"  schrieb am 01.12.2017 um 20:36 in Nachricht

:

On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:

30.11.2017 16:11, Klaus Wenninger пишет:

On 11/30/2017 01:41 PM, Ulrich Windl wrote:



"Gao,Yan"  schrieb am 30.11.2017 um 11:48 in Nachricht

:

On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:

SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
VM on VSphere using shared VMDK as SBD. During basic tests by killing
corosync and forcing STONITH pacemaker was not started after reboot.
In logs I see during boot

Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down

Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually host is not
declared as OFFLINE until 120s passed). But VM reboots lightning fast
and is up and running long before timeout expires.

As msgwait was intended for the message to arrive, and not for the reboot
time (I guess), this just shows a fundamental problem in SBD design: Receipt
of the fencing command is not confirmed (other than by seeing the
consequences of its execution).


The 2 x msgwait is not for confirmations but for writing the poison-pill
and for
having it read by the target-side.


Yes, of course, but that's not what Ulrich likely intended to say.
msgwait must account for worst-case storage path latency, while in
normal cases it happens much faster. If the fenced node could acknowledge
having been killed after reboot, the stonith agent could return success much
earlier.

How could an alive man be sure he died before? ;)


I meant: There are three delays:
1) The delay until data is on the disk
It takes several IOs for the sender to do this -- read the device
header, look up the slot, write the message and verify the message is
written (timeout_io defaults to 3s).

As mentioned, the msgwait timer of the sender starts only after the message
has been verified to be written. We just need to make sure stonith-timeout
is configured to be sufficiently longer than the sum.



2) Delay until data is read from the disk
It's already taken into account with msgwait. Considering the recipient
keeps reading in a loop, we don't know when exactly it starts to read
for this specific message. But once it starts a read, it has to be
done within timeout_watchdog, otherwise the watchdog triggers. So even for
a bad case, the message should be read within 2 * timeout_watchdog. That's
the reason why the sender has to wait msgwait, which is 2 *
timeout_watchdog.



3) Delay until Host was killed

Kill is basically immediately triggered once poison pill is read.


A confirmation before 3) could shorten the total wait that includes 2) and 3),
right?
As mentioned in another email, an alive node, even one indeed coming back
from death, cannot actually confirm itself dead or even give a confirmation
about whether it was ever dead. And a successful fencing means the node is
dead.


Regards,
  Yan




Regards,
Ulrich




Regards,
Yan
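
The values that actually ended up in the device header can be checked at
any time; a sketch, with the device path as a placeholder:

  sbd -d /dev/disk/by-id/<sbd-device> dump
  # Among other fields, the output lists Timeout (watchdog) and
  # Timeout (msgwait); msgwait is expected to be twice the watchdog
  # timeout, as described above.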



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ulrich Windl


>>> Tomas Jelinek  schrieb am 04.12.2017 um 16:50 in 
>>> Nachricht
<3e60579c-0f4d-1c32-70fc-d207e0654...@redhat.com>:
> Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):
>> On Mon, 4 Dec 2017 12:31:06 +0100
>> Tomas Jelinek  wrote:
>> 
>>> Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):
 On Fri, 01 Dec 2017 16:34:08 -0600
 Ken Gaillot  wrote:

> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:
>>
>>   
>>> Kristoffer Gronlund  wrote:
 Adam Spiers  writes:
   
> - The whole cluster is shut down cleanly.
>
> - The whole cluster is then started up again.  (Side question:
> what
> happens if the last node to shut down is not the first to
> start up?
> How will the cluster ensure it has the most recent version of
> the
> CIB?  Without that, how would it know whether the last man
> standing
> was shut down cleanly or not?)

 This is my opinion, I don't really know what the "official"
 pacemaker
 stance is: There is no such thing as shutting down a cluster
 cleanly. A
 cluster is a process stretching over multiple nodes - if they all
 shut
 down, the process is gone. When you start up again, you
 effectively have
 a completely new cluster.
>>>
>>> Sorry, I don't follow you at all here.  When you start the cluster
>>> up
>>> again, the cluster config from before the shutdown is still there.
>>> That's very far from being a completely new cluster :-)
>>
>> The problem is you cannot "start the cluster" in pacemaker; you can
>> only "start nodes". The nodes will come up one by one. As opposed (as
>> I had said) to HP Service Guard, where there is a "cluster formation
>> timeout". That is, the nodes wait for the specified time for the
>> cluster to "form". Then the cluster starts as a whole. Of course that
>> only applies if the whole cluster was down, not if a single node was
>> down.
>
> I'm not sure what that would specifically entail, but I'm guessing we
> have some of the pieces already:
>
> - Corosync has a wait_for_all option if you want the cluster to be
> unable to have quorum at start-up until every node has joined. I don't
> think you can set a timeout that cancels it, though.
>
> - Pacemaker will wait dc-deadtime for the first DC election to
> complete. (if I understand it correctly ...)
>
> - Higher-level tools can start or stop all nodes together (e.g. pcs has
> pcs cluster start/stop --all).

 Based on this discussion, I have some questions about pcs:

 * how is it shutting down the cluster when issuing "pcs cluster stop
 --all"?
>>>
>>> First, it sends a request to each node to stop pacemaker. The requests
>>> are sent in parallel which prevents resources from being moved from node
>>> to node. Once pacemaker stops on all nodes, corosync is stopped on all
>>> nodes in the same manner.
>> 
>> What if for some external reasons one node is slower (load, network,
>> whatever) than the others and starts reacting ? Sending queries in parallel
>> doesn't feel safe enough with regard to all the race conditions that can
>> occur at the same time.
>> 
>> Am I missing something ?
>> 
> 
> If a node gets the request later than others, some resources may be 
> moved to it before it starts shutting down pacemaker as well. Pcs waits 

I think that's impossible due to the ordering of corosync: If a standby is 
issued, and a resource migration is the consequence, every node will see the 
standby before it sees any other config change. Right?

> for all nodes to shut down pacemaker before it moves to shutting down 
> corosync. This way, quorum is maintained the whole time pacemaker is 
> shutting down and therefore no services are blocked from stopping due to 
> lack of quorum.
> 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ulrich Windl


>>> Klaus Wenninger  schrieb am 04.12.2017 um 16:20 in
Nachricht <24ed4710-322d-0560-b0ff-792cbf53d...@redhat.com>:
> On 12/04/2017 04:02 PM, Kristoffer Grönlund wrote:
>> Tomas Jelinek  writes:
>>
 * how is it shutting down the cluster when issuing "pcs cluster stop
--all"?
>>> First, it sends a request to each node to stop pacemaker. The requests 
>>> are sent in parallel which prevents resources from being moved from node 
>>> to node. Once pacemaker stops on all nodes, corosync is stopped on all 
>>> nodes in the same manner.
>>>
 * any race condition possible where the cib will record only one node up
   before the last one shut down?
 * will the cluster start safely?
>> That definitely sounds racy to me. The best idea I can think of would be
>> to set all nodes except one in standby, and then shutdown pacemaker
>> everywhere...
> 
> Really mean standby or rather maintenance to keep resources
> from switching to the still alive nodes during shutdown?

If all nodes are set to standby in one transaction, there is no switching,
just a stop of all resources.
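
A sketch of that "standby first, then stop" variant with pcs (the exact
subcommand names differ between pcs versions, so treat these as
illustrative):

  pcs cluster standby --all      # or: pcs node standby --all on newer pcs
  pcs cluster stop --all
  # and after the next start:
  pcs cluster start --all
  pcs cluster unstandby --all    # or: pcs node unstandby --all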

> 
> Regards,
> Klaus
> 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ulrich Windl


>>> Kristoffer Grönlund  schrieb am 04.12.2017 um 16:02
in
Nachricht <87fu8qedu0@suse.com>:
> Tomas Jelinek  writes:
> 
>>> 
>>> * how is it shutting down the cluster when issuing "pcs cluster stop
--all"?
>>
>> First, it sends a request to each node to stop pacemaker. The requests 
>> are sent in parallel which prevents resources from being moved from node 
>> to node. Once pacemaker stops on all nodes, corosync is stopped on all 
>> nodes in the same manner.
>>
>>> * any race condition possible where the cib will record only one node up
>>>   before the last one shut down?
>>> * will the cluster start safely?
> 
> That definitely sounds racy to me. The best idea I can think of would be
> to set all nodes except one in standby, and then shutdown pacemaker
> everywhere...

Why not all nodes?

> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com 
> 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: questions about startup fencing

2017-12-05 Thread Ulrich Windl


>>> Jehan-Guillaume de Rorthais  schrieb am 04.12.2017 um 
>>> 14:21 in
Nachricht <20171204142148.446ec356@firost>:
> On Mon, 4 Dec 2017 12:31:06 +0100
> Tomas Jelinek  wrote:
> 
>> Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):
>> > On Fri, 01 Dec 2017 16:34:08 -0600
>> > Ken Gaillot  wrote:
>> >   
>> >> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:  
>> >>>
>> >>>  
>>  Kristoffer Gronlund  wrote:  
>> > Adam Spiers  writes:
>> >  
>> >> - The whole cluster is shut down cleanly.
>> >>
>> >> - The whole cluster is then started up again.  (Side question:
>> >> what
>> >>happens if the last node to shut down is not the first to
>> >> start up?
>> >>How will the cluster ensure it has the most recent version of
>> >> the
>> >>CIB?  Without that, how would it know whether the last man
>> >> standing
>> >>was shut down cleanly or not?)  
>> >
>> > This is my opinion, I don't really know what the "official"
>> > pacemaker
>> > stance is: There is no such thing as shutting down a cluster
>> > cleanly. A
>> > cluster is a process stretching over multiple nodes - if they all
>> > shut
>> > down, the process is gone. When you start up again, you
>> > effectively have
>> > a completely new cluster.  
>> 
>>  Sorry, I don't follow you at all here.  When you start the cluster
>>  up
>>  again, the cluster config from before the shutdown is still there.
>>  That's very far from being a completely new cluster :-)  
>> >>>
>> >>> The problem is you cannot "start the cluster" in pacemaker; you can
>> >>> only "start nodes". The nodes will come up one by one. As opposed (as
>> >>> I had said) to HP Service Guard, where there is a "cluster formation
>> >>> timeout". That is, the nodes wait for the specified time for the
>> >>> cluster to "form". Then the cluster starts as a whole. Of course that
>> >>> only applies if the whole cluster was down, not if a single node was
>> >>> down.  
>> >>
>> >> I'm not sure what that would specifically entail, but I'm guessing we
>> >> have some of the pieces already:
>> >>
>> >> - Corosync has a wait_for_all option if you want the cluster to be
>> >> unable to have quorum at start-up until every node has joined. I don't
>> >> think you can set a timeout that cancels it, though.
>> >>
>> >> - Pacemaker will wait dc-deadtime for the first DC election to
>> >> complete. (if I understand it correctly ...)
>> >>
>> >> - Higher-level tools can start or stop all nodes together (e.g. pcs has
>> >> pcs cluster start/stop --all).  
>> > 
>> > Based on this discussion, I have some questions about pcs:
>> > 
>> > * how is it shutting down the cluster when issuing "pcs cluster stop
>> > --all"?  
>> 
>> First, it sends a request to each node to stop pacemaker. The requests 
>> are sent in parallel which prevents resources from being moved from node 
>> to node. Once pacemaker stops on all nodes, corosync is stopped on all 
>> nodes in the same manner.
> 
> What if for some external reasons one node is slower (load, network,
> whatever) than the others and starts reacting ? Sending queries in parallel
> doesn't feel safe enough with regard to all the race conditions that can
> occur at the same time.
> 
> Am I missing something ?

I can only agree that this type of "cluster shutdown" is unclean, most likely
leaving each node with a different CIB (and many aborted transitions).
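
One way to check afterwards which node ended up with the newest CIB is to
compare the version attributes on each node's on-disk copy; a sketch,
assuming the default CIB location and that pacemaker is stopped:

  CIB_file=/var/lib/pacemaker/cib/cib.xml cibadmin --query | head -n 1
  # The admin_epoch/epoch/num_updates attributes on the <cib> element
  # show which copy is the most recent.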

> 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Ulrich Windl


>>> "Gao,Yan"  schrieb am 01.12.2017 um 20:36 in Nachricht
:
> On 11/30/2017 06:48 PM, Andrei Borzenkov wrote:
>> 30.11.2017 16:11, Klaus Wenninger пишет:
>>> On 11/30/2017 01:41 PM, Ulrich Windl wrote:

>>> "Gao,Yan"  schrieb am 30.11.2017 um 11:48 in Nachricht
 :
> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>> corosync and forcing STONITH pacemaker was not started after reboot.
>> In logs I see during boot
>>
>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>> just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>> process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
> Pacemaker
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes msgwait (at least, visually host is not
>> declared as OFFLINE until 120s passed). But VM reboots lightning fast
>> and is up and running long before timeout expires.
 As msgwait was intended for the message to arrive, and not for the reboot

> time (I guess), this just shows a fundamental problem in SBD design: Receipt

> of the fencing command is not confirmed (other than by seeing the 
> consequences of its execution).
>>>
>>> The 2 x msgwait is not for confirmations but for writing the poison-pill
>>> and for
>>> having it read by the target-side.
>> 
>> Yes, of course, but that's not what Ulrich likely intended to say.
>> msgwait must account for worst-case storage path latency, while in
>> normal cases it happens much faster. If the fenced node could acknowledge
>> having been killed after reboot, the stonith agent could return success much
>> earlier.
> How could an alive man be sure he died before? ;)

I meant: There are three delays:
1) The delay until data is on the disk
2) Delay until data is read from the disk
3) Delay until Host was killed

A confirmation before 3) could shorten the total wait that includes 2) and 3),
right?

Regards,
Ulrich
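
For completeness: the 60s/120s values quoted above are the kind of thing
written into the SBD device header when it is initialized; a sketch
(destructive - it re-creates the header; the device path is a placeholder):

  sbd -d /dev/disk/by-id/<sbd-device> -1 60 -4 120 create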


> 
> Regards,
>Yan
> 
>> 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Gao,Yan

On 12/05/2017 08:57 AM, Dejan Muhamedagic wrote:

On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:

04.12.2017 14:48, Gao,Yan пишет:

On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:

30.11.2017 13:48, Gao,Yan пишет:

On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:

SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
VM on VSphere using shared VMDK as SBD. During basic tests by killing
corosync and forcing STONITH pacemaker was not started after reboot.
In logs I see during boot

Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually host is not
declared as OFFLINE until 120s passed). But VM reboots lightning fast
and is up and running long before timeout expires.

I think I have seen similar report already. Is it something that can
be fixed by SBD/pacemaker tuning?

SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.



I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
disk at all.

It simply waits that long on startup before starting the rest of the
cluster stack to make sure the fencing that targeted it has returned. It
intentionally doesn't watch anything during this period of time.



Unfortunately it waits too long.

ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
preset: disabled)
Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
4min 16s ago
   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
status=0/SUCCESS)
   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
watch (code=killed, signa
  Main PID: 1792 (code=exited, status=0/SUCCESS)

дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
daemon...
дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
Terminating.
дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
fencing daemon.
дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.

But the real problem is - even though SBD failed to start, the whole
cluster stack continues to run; and because SBD blindly trusts in
well-behaved nodes, fencing appears to succeed after the timeout ...
without anyone acting on the poison pill ...


That's something I always wondered about: if a node is capable of
reading a poison pill then it could before shutdown also write an
"I'm leaving" message into its slot. Wouldn't that make sbd more
reliable? Any reason not to implement that?
Probably it's not considered necessary :) SBD is a fencing mechanism
which only needs to ensure fencing works. SBD on the fencing target is
either there eating the pill or getting reset by the watchdog; otherwise
it's not there, which is supposed to imply that the whole cluster stack is
not running, so it doesn't need to actually eat the pill.


How systemd should handle the service dependencies is another topic...

Regards,
  Yan





Thanks,

Dejan
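
Along the lines of the earlier wish that the messaging path be exercised
explicitly, sbd can do that by hand; a sketch, with the device path and
peer node name as placeholders:

  sbd -d /dev/disk/by-id/<sbd-device> list
  sbd -d /dev/disk/by-id/<sbd-device> message <peer-node> test
  # "test" is a harmless message type; the sbd daemon on the peer logs it,
  # which at least demonstrates that the peer is reading its slot.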

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: questions about startup fencing

2017-12-05 Thread Jan Pokorný
On 05/12/17 10:01 +0100, Tomas Jelinek wrote:
> The first attempt to fix the issue was to put nodes into standby mode with
> --lifetime=reboot:
> https://github.com/ClusterLabs/pcs/commit/ea6f37983191776fd46d90f22dc1432e0bfc0b91
> 
> This didn't work for several reasons. One of them was back then there was no
> reliable way to set standby mode with --lifetime=reboot for more than one
> node in a single step. (This may have been fixed in the meantime.) There
> were however other serious reasons for not putting the nodes into standby as
> was explained by Andrew:
> - it [putting the nodes into standby first] means shutdown takes longer (no
> node stops until all the resources stop)
> - it makes shutdown more complex (== more fragile), eg...
> - it result in pcs waiting forever for resources to stop
>   - if a stop fails and the cluster is configured to start at boot, then the
> node will get fenced and happily run resources when it returns
> (because all the nodes are up so we still have quorum)

Isn't a one-off stop of a cluster without actually disabling the cluster
software from starting on boot rather antithetical?

And besides, isn't this resurrection scenario also possible with the
current parallel (hence subject to race conditions) stop in such a case,
anyway?
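
If the stop really is meant as a one-off, the boot-time enablement
mentioned above can be switched off together with the stop; a sketch
(pcs syntax assumed):

  pcs cluster disable --all    # keep cluster services from starting at boot
  pcs cluster stop --all
  # re-enable later with: pcs cluster enable --all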

> - only potentially benefits resources that have no (or very few) dependants
> and can stop quicker than it takes pcs to get through its "initiate parallel
> shutdown" loop (which should be rather fast since there is no ssh connection
> setup overheads)
> 
> So we ended up with just stopping pacemaker in parallel:
> https://github.com/ClusterLabs/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Gao,Yan

On 12/04/2017 07:55 PM, Andrei Borzenkov wrote:

04.12.2017 14:48, Gao,Yan пишет:

On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:

30.11.2017 13:48, Gao,Yan пишет:

On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:

SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
VM on VSphere using shared VMDK as SBD. During basic tests by killing
corosync and forcing STONITH pacemaker was not started after reboot.
In logs I see during boot

Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually host is not
declared as OFFLINE until 120s passed). But VM reboots lightning fast
and is up and running long before timeout expires.

I think I have seen similar report already. Is it something that can
be fixed by SBD/pacemaker tuning?

SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.



I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
disk at all.

It simply waits that long on startup before starting the rest of the
cluster stack to make sure the fencing that targeted it has returned. It
intentionally doesn't watch anything during this period of time.



Unfortunately it waits too long.

ha1:~ # systemctl status sbd.service
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
preset: disabled)
Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
4min 16s ago
   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
status=0/SUCCESS)
   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
watch (code=killed, signa
  Main PID: 1792 (code=exited, status=0/SUCCESS)

дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
daemon...
дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
Terminating.
дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
fencing daemon.
дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.

But the real problem is - even though SBD failed to start, the whole
cluster stack continues to run; and because SBD blindly trusts in
well-behaved nodes, fencing appears to succeed after the timeout ...
without anyone acting on the poison pill ...
Start of sbd reaches systemd's timeout for starting units and systemd 
proceeds...


TimeoutStartSec should be configured in sbd.service accordingly, so that it
is longer than msgwait.


Regards,
  Yan




ha1:~ # systemctl show sbd.service -p RequiredBy
RequiredBy=corosync.service

but

ha1:~ # systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; static;
vendor preset: disabled)
Active: active (running) since Mon 2017-12-04 21:45:33 MSK; 7min ago
  Docs: man:corosync
man:corosync.conf
man:corosync_overview
   Process: 1860 ExecStop=/usr/share/corosync/corosync stop (code=exited,
status=0/SUCCESS)
   Process: 2059 ExecStart=/usr/share/corosync/corosync start
(code=exited, status=0/SUCCESS)
  Main PID: 2073 (corosync)
 Tasks: 2 (limit: 4915)
CGroup: /system.slice/corosync.service
└─2073 corosync

and

ha1:~ # crm_mon -1r
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Mon Dec  4 21:53:24 2017
Last change: Mon Dec  4 21:47:25 2017 by hacluster via crmd on ha1

2 nodes configured
1 resource configured

Online: [ ha1 ha2 ]

Full list of resources:

  stonith-sbd   (stonith:external/sbd): Started ha1

and if I now sever the connection between the two nodes, I will get two
single-node clusters, each believing it won ...

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pcs create master/slave resource doesn't work (Ken Gaillot)

2017-12-05 Thread Hui Xiang
Thank you very much Ken!! You nailed it, now it's working :-)

On Tue, Dec 5, 2017 at 5:29 AM, Ken Gaillot  wrote:

> On Mon, 2017-12-04 at 23:15 +0800, Hui Xiang wrote:
> > Thanks Ken very much for the helpful information. It indeed helps a
> > lot with debugging.
> >
> >  " Each time the DC decides what to do, there will be a line like
> > "...
> > saving inputs in ..." with a file name. The log messages just before
> > that may give some useful information."
> >   - I am unable to find such information in the logs; it only prints
> > names like /var/lib/pacemaker/pengine/pe-input-xx
>
> If the cluster had nothing to do, it won't show anything, but if
> actions were needed, it should show them, like
> "Start  myrsc ( node1 )".
>
> Are there any messages with "error" or "warning" in the log?
>
> > When I compare the cib.xml file of the good node with the bad one, they
> > differ only in the order of "name" and "id" as shown below. Does that
> > matter for the cib to function normally?
>
> No, the XML attributes can be any order.
>
> I just noticed that your cluster has symmetric-cluster=false. That
> means that resources can't run anywhere by default; in order for a
> resource to run, there must be a location constraint allowing it to run
> on a node. Have you added such constraints?
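
As an illustration, with symmetric-cluster=false each resource needs a
constraint roughly like the following before it can run anywhere (resource
and node names here are placeholders, not taken from your configuration):

  pcs constraint location ovndb-servers-master prefers node1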
>
> >
> >   
> >  > name="monitor" timeout="30"/>
> >  > timeout="60"/>
> >  > timeout="60"/>
> >  > name="promote" timeout="60"/>
> >  > name="demote" timeout="60"/>
> >   
> >
> >   
> >  > timeout="30"  id="ovndb-servers-monitor-20"/>
> > 
> > 
> >   
> >
> >
> > Thanks.
> > Hui.
> >
> >
> > On Sat, Dec 2, 2017 at 5:07 AM, Ken Gaillot 
> > wrote:
> > > On Fri, 2017-12-01 at 09:36 +0800, Hui Xiang wrote:
> > > > Hi all,
> > > >
> > > >   I am using the ovndb-servers ocf agent[1] which is a kind of
> > > multi-
> > > > state resource,when I am creating it(please see my previous
> > > email),
> > > > the monitor is called only once, and the start operation is never
> > > > called, according to below description, the once called monitor
> > > > operation returned OCF_NOT_RUNNING, should the pacemaker will
> > > decide
> > > > to execute start action based this return code? is there any way
> > > to
> > >
> > > Before Pacemaker does anything with a resource, it first calls a
> > > one-
> > > time monitor (called a "probe") to find out the current status of
> > > the
> > > resource across the cluster. This allows it to discover if the
> > > service
> > > is already running somewhere.
> > >
> > > So, you will see those probes for every resource when the cluster
> > > starts, or when the resource is added to the configuration, or when
> > > the
> > > resource is cleaned up.
> > >
> > > > check out what is the next action? Currently in my environment
> > > > nothing happened and I am almost tried all I known ways to debug,
> > > > however, no lucky, could anyone help it out? thank you very much.
> > > >
> > > > Monitor Return Code   Description
> > > > OCF_NOT_RUNNING   Stopped
> > > > OCF_SUCCESS   Running (Slave)
> > > > OCF_RUNNING_MASTERRunning (Master)
> > > > OCF_FAILED_MASTER Failed (Master)
> > > > Other Failed (Slave)
> > > >
> > > >
> > > > [1] https://github.com/openvswitch/ovs/blob/master/ovn/utilities/
> > > ovnd
> > > > b-servers.ocf
> > > > Hui.
> > > >
> > > >
> > > >
> > > > On Thu, Nov 30, 2017 at 6:39 PM, Hui Xiang 
> > > > wrote:
> > > > > The really weired thing is that the monitor is only called once
> > > > > other than expected repeatedly, where should I check for it?
> > > > >
> > > > > On Thu, Nov 30, 2017 at 4:14 PM, Hui Xiang  > > >
> > > > > wrote:
> > > > > > Thanks Ken very much for your helpful infomation.
> > > > > >
> > > > > > I am now blocking on I can't see the pacemaker DC do any
> > > further
> > > > > > start/promote etc action on my resource agents, no helpful
> > > logs
> > > > > > founded.
> > >
> > > Each time the DC decides what to do, there will be a line like "...
> > > saving inputs in ..." with a file name. The log messages just
> > > before
> > > that may give some useful information.
> > >
> > > Otherwise, you can take that file, and simulate what the cluster
> > > decided at that point:
> > >
> > >   crm_simulate -Sx $FILENAME
> > >
> > > It will first show the status of the cluster at the start of the
> > > decision-making, then a "Transition Summary" with the actions that
> > > are
> > > required, then a simulated execution of those actions, and then
> > > what
> > > the resulting status would be if those actions succeeded.
> > >
> > > That may give you some more information. You can make it more
> > > verbose
> > > by using "-Ssx", or by adding "-", but it's not very user-
> > > friendly
> > > output.
> > >
> > > > > >
> > > > > > So my first question is 

Re: [ClusterLabs] Antw: Re: questions about startup fencing

2017-12-05 Thread Jehan-Guillaume de Rorthais
On Tue, 5 Dec 2017 10:05:03 +0100
Tomas Jelinek  wrote:

> Dne 4.12.2017 v 17:21 Jehan-Guillaume de Rorthais napsal(a):
> > On Mon, 4 Dec 2017 16:50:47 +0100
> > Tomas Jelinek  wrote:
> >   
> >> Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):  
> >>> On Mon, 4 Dec 2017 12:31:06 +0100
> >>> Tomas Jelinek  wrote:
> >>>  
>  Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):  
> > On Fri, 01 Dec 2017 16:34:08 -0600
> > Ken Gaillot  wrote:
> > 
> >> On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:  
> >>>
> >>>
>  Kristoffer Gronlund  wrote:  
> > Adam Spiers  writes:
> >
> >> - The whole cluster is shut down cleanly.
> >>
> >> - The whole cluster is then started up again.  (Side question:
> >> what
> >>  happens if the last node to shut down is not the first to
> >> start up?
> >>  How will the cluster ensure it has the most recent version of
> >> the
> >>  CIB?  Without that, how would it know whether the last man
> >> standing
> >>  was shut down cleanly or not?)  
> >
> > This is my opinion, I don't really know what the "official"
> > pacemaker
> > stance is: There is no such thing as shutting down a cluster
> > cleanly. A
> > cluster is a process stretching over multiple nodes - if they all
> > shut
> > down, the process is gone. When you start up again, you
> > effectively have
> > a completely new cluster.  
> 
>  Sorry, I don't follow you at all here.  When you start the cluster
>  up
>  again, the cluster config from before the shutdown is still there.
>  That's very far from being a completely new cluster :-)  
> >>>
> >>> The problem is you cannot "start the cluster" in pacemaker; you can
> >>> only "start nodes". The nodes will come up one by one. As opposed (as
> >>> I had said) to HP Sertvice Guard, where there is a "cluster formation
> >>> timeout". That is, the nodes wait for the specified time for the
> >>> cluster to "form". Then the cluster starts as a whole. Of course that
> >>> only applies if the whole cluster was down, not if a single node was
> >>> down.  
> >>
> >> I'm not sure what that would specifically entail, but I'm guessing we
> >> have some of the pieces already:
> >>
> >> - Corosync has a wait_for_all option if you want the cluster to be
> >> unable to have quorum at start-up until every node has joined. I don't
> >> think you can set a timeout that cancels it, though.
> >>
> >> - Pacemaker will wait dc-deadtime for the first DC election to
> >> complete. (if I understand it correctly ...)
> >>
> >> - Higher-level tools can start or stop all nodes together (e.g. pcs has
> >> pcs cluster start/stop --all).  
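
For reference, a minimal corosync.conf fragment enabling that option might
look roughly like this (corosync_votequorum assumed):

  quorum {
      provider: corosync_votequorum
      wait_for_all: 1
  }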
> >
> > Based on this discussion, I have some questions about pcs:
> >
> > * how is it shutting down the cluster when issuing "pcs cluster stop
> > --all"?  
> 
>  First, it sends a request to each node to stop pacemaker. The requests
>  are sent in parallel which prevents resources from being moved from node
>  to node. Once pacemaker stops on all nodes, corosync is stopped on all
>  nodes in the same manner.  
> >>>
> >>> What if for some external reasons one node is slower (load, network,
> >>> whatever) than the others and start reacting ? Sending queries in parallel
> >>> doesn't feels safe enough in regard with all the race conditions that can
> >>> occurs in the same time.
> >>>
> >>> Am I missing something ?
> >>>  
> >>
> >> If a node gets the request later than others, some resources may be
> >> moved to it before it starts shutting down pacemaker as well. Pcs waits
> >> for all nodes to shutdown pacemaker before it moves to shutting down
> >> corosync. This way, quorum is maintained the whole time pacemaker is
> >> shutting down and therefore no services are blocked from stopping due to
> >> lack of quorum.  
> > 
> > OK, so if admins or RAs expect to start under the same conditions the
> > cluster was shut down in, we have to take care of the shutdown ourselves
> > by hand. Disabling the resources before shutting down might be the best
> > option in that situation, as the CRM will then take care of switching
> > things off correctly in a proper transition.  
> 
> My understanding is that pacemaker takes care of switching off things 
> correctly in a proper transition on its shutdown. So there should be no 
> extra care needed. Pacemaker developers, however, need to confirm that.

Sure, but then, the resource would move away from the node if some other
node(s) (with appropriate 

Re: [ClusterLabs] Adding Tomcat as a resource to a Cluster on CentOS 7

2017-12-05 Thread Oyvind Albrigtsen

On 04/12/17 16:29 +, Sean Beeson wrote:

Thank you for the reply, Oyvind. I gave it plenty of time to start up.
Using tomcat_name="tomcat" it starts what I can only call a lifeless PID,
but it never seems to actually start up. The catalina.out file is never
touched, so it never has anything in it to indicate a problem. Pacemaker
does seem to be managing it though, because although this PID shows up, it
will report it as not running and then move everything to the other node.
It will do that a couple of times, but eventually that stops as well.

Try "pcs resource disable " and then "pcs resource
debug-start --full ". The last command will start the
resource and show you which commands are run, so you can troubleshoot
why it's failing.
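
With the resource name used earlier in this thread, that would be roughly:

  pcs resource disable tomcat_OuterWeb
  pcs resource debug-start --full tomcat_OuterWeb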


Kind regards,

Sean

On Thu, Nov 30, 2017 at 10:40 PM Oyvind Albrigtsen 
wrote:


Tomcat can be very slow at startup depending on the modules you use,
so you can either disable modules you aren't using to make it start
faster or set a higher start timeout via "pcs resource  op
start interval=".
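
A sketch of the latter, assuming the usual "pcs resource update" syntax and
the resource name from this thread (the 300s value is only an example):

  pcs resource update tomcat_OuterWeb op start timeout=300s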

On 30/11/17 13:26 +, Sean Beeson wrote:
>Hi, list.
>
>This is a pretty basic question. I have gone through what I could find on
>setting up Tomcat service as a resource to a cluster, but did not find
>exactly the issue I am having. Sorry, if it has been covered before.
>
>I am attempting this on centos-release-7-4.1708.el7.centos.x86_64.
>The pcs I have installed is pcs-0.9.158-6.el7.centos.x86_64
>The resource-agents installed is resource-agents-3.9.5-105.el7_4.2.x86_64
>
>I have DRBD, MySql, and a virtual IP running spectacularly well and they
>failover perfectly and do exactly what I want them. I can add Tomcat as a
>resource just fine, but it never starts and I can not find anything in
>any log file that indicates why. Pcs does at some point know to check on it,
>but simply says Tomcat is not running. If I run everything manually on
>a node in the cluster I can get Tomcat to start with systemctl. Here is how I
>am trying to configure it.
>
>[root@centos7-ha-lab-01 ~]# pcs status
>Cluster name: ha-cluster
>Stack: corosync
>Current DC: centos7-ha-lab-02-cr (version 1.1.16-12.el7_4.4-94ff4df) -
>partition with quorum
>Last updated: Thu Nov 30 21:03:36 2017
>Last change: Thu Nov 30 20:53:37 2017 by root via cibadmin on
>centos7-ha-lab-01-cr
>
>2 nodes configured
>6 resources configured
>
>Online: [ centos7-ha-lab-01-cr centos7-ha-lab-02-cr ]
>
>Full list of resources:
>
> Master/Slave Set: DRBD_data_clone [DRBD_data]
> Masters: [ centos7-ha-lab-01-cr ]
> Slaves: [ centos7-ha-lab-02-cr ]
> fsDRBD_data(ocf::heartbeat:Filesystem):Started centos7-ha-lab-01-cr
> OuterDB_Service(systemd:mysqld):Started centos7-ha-lab-01-cr
> OuterDB_VIP(ocf::heartbeat:IPaddr2):Started centos7-ha-lab-01-cr
> tomcat_OuterWeb(ocf::heartbeat:tomcat):Stopped
>
>Failed Actions:
>* tomcat_OuterWeb_start_0 on centos7-ha-lab-01-cr 'unknown error' (1):
>call=67, status=Timed Out, exitreason='none',
>last-rc-change='Thu Nov 30 20:56:22 2017', queued=0ms, exec=180003ms
>* tomcat_OuterWeb_start_0 on centos7-ha-lab-02-cr 'unknown error' (1):
>call=57, status=Timed Out, exitreason='none',
>last-rc-change='Thu Nov 30 20:53:23 2017', queued=0ms, exec=180003ms
>
>Daemon Status:
>  corosync: active/enabled
>  pacemaker: active/enabled
>  pcsd: active/enabled
>
>I have tried with and without tomcat_name=tomcat_OuterWeb and tomcat and
>root for tomcat_user=. Neither work.
>
>Here is the command I am using to add it.
>
>pcs resource create tomcat_OuterWeb ocf:heartbeat:tomcat
>java_home="/opt/java/jre1.7.0_80" catalina_home="/opt/tomcat7"
>catalina_opts="-Dbuild.compiler.emacs=true -Dfile.encoding=UTF-8
>-Djava.util.logging.config.file=/opt/tomcat7/conf/log4j.properties
>-Dlog4j.configuration=file:/opt/tomcat7/conf/log4j.properties -Xms1024m
>-Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=512m" tomcat_user="root" op
>monitor interval="15s" op start timeout="180s"
>
>I have tried also the most basic.
>pcs resource create tomcat_OuterWeb ocf:heartbeat:tomcat
>java_home="/opt/java/jre1.7.0_80" catalina_home="/opt/tomcat7"
>tomcat_name="tomcat_OuterWeb" tomcat_user="root" op monitor interval="15s"
>op start timeout="180s"
>
>In other examples I have seen, they usually use "params" before the options in
>these commands to add Tomcat as a resource, but when I use that it tells
>me it is an unrecognized option, and it then accepts the options without it
>just fine. I was led to think this was perhaps a difference in the version of
>the resource-agents.
>
>Any idea why I can not get Tomcat to start, or some lead on the logging I
>could look at to understand why it is failing, would be great. Nothing shows
>in messages, catalina.out, pcsd.log, nor the resource
>log--tomcat_OuterWeb.log. However, it does make the resource log, but it
>only has this in it, which seems to be false:
>
>2017/11/30 20:50:22: start ===
>2017/11/30 20:53:22: stop  

Re: [ClusterLabs] Antw: Re: questions about startup fencing

2017-12-05 Thread Tomas Jelinek

Dne 4.12.2017 v 17:21 Jehan-Guillaume de Rorthais napsal(a):

On Mon, 4 Dec 2017 16:50:47 +0100
Tomas Jelinek  wrote:


Dne 4.12.2017 v 14:21 Jehan-Guillaume de Rorthais napsal(a):

On Mon, 4 Dec 2017 12:31:06 +0100
Tomas Jelinek  wrote:
   

Dne 4.12.2017 v 10:36 Jehan-Guillaume de Rorthais napsal(a):

On Fri, 01 Dec 2017 16:34:08 -0600
Ken Gaillot  wrote:
  

On Thu, 2017-11-30 at 07:55 +0100, Ulrich Windl wrote:


 

Kristoffer Gronlund  wrote:

Adam Spiers  writes:
 

- The whole cluster is shut down cleanly.

- The whole cluster is then started up again.  (Side question:
what
     happens if the last node to shut down is not the first to
start up?
     How will the cluster ensure it has the most recent version of
the
     CIB?  Without that, how would it know whether the last man
standing
     was shut down cleanly or not?)


This is my opinion, I don't really know what the "official"
pacemaker
stance is: There is no such thing as shutting down a cluster
cleanly. A
cluster is a process stretching over multiple nodes - if they all
shut
down, the process is gone. When you start up again, you
effectively have
a completely new cluster.


Sorry, I don't follow you at all here.  When you start the cluster
up
again, the cluster config from before the shutdown is still there.
That's very far from being a completely new cluster :-)


The problem is you cannot "start the cluster" in pacemaker; you can
only "start nodes". The nodes will come up one by one. As opposed (as
I had said) to HP Sertvice Guard, where there is a "cluster formation
timeout". That is, the nodes wait for the specified time for the
cluster to "form". Then the cluster starts as a whole. Of course that
only applies if the whole cluster was down, not if a single node was
down.


I'm not sure what that would specifically entail, but I'm guessing we
have some of the pieces already:

- Corosync has a wait_for_all option if you want the cluster to be
unable to have quorum at start-up until every node has joined. I don't
think you can set a timeout that cancels it, though.

- Pacemaker will wait dc-deadtime for the first DC election to
complete. (if I understand it correctly ...)

- Higher-level tools can start or stop all nodes together (e.g. pcs has
pcs cluster start/stop --all).


Based on this discussion, I have some questions about pcs:

* how is it shutting down the cluster when issuing "pcs cluster stop
--all"?


First, it sends a request to each node to stop pacemaker. The requests
are sent in parallel which prevents resources from being moved from node
to node. Once pacemaker stops on all nodes, corosync is stopped on all
nodes in the same manner.


What if for some external reasons one node is slower (load, network,
whatever) than the others and start reacting ? Sending queries in parallel
doesn't feels safe enough in regard with all the race conditions that can
occurs in the same time.

Am I missing something ?
   


If a node gets the request later than others, some resources may be
moved to it before it starts shutting down pacemaker as well. Pcs waits
for all nodes to shutdown pacemaker before it moves to shutting down
corosync. This way, quorum is maintained the whole time pacemaker is
shutting down and therefore no services are blocked from stopping due to
lack of quorum.


OK, so if admins or RAs expect to start under the same conditions the cluster
was shut down in, we have to take care of the shutdown ourselves by hand.
Disabling the resources before shutting down might be the best option in that
situation, as the CRM will then take care of switching things off correctly
in a proper transition.


My understanding is that pacemaker takes care of switching off things 
correctly in a proper transition on its shutdown. So there should be no 
extra care needed. Pacemaker developers, however, need to confirm that.




That's fine by me, as a cluster shutdown should be part of a controlled
procedure. I suppose I now have to update my online docs.

Thank you for your answers!



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: questions about startup fencing

2017-12-05 Thread Tomas Jelinek

Dne 4.12.2017 v 23:17 Ken Gaillot napsal(a):

On Mon, 2017-12-04 at 22:08 +0300, Andrei Borzenkov wrote:

04.12.2017 18:47, Tomas Jelinek пишет:

Dne 4.12.2017 v 16:02 Kristoffer Grönlund napsal(a):

Tomas Jelinek  writes:



* how is it shutting down the cluster when issuing "pcs
cluster stop
--all"?


First, it sends a request to each node to stop pacemaker. The
requests
are sent in parallel which prevents resources from being moved
from node
to node. Once pacemaker stops on all nodes, corosync is stopped
on all
nodes in the same manner.


* any race condition possible where the cib will record only
one
node up before
     the last one shut down?
* will the cluster start safely?


That definitely sounds racy to me. The best idea I can think of
would be
to set all nodes except one in standby, and then shutdown
pacemaker
everywhere...



What issues does it solve? Which node should be the one?

How do you get the nodes out of standby mode on startup?


Is --lifetime=reboot valid for cluster properties? It is accepted by
crm_attribute and actually puts value as transient_attribute.


standby is a node attribute, so lifetime does apply normally.
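
For example, something along these lines (the node name is a placeholder,
standard crm_attribute options assumed) puts a node in standby only until its
next reboot:

  crm_attribute --node node1 --name standby --update on --lifetime reboot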



Right, I forgot about this.

I was dealing with 'pcs cluster stop --all' back in January 2015, so I 
don't remember all the details anymore. However, I was able to dig out 
the private email thread where stopping a cluster was discussed with 
pacemaker developers including Andrew Beekhof and David Vossel.


Originally, pcs was stopping nodes in parallel in such a manner that 
each node stopped pacemaker and then corosync independently of other 
nodes. This caused loss of quorum during stopping the cluster, as nodes 
hosting resources which stopped fast disconnected from corosync sooner 
than nodes hosting resources which stopped slowly. Due to quorum 
missing, some resources could not be stopped and the cluster stop 
failed. This is covered in here:

https://bugzilla.redhat.com/show_bug.cgi?id=1180506

The first attempt to fix the issue was to put nodes into standby mode 
with --lifetime=reboot:

https://github.com/ClusterLabs/pcs/commit/ea6f37983191776fd46d90f22dc1432e0bfc0b91

This didn't work for several reasons. One of them was that, back then, there 
was no reliable way to set standby mode with --lifetime=reboot for more 
than one node in a single step. (This may have been fixed in the 
meantime.) There were, however, other serious reasons for not putting the 
nodes into standby, as Andrew explained:
- it [putting the nodes into standby first] means shutdown takes longer 
(no node stops until all the resources stop)

- it makes shutdown more complex (== more fragile), eg...
- it results in pcs waiting forever for resources to stop
  - if a stop fails and the cluster is configured to start at boot, 
then the node will get fenced and happily run resources when it returns 
(because all the nodes are up so we still have quorum)
- only potentially benefits resources that have no (or very few) 
dependants and can stop quicker than it takes pcs to get through its 
"initiate parallel shutdown" loop (which should be rather fast since 
there is no ssh connection setup overhead)


So we ended up with just stopping pacemaker in parallel:
https://github.com/ClusterLabs/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d

I hope this shed light on why pcs stops clusters the way it does and 
that standby was considered but rejected for good reasons.


Regards,
Tomas

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does not work

2017-12-05 Thread 井上 和徳
Hi Ken,

Thank you for your comment. ("cibadmin --empty" is interesting.)

I registered it in CLBZ:
https://bugs.clusterlabs.org/show_bug.cgi?id=5331

Best Regards

> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Saturday, December 02, 2017 8:02 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] PCMK_node_start_state=standby sometimes does not 
> work
> 
> On Tue, 2017-11-28 at 09:36 +, 井上 和徳 wrote:
> > Hi,
> >
> > Sometimes a node with 'PCMK_node_start_state=standby' will start up
> > Online.
> >
> > [ reproduction scenario ]
> >  * Set 'PCMK_node_start_state=standby' to /etc/sysconfig/pacemaker.
> >  * Delete cib (/var/lib/pacemaker/cib/*).
> >  * Start pacemaker at the same time on 2 nodes.
> >   # for i in rhel74-1 rhel74-3 ; do ssh -f $i systemctl start
> > pacemaker ; done
> >
> > [ actual result ]
> >  * crm_mon
> >   Stack: corosync
> >   Current DC: rhel74-3 (version 1.1.18-2b07d5c) - partition with
> > quorum
> >   Last change: Wed Nov 22 06:22:50 2017 by hacluster via crmd on
> > rhel74-3
> >
> >   2 nodes configured
> >   0 resources configured
> >
> >   Node rhel74-3: standby
> >   Online: [ rhel74-1 ]
> >
> >  * cib.xml
> >   
> > 
> > 
> >   
> >  > value="on"/>
> >   
> > 
> >   
> >
> >  * pacemaker.log
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: (cib_native.c:462 )
> > warning: cib_native_perform_op_delegate:Call failed: No such
> > device or address
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update    > id="3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update  > es id="nodes-3232261507">
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update    > id="nodes-3232261507-standby" name="standby" value="on"/>
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update  > tes>
> >   Nov 22 06:22:50 [20755] rhel74-1   crmd: ( cib_attrs.c:320
> > )info: update_attr_delegate:Update   
> >
> >  * I attached crm_report to GitHub (too big to attach to this email),
> > so look at it.
> >    https://github.com/inouekazu/pcmk_report/blob/master/pcmk-Wed-22-N
> > ov-2017.tar.bz2
> >
> >
> > I think that the additional timing of *1 and
> > *2 is the cause.
> > *1 '
> > *2 
> >   > value="on"/>
> >
> > I expect to be fixed, but if it's difficult, I have two questions.
> > 1) Does this only occur if there is no cib.xml (in other words, there
> > is no  element)?
> 
> I believe so. I think this is the key message:
> 
> Nov 22 06:22:50 [20750] rhel74-1cib: ( callbacks.c:1101  )
> warning: cib_process_request:Completed cib_modify operation for
> section nodes: No such device or address (rc=-6, origin=rhel74-
> 1/crmd/12, version=0.3.0)
> 
> PCMK_node_start_state works by setting the "standby" node attribute in
> the CIB. However, it does this via a "modify" command that assumes the
> <nodes> tag already exists.
> 
> If there is no CIB, pacemaker will quickly create one -- but in this
> case, the node tries to set the attribute before that's happened.
> 
> Hopefully we can come up with a fix. If you want, you can file a bug
> report at bugs.clusterlabs.org, to track the progress.
> 
> > 2) Is there any workaround other than "Do not start at the same
> > time"?
> >
> > Best Regards
> 
> Before starting pacemaker, if /var/lib/pacemaker/cib is empty, you can
> create a skeleton CIB with:
> 
>  cibadmin --empty > /var/lib/pacemaker/cib/cib.xml
> 
> That will include an empty <nodes> tag, and the modify command should
> work when pacemaker starts.
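
A sketch of that workaround as a full sequence, assuming the
ownership/permissions pacemaker normally uses for its CIB files:

  cibadmin --empty > /var/lib/pacemaker/cib/cib.xml
  chown hacluster:haclient /var/lib/pacemaker/cib/cib.xml
  chmod 600 /var/lib/pacemaker/cib/cib.xml
  systemctl start pacemaker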
> --
> Ken Gaillot 
> 
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Dejan Muhamedagic
On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
> 04.12.2017 14:48, Gao,Yan пишет:
> > On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
> >> 30.11.2017 13:48, Gao,Yan пишет:
> >>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>  SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>  VM on VSphere using shared VMDK as SBD. During basic tests by killing
>  corosync and forcing STONITH pacemaker was not started after reboot.
>  In logs I see during boot
> 
>  Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>  just fenced by sapprod01p for sapprod01p
>  Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>  process (3151) can no longer be respawned,
>  Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>  Pacemaker
> 
>  SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>  stonith with SBD always takes msgwait (at least, visually host is not
>  declared as OFFLINE until 120s passed). But VM rebots lightning fast
>  and is up and running long before timeout expires.
> 
>  I think I have seen similar report already. Is it something that can
>  be fixed by SBD/pacemaker tuning?
> >>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
> >>>
> >>
> >> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
> >> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
> >> disk at all. 
> > It simply waits that long on startup before starting the rest of the
> > cluster stack to make sure the fencing that targeted it has returned. It
> > intentionally doesn't watch anything during this period of time.
> > 
> 
> Unfortunately it waits too long.
> 
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
>Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
> 4min 16s ago
>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
> status=0/SUCCESS)
>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=killed, signa
>  Main PID: 1792 (code=exited, status=0/SUCCESS)
> 
> дек 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> дек 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> дек 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> дек 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
> 
> But the real problem is - in spite of SBD failed to start, the whole
> cluster stack continues to run; and because SBD blindly trusts in well
> behaving nodes, fencing appears to succeed after timeout ... without
> anyone taking any action on poison pill ...

That's something I always wondered about: if a node is capable of
reading a poison pill then it could before shutdown also write an
"I'm leaving" message into its slot. Wouldn't that make sbd more
reliable? Any reason not to implement that?

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org