Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-20 Thread Andrei Borzenkov
18.02.2019 18:53, Ken Gaillot пишет:
> On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
>> 17.02.2019 0:33, Andrei Borzenkov пишет:
>>> 17.02.2019 0:03, Eric Robinson пишет:
 Here are the relevant corosync logs.

 It appears that the stop action for resource p_mysql_002 failed,
 and that caused a cascading series of service changes. However, I
 don't understand why, since no other resources are dependent on
 p_mysql_002.

>>>
>>> You have mandatory colocation constraints for each SQL resource
>>> with
>>> VIP. it means that to move SQL resource to another node pacemaker
>>> also
>>> must move VIP to another node which in turn means it needs to move
>>> all
>>> other dependent resources as well.
>>> ...
 Feb 16 14:06:39 [3912] 001db01apengine:  warning:
 check_migration_threshold:Forcing p_mysql_002 away from
 001db01a after 100 failures (max=100)
>>>
>>> ...
 Feb 16 14:06:39 [3912] 001db01apengine:   notice:
 LogAction: *
 Stop   p_vip_clust01 (   001db01a
 )   blocked
>>>
>>> ...
 Feb 16 14:06:39 [3912] 001db01apengine:   notice:
 LogAction: *
 Stop   p_mysql_001   (   001db01a )   due
 to colocation with p_vip_clust01
>>
>> There is apparently more in it. Note that p_vip_clust01 operation is
>> "blocked". That is because mandatory order constraint is symmetrical
>> by
>> default, so to move VIP pacemaker needs first to stop it on current
>> node; but before it can stop VIP it needs to (be able to) stop
>> p_mysql_002; but it cannot do it because by default when "stop" fails
>> without stonith, the resource is blocked and no further actions are
>> possible - i.e. the resource can no longer be stopped (or even be
>> attempted to be stopped).
> 
> Correct, failed stop actions are special -- an on-fail policy of "stop"
> or "restart" requires a stop, so obviously they can't be applied to
> failed stops. As you mentioned, without fencing, on-fail defaults to
> "block" for stops, which should freeze the resource as it is.
> 
>> I still consider this rather questionable behavior. I tried to
>> reproduce
>> it and I see the same.
>>
>> 1. After this happens resource p_mysql_002 has target=Stopped in CIB.
>> Why, oh why, pacemaker tries to "force away" resource that is not
>> going
>> to be started on another node anyway?
> 
> Without having the policy engine inputs, I can't be sure, but I suspect
> p_mysql_002 is not being forced away, but its failure causes that node
> to be less preferred for the resources it depends on.
> 
>> 2. pacemaker knows that it cannot stop (and hence move)
>> p_vip_clust01,
>> still it happily will stop all resources that depend on it in
>> preparation to move them and leave them at that because it cannot
>> move
> 
> I think this is the point at which the behavior is undesirable, because
> it would be relevant whether the move was related to the blocked
> failure or not. Feel free to open a bug report and attach the relevant
> policy engine input (or a crm_report).
> 

https://bugs.clusterlabs.org/show_bug.cgi?id=5379

>> them. Resources are neither restarted on current node, nor moved to
>> another node. At this point I'd expect pacemaker to be smart enough
>> and
>> not even initiate actions that are known to be unsuccessful.
>>
>> The best we can do at this point is set symmetrical=false which
>> allows
>> move to actually happen, but it still means downtime for resources
>> that
>> are moved and has its own can of worms in normal case.
> --
> Ken Gaillot 
> 



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Andrei Borzenkov
19.02.2019 23:06, Eric Robinson пишет:
...
> Bottom line is, how do we configure the cluster in such a way that
> there are no cascading circumstances when a MySQL resource fails?
> Basically, if a MySQL resource fails, it fails. We'll deal with that
> on an ad-hoc basis. I don't want the whole cluster to barf.
...
> This is probably a dumb question, but can we remove just the monitor 
> operation but leave the resource configured in the cluster? If a node fails 
> over, we do want the resources to start automatically on the new primary node.

While you can do that, the problem discussed in this thread was caused by
a failure to stop a resource, not by a resource failure during normal operation.
The logs you provided started with


Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:
+  /cib:  @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:
++ /cib/configuration/resources/primitive[@id='p_mysql_002']:

Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:
++   
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:
++


so apparently the administrator decided to stop this MySQL instance (I am
not sure whether pacemaker keeps or logs the origin of a CIB change, or
whether it is even possible to determine it).

So removing the monitor operation would not help with this. You probably
still need to set on-fail=ignore for each operation on the MySQL
resources to get the desired behavior.
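
Roughly something like this with pcs (an untested sketch, repeated for each
p_mysql_* resource; the timeouts are just the existing values from your
configuration):

  pcs resource update p_mysql_002 \
      op start interval=0s timeout=15 on-fail=ignore \
      op stop interval=0s timeout=15 on-fail=ignore \
      op monitor interval=15 timeout=15 on-fail=ignore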


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Ken Gaillot
On Tue, 2019-02-19 at 20:06 +, Eric Robinson wrote:
> > -Original Message-
> > From: Users  On Behalf Of Ken
> > Gaillot
> > Sent: Tuesday, February 19, 2019 10:31 AM
> > To: Cluster Labs - All topics related to open-source clustering
> > welcomed
> > 
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> > Just One
> > Fails?
> > 
> > On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> > > > -Original Message-
> > > > From: Users  On Behalf Of Andrei
> > > > Borzenkov
> > > > Sent: Sunday, February 17, 2019 11:56 AM
> > > > To: users@clusterlabs.org
> > > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> > > > Just
> > > > One Fails?
> > > > 
> > > > 17.02.2019 0:44, Eric Robinson пишет:
> > > > > Thanks for the feedback, Andrei.
> > > > > 
> > > > > I only want cluster failover to occur if the filesystem or
> > > > > drbd
> > > > > resources fail,
> > > > 
> > > > or if the cluster messaging layer detects a complete node
> > > > failure.
> > > > Is there a
> > > > way to tell PaceMaker not to trigger a cluster failover if any
> > > > of
> > > > the p_mysql resources fail?
> > > > > 
> > > > 
> > > > Let's look at this differently. If all these applications
> > > > depend on
> > > > each other, you should not be able to stop individual resource
> > > > in
> > > > the first place - you need to group them or define dependency
> > > > so
> > > > that stopping any resource would stop everything.
> > > > 
> > > > If these applications are independent, they should not share
> > > > resources.
> > > > Each MySQL application should have own IP and own FS and own
> > > > block
> > > > device for this FS so that they can be moved between cluster
> > > > nodes
> > > > independently.
> > > > 
> > > > Anything else will lead to troubles as you already observed.
> > > 
> > > FYI, the MySQL services do not depend on each other. All of them
> > > depend on the floating IP, which depends on the filesystem, which
> > > depends on DRBD, but they do not depend on each other. Ideally,
> > > the
> > > failure of p_mysql_002 should not cause failure of other mysql
> > > resources, but now I understand why it happened. Pacemaker wanted
> > > to
> > > start it on the other node, so it needed to move the floating IP,
> > > filesystem, and DRBD primary, which had the cascade effect of
> > > stopping
> > > the other MySQL resources.
> > > 
> > > I think I also understand why the p_vip_clust01 resource blocked.
> > > 
> > > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> > > but
> > > then Corosync+Pacemaker. The past 12 years have been relatively
> > > problem free. This symptom is new for us, only within the past
> > > year.
> > > Our cluster nodes have many separate instances of MySQL running,
> > > so it
> > > is not practical to have that many filesystems, IPs, etc. We are
> > > content with the way things are, except for this new troubling
> > > behavior.
> > > 
> > > If I understand the thread correctly, on-fail=stop will not work
> > > because the cluster will still try to stop the resources that are
> > > implied dependencies.
> > > 
> > > Bottom line is, how do we configure the cluster in such a way
> > > that
> > > there are no cascading circumstances when a MySQL resource fails?
> > > Basically, if a MySQL resource fails, it fails. We'll deal with
> > > that
> > > on an ad-hoc basis. I don't want the whole cluster to barf. What
> > > about
> > > on-fail=ignore? Earlier, you suggested symmetrical=false might
> > > also do
> > > the trick, but you said it comes with its own can of worms.
> > > What are the downsides with on-fail=ignore or symmetrical=false?
> > > 
> > > --Eric
> > 
> > Even adding on-fail=ignore to the recurring monitors may not do
> > what you
> > want, because I suspect that even an ignored failure will make the
> > node less
> > preferable for all the other resources. But it's worth testing.
> > 
> > Otherwise, your best option is to remove all the recurring monitors
> > from the
> > mysql resources, and rely on external monitoring (e.g. nagios,
> > icinga, monit,
> > ...) to detect problems.
> 
> This is probably a dumb question, but can we remove just the monitor
> operation but leave the resource configured in the cluster? If a node
> fails over, we do want the resources to start automatically on the
> new primary node.

Yes, operations can be added/removed without affecting the
configuration of the resource itself.
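
For example (a sketch with pcs, using one of the resource names from this
thread):

  pcs resource op remove p_mysql_002 monitor
  pcs resource show p_mysql_002     # definition is intact, just no monitor op
  pcs resource op add p_mysql_002 monitor interval=15 timeout=15
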
-- 
Ken Gaillot 



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Ken Gaillot
> Sent: Tuesday, February 19, 2019 10:31 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> > > -Original Message-
> > > From: Users  On Behalf Of Andrei
> > > Borzenkov
> > > Sent: Sunday, February 17, 2019 11:56 AM
> > > To: users@clusterlabs.org
> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > One Fails?
> > >
> > > 17.02.2019 0:44, Eric Robinson пишет:
> > > > Thanks for the feedback, Andrei.
> > > >
> > > > I only want cluster failover to occur if the filesystem or drbd
> > > > resources fail,
> > >
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a
> > > way to tell PaceMaker not to trigger a cluster failover if any of
> > > the p_mysql resources fail?
> > > >
> > >
> > > Let's look at this differently. If all these applications depend on
> > > each other, you should not be able to stop individual resource in
> > > the first place - you need to group them or define dependency so
> > > that stopping any resource would stop everything.
> > >
> > > If these applications are independent, they should not share
> > > resources.
> > > Each MySQL application should have own IP and own FS and own block
> > > device for this FS so that they can be moved between cluster nodes
> > > independently.
> > >
> > > Anything else will lead to troubles as you already observed.
> >
> > FYI, the MySQL services do not depend on each other. All of them
> > depend on the floating IP, which depends on the filesystem, which
> > depends on DRBD, but they do not depend on each other. Ideally, the
> > failure of p_mysql_002 should not cause failure of other mysql
> > resources, but now I understand why it happened. Pacemaker wanted to
> > start it on the other node, so it needed to move the floating IP,
> > filesystem, and DRBD primary, which had the cascade effect of stopping
> > the other MySQL resources.
> >
> > I think I also understand why the p_vip_clust01 resource blocked.
> >
> > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> > then Corosync+Pacemaker. The past 12 years have been relatively
> > problem free. This symptom is new for us, only within the past year.
> > Our cluster nodes have many separate instances of MySQL running, so it
> > is not practical to have that many filesystems, IPs, etc. We are
> > content with the way things are, except for this new troubling
> > behavior.
> >
> > If I understand the thread correctly, on-fail=stop will not work
> > because the cluster will still try to stop the resources that are
> > implied dependencies.
> >
> > Bottom line is, how do we configure the cluster in such a way that
> > there are no cascading circumstances when a MySQL resource fails?
> > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
> > on-fail=ignore? Earlier, you suggested symmetrical=false might also do
> > the trick, but you said it comes with its own can of worms.
> > What are the downsides with on-fail=ignore or symmetrical=false?
> >
> > --Eric
> 
> Even adding on-fail=ignore to the recurring monitors may not do what you
> want, because I suspect that even an ignored failure will make the node less
> preferable for all the other resources. But it's worth testing.
> 
> Otherwise, your best option is to remove all the recurring monitors from the
> mysql resources, and rely on external monitoring (e.g. nagios, icinga, monit,
> ...) to detect problems.

This is probably a dumb question, but can we remove just the monitor operation 
but leave the resource configured in the cluster? If a node fails over, we do 
want the resources to start automatically on the new primary node.

> --
> Ken Gaillot 
> 


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Ken Gaillot
On Tue, 2019-02-19 at 17:40 +, Eric Robinson wrote:
> > -Original Message-
> > From: Users  On Behalf Of Andrei
> > Borzenkov
> > Sent: Sunday, February 17, 2019 11:56 AM
> > To: users@clusterlabs.org
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> > Just One
> > Fails?
> > 
> > 17.02.2019 0:44, Eric Robinson пишет:
> > > Thanks for the feedback, Andrei.
> > > 
> > > I only want cluster failover to occur if the filesystem or drbd
> > > resources fail,
> > 
> > or if the cluster messaging layer detects a complete node failure.
> > Is there a
> > way to tell PaceMaker not to trigger a cluster failover if any of
> > the p_mysql
> > resources fail?
> > > 
> > 
> > Let's look at this differently. If all these applications depend on
> > each other,
> > you should not be able to stop individual resource in the first
> > place - you
> > need to group them or define dependency so that stopping any
> > resource
> > would stop everything.
> > 
> > If these applications are independent, they should not share
> > resources.
> > Each MySQL application should have own IP and own FS and own block
> > device for this FS so that they can be moved between cluster nodes
> > independently.
> > 
> > Anything else will lead to troubles as you already observed.
> 
> FYI, the MySQL services do not depend on each other. All of them
> depend on the floating IP, which depends on the filesystem, which
> depends on DRBD, but they do not depend on each other. Ideally, the
> failure of p_mysql_002 should not cause failure of other mysql
> resources, but now I understand why it happened. Pacemaker wanted to
> start it on the other node, so it needed to move the floating IP,
> filesystem, and DRBD primary, which had the cascade effect of
> stopping the other MySQL resources.
> 
> I think I also understand why the p_vip_clust01 resource blocked. 
> 
> FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> then Corosync+Pacemaker. The past 12 years have been relatively
> problem free. This symptom is new for us, only within the past year.
> Our cluster nodes have many separate instances of MySQL running, so
> it is not practical to have that many filesystems, IPs, etc. We are
> content with the way things are, except for this new troubling
> behavior.
> 
> If I understand the thread correctly, on-fail=stop will not work
> because the cluster will still try to stop the resources that are
> implied dependencies.
> 
> Bottom line is, how do we configure the cluster in such a way that
> there are no cascading circumstances when a MySQL resource fails?
> Basically, if a MySQL resource fails, it fails. We'll deal with that
> on an ad-hoc basis. I don't want the whole cluster to barf. What
> about on-fail=ignore? Earlier, you suggested symmetrical=false might
> also do the trick, but you said it comes with its own can of worms.
> What are the downsides with on-fail=ignore or symmetrical=false?
> 
> --Eric

Even adding on-fail=ignore to the recurring monitors may not do what
you want, because I suspect that even an ignored failure will make the
node less preferable for all the other resources. But it's worth
testing.

Otherwise, your best option is to remove all the recurring monitors
from the mysql resources, and rely on external monitoring (e.g. nagios,
icinga, monit, ...) to detect problems.
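
Something like this would drop the recurring monitors (sketch only; the
resource list is taken from the configuration you posted):

  for r in p_mysql_000 p_mysql_001 p_mysql_002 p_mysql_003 p_mysql_004 \
           p_mysql_005 p_mysql_006 p_mysql_007 p_mysql_008 p_mysql_622; do
      pcs resource op remove "$r" monitor
  done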
-- 
Ken Gaillot 



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson пишет:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or drbd resources 
> > fail,
> or if the cluster messaging layer detects a complete node failure. Is there a
> way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql
> resources fail?
> >
> 
> Let's look at this differently. If all these applications depend on each 
> other,
> you should not be able to stop individual resource in the first place - you
> need to group them or define dependency so that stopping any resource
> would stop everything.
> 
> If these applications are independent, they should not share resources.
> Each MySQL application should have own IP and own FS and own block
> device for this FS so that they can be moved between cluster nodes
> independently.
> 
> Anything else will lead to troubles as you already observed.

FYI, the MySQL services do not depend on each other. All of them depend on the 
floating IP, which depends on the filesystem, which depends on DRBD, but they 
do not depend on each other. Ideally, the failure of p_mysql_002 should not 
cause failure of other mysql resources, but now I understand why it happened. 
Pacemaker wanted to start it on the other node, so it needed to move the 
floating IP, filesystem, and DRBD primary, which had the cascade effect of 
stopping the other MySQL resources.

I think I also understand why the p_vip_clust01 resource blocked. 

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then 
Corosync+Pacemaker. The past 12 years have been relatively problem free. This 
symptom is new for us, only within the past year. Our cluster nodes have many 
separate instances of MySQL running, so it is not practical to have that many 
filesystems, IPs, etc. We are content with the way things are, except for this 
new troubling behavior.

If I understand the thread correctly, on-fail=stop will not work because the 
cluster will still try to stop the resources that are implied dependencies.

Bottom line is, how do we configure the cluster in such a way that there are no 
cascading circumstances when a MySQL resource fails? Basically, if a MySQL 
resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want 
the whole cluster to barf. What about on-fail=ignore? Earlier, you suggested 
symmetrical=false might also do the trick, but you said it comes with its own 
can of worms. What are the downsides with on-fail=ignore or symmetrical=false?

--Eric








Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-18 Thread Ken Gaillot
On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
> 17.02.2019 0:33, Andrei Borzenkov пишет:
> > 17.02.2019 0:03, Eric Robinson пишет:
> > > Here are the relevant corosync logs.
> > > 
> > > It appears that the stop action for resource p_mysql_002 failed,
> > > and that caused a cascading series of service changes. However, I
> > > don't understand why, since no other resources are dependent on
> > > p_mysql_002.
> > > 
> > 
> > You have mandatory colocation constraints for each SQL resource
> > with
> > VIP. it means that to move SQL resource to another node pacemaker
> > also
> > must move VIP to another node which in turn means it needs to move
> > all
> > other dependent resources as well.
> > ...
> > > Feb 16 14:06:39 [3912] 001db01apengine:  warning:
> > > check_migration_threshold:Forcing p_mysql_002 away from
> > > 001db01a after 100 failures (max=100)
> > 
> > ...
> > > Feb 16 14:06:39 [3912] 001db01apengine:   notice:
> > > LogAction: *
> > > Stop   p_vip_clust01 (   001db01a
> > > )   blocked
> > 
> > ...
> > > Feb 16 14:06:39 [3912] 001db01apengine:   notice:
> > > LogAction: *
> > > Stop   p_mysql_001   (   001db01a )   due
> > > to colocation with p_vip_clust01
> 
> There is apparently more in it. Note that p_vip_clust01 operation is
> "blocked". That is because mandatory order constraint is symmetrical
> by
> default, so to move VIP pacemaker needs first to stop it on current
> node; but before it can stop VIP it needs to (be able to) stop
> p_mysql_002; but it cannot do it because by default when "stop" fails
> without stonith, the resource is blocked and no further actions are
> possible - i.e. the resource can no longer be stopped (or even be
> attempted to be stopped).

Correct, failed stop actions are special -- an on-fail policy of "stop"
or "restart" requires a stop, so obviously they can't be applied to
failed stops. As you mentioned, without fencing, on-fail defaults to
"block" for stops, which should freeze the resource as it is.

> I still consider this rather questionable behavior. I tried to
> reproduce
> it and I see the same.
> 
> 1. After this happens resource p_mysql_002 has target=Stopped in CIB.
> Why, oh why, pacemaker tries to "force away" resource that is not
> going
> to be started on another node anyway?

Without having the policy engine inputs, I can't be sure, but I suspect
p_mysql_002 is not being forced away, but its failure causes that node
to be less preferred for the resources it depends on.

> 2. pacemaker knows that it cannot stop (and hence move)
> p_vip_clust01,
> still it happily will stop all resources that depend on it in
> preparation to move them and leave them at that because it cannot
> move

I think this is the point at which the behavior is undesirable, because
whether or not the move is related to the blocked failure should be
taken into account. Feel free to open a bug report and attach the
relevant policy engine input (or a crm_report).
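
Something along these lines should gather what's needed (a sketch; the time
window and output name are placeholders based on the timestamps in your logs):

  crm_report --from "2019-02-16 14:00:00" --to "2019-02-16 14:15:00" /tmp/p_mysql_002_block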

> them. Resources are neither restarted on current node, nor moved to
> another node. At this point I'd expect pacemaker to be smart enough
> and
> not even initiate actions that are known to be unsuccessful.
> 
> The best we can do at this point is set symmetrical=false which
> allows
> move to actually happen, but it still means downtime for resources
> that
> are moved and has its own can of worms in normal case.
--
Ken Gaillot 



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-17 Thread Andrei Borzenkov
17.02.2019 0:44, Eric Robinson пишет:
> Thanks for the feedback, Andrei.
> 
> I only want cluster failover to occur if the filesystem or drbd resources 
> fail, or if the cluster messaging layer detects a complete node failure. Is 
> there a way to tell PaceMaker not to trigger a cluster failover if any of the 
> p_mysql resources fail?  
> 

Let's look at this differently. If all these applications depend on each
other, you should not be able to stop an individual resource in the first
place - you need to group them or define a dependency so that stopping any
resource stops everything.

If these applications are independent, they should not share resources.
Each MySQL application should have its own IP, its own FS, and its own block
device for that FS so that they can be moved between cluster nodes
independently.

Anything else will lead to trouble, as you have already observed.
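
For the first case, a group is the simplest way to make the all-or-nothing
dependency explicit; a rough pcs sketch with a made-up group name (the
existing order/colocation constraints among the grouped resources would then
be redundant and should be removed):

  pcs resource group add g_clust01 p_fs_clust01 p_vip_clust01 \
      p_mysql_000 p_mysql_001 p_mysql_002 p_mysql_003 p_mysql_004 p_mysql_005 p_mysql_622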


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-17 Thread Andrei Borzenkov
17.02.2019 0:33, Andrei Borzenkov пишет:
> 17.02.2019 0:03, Eric Robinson пишет:
>> Here are the relevant corosync logs.
>>
>> It appears that the stop action for resource p_mysql_002 failed, and that 
>> caused a cascading series of service changes. However, I don't understand 
>> why, since no other resources are dependent on p_mysql_002.
>>
> 
> You have mandatory colocation constraints for each SQL resource with
> VIP. it means that to move SQL resource to another node pacemaker also
> must move VIP to another node which in turn means it needs to move all
> other dependent resources as well.
> ...
>> Feb 16 14:06:39 [3912] 001db01apengine:  warning: 
>> check_migration_threshold:Forcing p_mysql_002 away from 001db01a 
>> after 100 failures (max=100)
> ...
>> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
>> Stop   p_vip_clust01 (   001db01a )   blocked
> ...
>> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
>> Stop   p_mysql_001   (   001db01a )   due to 
>> colocation with p_vip_clust01
> 

There is apparently more to it. Note that the p_vip_clust01 operation is
"blocked". That is because a mandatory order constraint is symmetrical by
default, so to move the VIP pacemaker first needs to stop it on the current
node; but before it can stop the VIP it needs to (be able to) stop
p_mysql_002; and it cannot do that because, by default, when "stop" fails
without stonith the resource is blocked and no further actions are possible
- i.e. the resource can no longer be stopped (or even be attempted to be stopped).

I still consider this rather questionable behavior. I tried to reproduce
it and I see the same.

1. After this happens, resource p_mysql_002 has target=Stopped in the CIB.
Why, oh why, does pacemaker try to "force away" a resource that is not going
to be started on another node anyway?

2. pacemaker knows that it cannot stop (and hence move) p_vip_clust01,
yet it still happily stops all resources that depend on it in
preparation to move them, and then leaves them that way because it cannot
move them. The resources are neither restarted on the current node nor
moved to another node. At this point I'd expect pacemaker to be smart
enough not to even initiate actions that are known to be unsuccessful.

The best we can do at this point is to set symmetrical=false, which allows
the move to actually happen, but that still means downtime for the resources
that are moved and has its own can of worms in the normal case.
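
With pcs that would mean re-creating each of the VIP->MySQL order constraints
with symmetrical=false; a sketch for one of them (the constraint id below is a
placeholder - look up the real one with the first command):

  pcs constraint show --full
  pcs constraint remove order-p_vip_clust01-p_mysql_002-mandatory
  pcs constraint order start p_vip_clust01 then start p_mysql_002 kind=Mandatory symmetrical=false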


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 10:23:17PM +, Eric Robinson wrote:
> I'm looking through the docs but I don't see how to set the on-fail value for 
> a resource. 

It is not set on the resource itself but on each of the actions (monitor, 
start, stop). 

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Andrei Borzenkov
17.02.2019 0:44, Eric Robinson пишет:
> Thanks for the feedback, Andrei.
> 
> I only want cluster failover to occur if the filesystem or drbd resources 
> fail, or if the cluster messaging layer detects a complete node failure. Is 
> there a way to tell PaceMaker not to trigger a cluster failover if any of the 
> p_mysql resources fail?  
> 

The closest you can get is disabling the recurring monitor action. In this
case pacemaker will effectively ignore any resource state change.
Unfortunately this also means your resource agent must now correctly
handle requests in the wrong state - i.e. it must be able to stop a
resource that has already failed earlier without returning an error to
pacemaker.

You may set the resource to "unmanaged", but this will also prevent
pacemaker from starting/stopping the resource at all. As a compromise you
may set "unmanaged" after the resource has been started and unset it before
stopping it, but then you have exactly the same issue - if the resource has
failed, as soon as you manage it again pacemaker will trigger the
corresponding action.
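
For reference, that would look roughly like this with pcs (sketch):

  pcs resource unmanage p_mysql_002   # pacemaker stops reacting to this resource's state
  pcs resource manage p_mysql_002     # hand control back before an orderly stop/move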

Pacemaker's design is different from any other cluster resource manager I
have seen. Pacemaker is designed to maintain the target resource state at
any cost. Pacemaker has no notion of "important" or "unimportant"
resources at all. Even playing with scores won't help, because a failed
resource outweighs everything else with a -INFINITY score, thus pushing
everything that depends on it away from its current node.

In this particular case it may be argued that pacemaker's reaction is
unjustified. The administrator explicitly set the target state to "stop"
(otherwise pacemaker would not attempt to stop it), so it is unclear why
it tries to restart it on the other node.

>> -Original Message-
>> From: Users  On Behalf Of Andrei
>> Borzenkov
>> Sent: Saturday, February 16, 2019 1:34 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
>> Fails?
>>
>> 17.02.2019 0:03, Eric Robinson пишет:
>>> Here are the relevant corosync logs.
>>>
>>> It appears that the stop action for resource p_mysql_002 failed, and that
>> caused a cascading series of service changes. However, I don't understand
>> why, since no other resources are dependent on p_mysql_002.
>>>
>>
>> You have mandatory colocation constraints for each SQL resource with VIP. it
>> means that to move SQL resource to another node pacemaker also must
>> move VIP to another node which in turn means it needs to move all other
>> dependent resources as well.
>> ...
>>> Feb 16 14:06:39 [3912] 001db01apengine:  warning:
>> check_migration_threshold:Forcing p_mysql_002 away from 001db01a
>> after 100 failures (max=100)
>> ...
>>> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
>>> Stop
>> p_vip_clust01 (   001db01a )   blocked
>> ...
>>> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
>>> Stop
>> p_mysql_001   (   001db01a )   due to colocation with 
>> p_vip_clust01
>>



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
I'm looking through the docs but I don't see how to set the on-fail value for a 
resource. 


> -Original Message-
> From: Users  On Behalf Of Eric Robinson
> Sent: Saturday, February 16, 2019 1:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> > On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> > > I just noticed that. I also noticed that the lsb init script has a
> > > hard-coded stop timeout of 30 seconds. So if the init script waits
> > > longer than the cluster resource timeout of 15s, that would cause
> > > the
> >
> > Yes, you should use higher timeouts in pacemaker (45s for example).
> >
> > > resource to fail. However, I don't want cluster failover to be
> > > triggered by the failure of one of the MySQL resources. I only want
> > > cluster failover to occur if the filesystem or drbd resources fail,
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a way to tell PaceMaker not to trigger cluster failover if
> > > any of the p_mysql resources fail?
> >
> > You can try playing with the on-fail option but I'm not sure how
> > reliably this whole setup will work without some form of fencing/stonith.
> >
> > https://clusterlabs.org/pacemaker/doc/en-
> >
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
> 
> Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
> I'm
> looking for, at least for the MySQL resources.
> 
> >
> > --
> > Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> > I just noticed that. I also noticed that the lsb init script has a
> > hard-coded stop timeout of 30 seconds. So if the init script waits
> > longer than the cluster resource timeout of 15s, that would cause the
> 
> Yes, you should use higher timeouts in pacemaker (45s for example).
> 
> > resource to fail. However, I don't want cluster failover to be
> > triggered by the failure of one of the MySQL resources. I only want
> > cluster failover to occur if the filesystem or drbd resources fail, or
> > if the cluster messaging layer detects a complete node failure. Is
> > there a way to tell PaceMaker not to trigger cluster failover if any
> > of the p_mysql resources fail?
> 
> You can try playing with the on-fail option but I'm not sure how reliably this
> whole setup will work without some form of fencing/stonith.
> 
> https://clusterlabs.org/pacemaker/doc/en-
> US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what 
I'm looking for, at least for the MySQL resources. 

> 
> --
> Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Thanks for the feedback, Andrei.

I only want cluster failover to occur if the filesystem or drbd resources fail, 
or if the cluster messaging layer detects a complete node failure. Is there a 
way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql 
resources fail?  

> -Original Message-
> From: Users  On Behalf Of Andrei
> Borzenkov
> Sent: Saturday, February 16, 2019 1:34 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:03, Eric Robinson пишет:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and that
> caused a cascading series of service changes. However, I don't understand
> why, since no other resources are dependent on p_mysql_002.
> >
> 
> You have mandatory colocation constraints for each SQL resource with VIP. it
> means that to move SQL resource to another node pacemaker also must
> move VIP to another node which in turn means it needs to move all other
> dependent resources as well.
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:  warning:
> check_migration_threshold:Forcing p_mysql_002 away from 001db01a
> after 100 failures (max=100)
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> > Stop
> p_vip_clust01 (   001db01a )   blocked
> ...
> > Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> > Stop
> p_mysql_001   (   001db01a )   due to colocation with 
> p_vip_clust01
> 


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> I just noticed that. I also noticed that the lsb init script has a
> hard-coded stop timeout of 30 seconds. So if the init script waits
> longer than the cluster resource timeout of 15s, that would cause the

Yes, you should use higher timeouts in pacemaker (45s for example).
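
For example, per resource (sketch):

  pcs resource update p_mysql_002 op stop interval=0s timeout=45s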

> resource to fail. However, I don't want cluster failover to be
> triggered by the failure of one of the MySQL resources. I only want
> cluster failover to occur if the filesystem or drbd resources fail, or
> if the cluster messaging layer detects a complete node failure. Is
> there a way to tell PaceMaker not to trigger cluster failover if any
> of the p_mysql resources fail?  

You can try playing with the on-fail option but I'm not sure how
reliably this whole setup will work without some form of fencing/stonith.

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Andrei Borzenkov
17.02.2019 0:03, Eric Robinson пишет:
> Here are the relevant corosync logs.
> 
> It appears that the stop action for resource p_mysql_002 failed, and that 
> caused a cascading series of service changes. However, I don't understand 
> why, since no other resources are dependent on p_mysql_002.
> 

You have mandatory colocation constraints for each SQL resource with the
VIP. It means that to move an SQL resource to another node pacemaker also
must move the VIP to another node, which in turn means it needs to move all
other dependent resources as well.
...
> Feb 16 14:06:39 [3912] 001db01apengine:  warning: 
> check_migration_threshold:Forcing p_mysql_002 away from 001db01a 
> after 100 failures (max=100)
...
> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> Stop   p_vip_clust01 (   001db01a )   blocked
...
> Feb 16 14:06:39 [3912] 001db01apengine:   notice: LogAction: * 
> Stop   p_mysql_001   (   001db01a )   due to 
> colocation with p_vip_clust01
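
Once the underlying stop problem is sorted out, the accumulated failures can
be inspected and cleared so the resource is no longer forced away (sketch):

  pcs resource failcount show p_mysql_002
  pcs resource cleanup p_mysql_002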



Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 1:28 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and
> > that caused a cascading series of service changes. However, I don't
> > understand why, since no other resources are dependent on p_mysql_002.
> 
> The stop failed because of a timeout (15s), so you can try to update that
> value:
> 


I just noticed that. I also noticed that the lsb init script has a hard-coded 
stop timeout of 30 seconds. So if the init script waits longer than the cluster 
resource timeout of 15s, that would cause the resource to fail. However, I 
don't want cluster failover to be triggered by the failure of one of the MySQL 
resources. I only want cluster failover to occur if the filesystem or drbd 
resources fail, or if the cluster messaging layer detects a complete node 
failure. Is there a way to tell Pacemaker not to trigger cluster failover if 
any of the p_mysql resources fail?  


>   Result of stop operation for p_mysql_002 on 001db01a: Timed Out |
> call=1094 key=p_mysql_002_stop_0 timeout=15000ms
> 
> After the stop failed it should have fenced that node, but you don't have
> fencing configured so it tries to move mysql_002 and all the other resources
> related to it (vip, fs, drbd) to the other node.
> Since other mysql resources depend on the same (vip, fs, drbd) they need to
> be stopped first.
> 
> --
> Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> Here are the relevant corosync logs.
> 
> It appears that the stop action for resource p_mysql_002 failed, and
> that caused a cascading series of service changes. However, I don't
> understand why, since no other resources are dependent on p_mysql_002.

The stop failed because of a timeout (15s), so you can try to update
that value:

  Result of stop operation for p_mysql_002 on 001db01a: Timed Out | call=1094 
key=p_mysql_002_stop_0 timeout=15000ms

After the stop failed it should have fenced that node, but you don't
have fencing configured so it tries to move mysql_002 and all the
other resources related to it (vip, fs, drbd) to the other node.
Since other mysql resources depend on the same (vip, fs, drbd) they
need to be stopped first.

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:50:57PM +, Eric Robinson wrote:
> Which logs? You mean /var/log/cluster/corosync.log?

On the DC node pacemaker will be logging the actions it is trying
to run (start or stop some resources).

> But even if the stop action is resulting in an error, why would the
> cluster also try to stop the other services which are not dependent?

When the resource is failed, pacemaker might still try to run stop for
that resource. If the lsb script is not correct that might also stop
other mysql resources. But this should all be reported in the pacemaker
log.

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Here are the relevant corosync logs.

It appears that the stop action for resource p_mysql_002 failed, and that 
caused a cascading series of service changes. However, I don't understand why, 
since no other resources are dependent on p_mysql_002.

[root@001db01a cluster]# cat corosync_filtered.log
Feb 16 14:06:24 [3908] 001db01acib: info: cib_process_request:  
Forwarding cib_apply_diff operation for section 'all' to all 
(origin=local/cibadmin/2)
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   Diff: 
--- 0.345.30 2
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   Diff: 
+++ 0.346.0 cc0da1b030418ec8b7c72db1115e2af1
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   +  
/cib:  @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++ 
/cib/configuration/resources/primitive[@id='p_mysql_002']:  
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++  
 
Feb 16 14:06:24 [3908] 001db01acib: info: cib_perform_op:   ++  
   
Feb 16 14:06:24 [3908] 001db01acib: info: cib_process_request:  
Completed cib_apply_diff operation for section 'all': OK (rc=0, 
origin=001db01a/cibadmin/2, version=0.346.0)
Feb 16 14:06:24 [3913] 001db01a   crmd: info: abort_transition_graph:   
Transition aborted by meta_attributes.p_mysql_002-meta_attributes 'create': 
Configuration change | cib=0.346.0 source=te_update_diff:456 
path=/cib/configuration/resources/primitive[@id='p_mysql_002'] complete=true
Feb 16 14:06:24 [3913] 001db01a   crmd:   notice: do_state_transition:  
State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph
Feb 16 14:06:24 [3912] 001db01apengine:   notice: unpack_config:On loss 
of CCM Quorum: Ignore
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_online_status:  
Node 001db01b is online
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_online_status:  
Node 001db01a is online
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd0:0 active in master mode on 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd1:0 active on 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_004 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_005 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd0:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_drbd1:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_001 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: determine_op_status:  
Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 2 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 1 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 2 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: unpack_node_loop: Node 1 
is already processed
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_vip_clust01   (ocf::heartbeat:IPaddr2):   Started 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: clone_print:   
Master/Slave Set: ms_drbd0 [p_drbd0]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Masters: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Slaves: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01apengine: info: clone_print:   
Master/Slave Set: ms_drbd1 [p_drbd1]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Masters: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01apengine: info: short_print:   
Slaves: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_fs_clust01(ocf::heartbeat:Filesystem):Started 001db01a
Feb 16 14:06:24 [3912] 001db01apengine: info: common_print: 
p_fs_clust02(ocf::heartbeat:Filesystem):Started 001db01b
Feb 16 14:06:24 [3912] 001db01apengine: 

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
Hi Valentin --

Which logs? You mean /var/log/cluster/corosync.log?

But even if the stop action is resulting in an error, why would the cluster 
also try to stop the other services which are not dependent?

> -Original Message-
> From: Users  On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 12:44 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> > Why is it that when one of the resources that start with p_mysql_*
> > goes into a FAILED state, all the other MySQL services also stop?
> 
> Perhaps stop is not working correctly for these lsb services, so for example
> stopping lsb:mysql_004 also stops the other lsb:mysql_nnn.
> 
> You would need to send the logs from the event to confirm this.
> 
> --
> Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> Why is it that when one of the resources that start with p_mysql_*
> goes into a FAILED state, all the other MySQL services also stop?

Perhaps stop is not working correctly for these lsb services, so for
example stopping lsb:mysql_004 also stops the other lsb:mysql_nnn.
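
A quick way to check that by hand, outside of the cluster's control (sketch;
assumes the usual LSB script locations under /etc/init.d):

  /etc/init.d/mysql_004 stop;   echo "rc=$?"
  /etc/init.d/mysql_003 status; echo "rc=$?"   # should still report running (rc=0) if the scripts are independent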

You would need to send the logs from the event to confirm this.

-- 
Valentin


Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
: p_mysql_004 (class=lsb type=mysql_004)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_004-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_004-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_004-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_004-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_004-stop-interval-0s)
Resource: p_mysql_005 (class=lsb type=mysql_005)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_005-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_005-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_005-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_005-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_005-stop-interval-0s)
Resource: p_mysql_006 (class=lsb type=mysql_006)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_006-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_006-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_006-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_006-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_006-stop-interval-0s)
Resource: p_mysql_007 (class=lsb type=mysql_007)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_007-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_007-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_007-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_007-start-interval-0s)
 stop interval=0s timeout=15 (p_mysql_007-stop-interval-0s)
Resource: p_mysql_008 (class=lsb type=mysql_008)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_008-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_008-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_008-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_008-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_008-stop-interval-0s)
Resource: p_mysql_622 (class=lsb type=mysql_622)
  Operations: force-reload interval=0s timeout=15 
(p_mysql_622-force-reload-interval-0s)
  monitor interval=15 timeout=15 (p_mysql_622-monitor-interval-15)
  restart interval=0s timeout=15 (p_mysql_622-restart-interval-0s)
  start interval=0s timeout=15 (p_mysql_622-start-interval-0s)
  stop interval=0s timeout=15 (p_mysql_622-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: p_vip_clust02
Enabled on: 001db01b (score:INFINITY) (role: Started) 
(id:cli-prefer-p_vip_clust02)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

Alerts:
No alerts defined

Resources Defaults:
resource-stickiness: 100
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: 001db01ab
dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
have-watchdog: false
last-lrm-refresh: 1550347798
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false

--Eric


From: Users  On Behalf Of Eric Robinson
Sent: Saturday, February 16, 2019 12:34 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: [ClusterLabs] Why Do All The Services Go

[ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Eric Robinson
These are the resources on our cluster.

[root@001db01a ~]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with 
quorum
Last updated: Sat Feb 16 15:24:55 2019
Last change: Sat Feb 16 15:10:21 2019 by root via cibadmin on 001db01b

2 nodes configured
18 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

p_vip_clust01  (ocf::heartbeat:IPaddr2):   Started 001db01a
Master/Slave Set: ms_drbd0 [p_drbd0]
 Masters: [ 001db01a ]
 Slaves: [ 001db01b ]
Master/Slave Set: ms_drbd1 [p_drbd1]
 Masters: [ 001db01b ]
 Slaves: [ 001db01a ]
p_fs_clust01   (ocf::heartbeat:Filesystem):Started 001db01a
p_fs_clust02   (ocf::heartbeat:Filesystem):Started 001db01b
p_vip_clust02  (ocf::heartbeat:IPaddr2):   Started 001db01b
p_mysql_001(lsb:mysql_001):Started 001db01a
p_mysql_000(lsb:mysql_000):Started 001db01a
p_mysql_002(lsb:mysql_002):Started 001db01a
p_mysql_003(lsb:mysql_003):Started 001db01a
p_mysql_004(lsb:mysql_004):Started 001db01a
p_mysql_005(lsb:mysql_005):Started 001db01a
p_mysql_006(lsb:mysql_006):Started 001db01b
p_mysql_007(lsb:mysql_007):Started 001db01b
p_mysql_008(lsb:mysql_008):Started 001db01b
p_mysql_622(lsb:mysql_622):Started 001db01a

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Why is it that when one of the resources that start with p_mysql_* goes into a 
FAILED state, all the other MySQL services also stop?

[root@001db01a ~]# pcs constraint
Location Constraints:
  Resource: p_vip_clust02
Enabled on: 001db01b (score:INFINITY) (role: Started)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

--Eric




