Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
18.02.2019 18:53, Ken Gaillot wrote:
> On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
>> 17.02.2019 0:33, Andrei Borzenkov wrote:
>>> 17.02.2019 0:03, Eric Robinson wrote:
>>>> Here are the relevant corosync logs. It appears that the stop
>>>> action for resource p_mysql_002 failed, and that caused a
>>>> cascading series of service changes. However, I don't understand
>>>> why, since no other resources are dependent on p_mysql_002.
>>>
>>> You have mandatory colocation constraints for each SQL resource
>>> with the VIP. That means that to move an SQL resource to another
>>> node, pacemaker must also move the VIP to another node, which in
>>> turn means it needs to move all other dependent resources as well.
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
>>
>> There is apparently more to it. Note that the p_vip_clust01 operation
>> is "blocked". That is because a mandatory order constraint is
>> symmetrical by default, so to move the VIP pacemaker first needs to
>> stop it on the current node; but before it can stop the VIP it needs
>> to (be able to) stop p_mysql_002; and it cannot do that because, by
>> default, when "stop" fails without stonith, the resource is blocked
>> and no further actions are possible - i.e. no further stop of the
>> resource can even be attempted.
>
> Correct, failed stop actions are special -- an on-fail policy of "stop"
> or "restart" requires a stop, so obviously they can't be applied to
> failed stops. As you mentioned, without fencing, on-fail defaults to
> "block" for stops, which should freeze the resource as it is.
>
>> I still consider this rather questionable behavior. I tried to
>> reproduce it and I see the same.
>>
>> 1. After this happens, resource p_mysql_002 has target=Stopped in the
>> CIB. Why, oh why, does pacemaker try to "force away" a resource that
>> is not going to be started on another node anyway?
>
> Without having the policy engine inputs, I can't be sure, but I suspect
> p_mysql_002 is not being forced away, but its failure causes that node
> to be less preferred for the resources it depends on.
>
>> 2. pacemaker knows that it cannot stop (and hence move)
>> p_vip_clust01, still it happily will stop all resources that depend
>> on it in preparation to move them and leave them at that because it
>> cannot move
>
> I think this is the point at which the behavior is undesirable, because
> it would be relevant whether the move was related to the blocked
> failure or not. Feel free to open a bug report and attach the relevant
> policy engine input (or a crm_report).

https://bugs.clusterlabs.org/show_bug.cgi?id=5379

>> them. Resources are neither restarted on the current node nor moved
>> to another node. At this point I'd expect pacemaker to be smart
>> enough not even to initiate actions that are known to be
>> unsuccessful.
>>
>> The best we can do at this point is set symmetrical=false, which
>> allows the move to actually happen, but it still means downtime for
>> the resources that are moved, and it has its own can of worms in the
>> normal case.
> --
> Ken Gaillot

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
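[Editor's note] For readers following along, the symmetrical=false workaround discussed in this message can be sketched with pcs roughly as below. The resource IDs come from the thread, but the command itself is an assumption (it is not quoted anywhere in the thread) and the exact syntax may vary by pcs version.

```shell
# Make the VIP -> MySQL ordering one-way: p_mysql_002 still starts only
# after p_vip_clust01, but stopping/moving the VIP no longer requires
# p_mysql_002 to be stopped first.
pcs constraint order start p_vip_clust01 then start p_mysql_002 symmetrical=false
```

As Andrei notes, this lets the move actually happen, but it still means downtime for the resources being moved.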
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
19.02.2019 23:06, Eric Robinson wrote:
...
> Bottom line is, how do we configure the cluster in such a way that
> there are no cascading circumstances when a MySQL resource fails?
> Basically, if a MySQL resource fails, it fails. We'll deal with that
> on an ad-hoc basis. I don't want the whole cluster to barf.
...
> This is probably a dumb question, but can we remove just the monitor
> operation but leave the resource configured in the cluster? If a node
> fails over, we do want the resources to start automatically on the new
> primary node.

While you can do that, the problem discussed in this thread was caused
by a failure to stop the resource, not by a resource failure during
normal operation. The logs you provided started with

Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: + /cib: @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='p_mysql_002']:
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++

so apparently an administrator decided to stop this MySQL instance (I am
not sure whether pacemaker keeps or logs the origin of a CIB change, or
whether it is even possible to determine it). Removing the monitor
operation would therefore not help with this. You probably still need to
set on-fail=ignore for each operation on the MySQL resources to get the
desired behavior.
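[Editor's note] A hypothetical sketch of the per-operation on-fail=ignore setting mentioned above. The resource ID is from the thread, but the command, intervals, and timeouts are assumptions and must match the operations actually configured on your resource.

```shell
# on-fail is an operation property, so it is set per operation,
# not once on the resource.
pcs resource update p_mysql_002 \
    op monitor interval=30s timeout=20s on-fail=ignore \
    op start interval=0s timeout=60s on-fail=ignore
```

Note that a failed *stop* still defaults to on-fail=block when stonith is disabled, which is exactly the case that started this thread.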
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Tue, 2019-02-19 at 20:06 +0000, Eric Robinson wrote:
> > -----Original Message-----
> > From: Users On Behalf Of Ken Gaillot
> > Sent: Tuesday, February 19, 2019 10:31 AM
> > To: Cluster Labs - All topics related to open-source clustering welcomed
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> > Fails?
> >
> > On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > > > -----Original Message-----
> > > > From: Users On Behalf Of Andrei Borzenkov
> > > > Sent: Sunday, February 17, 2019 11:56 AM
> > > > To: users@clusterlabs.org
> > > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > > One Fails?
> > > >
> > > > 17.02.2019 0:44, Eric Robinson wrote:
> > > > > Thanks for the feedback, Andrei.
> > > > >
> > > > > I only want cluster failover to occur if the filesystem or drbd
> > > > > resources fail, or if the cluster messaging layer detects a
> > > > > complete node failure. Is there a way to tell Pacemaker not to
> > > > > trigger a cluster failover if any of the p_mysql resources fail?
> > > >
> > > > Let's look at this differently. If all these applications depend on
> > > > each other, you should not be able to stop an individual resource in
> > > > the first place - you need to group them or define a dependency so
> > > > that stopping any resource would stop everything.
> > > >
> > > > If these applications are independent, they should not share
> > > > resources. Each MySQL application should have its own IP and own FS
> > > > and own block device for this FS so that they can be moved between
> > > > cluster nodes independently.
> > > >
> > > > Anything else will lead to troubles, as you already observed.
> > >
> > > FYI, the MySQL services do not depend on each other. All of them
> > > depend on the floating IP, which depends on the filesystem, which
> > > depends on DRBD, but they do not depend on each other. Ideally, the
> > > failure of p_mysql_002 should not cause failure of other mysql
> > > resources, but now I understand why it happened. Pacemaker wanted to
> > > start it on the other node, so it needed to move the floating IP,
> > > filesystem, and DRBD primary, which had the cascade effect of
> > > stopping the other MySQL resources.
> > >
> > > I think I also understand why the p_vip_clust01 resource blocked.
> > >
> > > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> > > then Corosync+Pacemaker. The past 12 years have been relatively
> > > problem free. This symptom is new for us, only within the past year.
> > > Our cluster nodes have many separate instances of MySQL running, so
> > > it is not practical to have that many filesystems, IPs, etc. We are
> > > content with the way things are, except for this new troubling
> > > behavior.
> > >
> > > If I understand the thread correctly, on-fail=stop will not work
> > > because the cluster will still try to stop the resources that are
> > > implied dependencies.
> > >
> > > Bottom line is, how do we configure the cluster in such a way that
> > > there are no cascading circumstances when a MySQL resource fails?
> > > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > > on an ad-hoc basis. I don't want the whole cluster to barf. What
> > > about on-fail=ignore? Earlier, you suggested symmetrical=false might
> > > also do the trick, but you said it comes with its own can of worms.
> > > What are the downsides of on-fail=ignore or symmetrical=false?
> > >
> > > --Eric
> >
> > Even adding on-fail=ignore to the recurring monitors may not do what
> > you want, because I suspect that even an ignored failure will make the
> > node less preferable for all the other resources. But it's worth
> > testing.
> >
> > Otherwise, your best option is to remove all the recurring monitors
> > from the mysql resources, and rely on external monitoring (e.g. nagios,
> > icinga, monit, ...) to detect problems.
>
> This is probably a dumb question, but can we remove just the monitor
> operation but leave the resource configured in the cluster? If a node
> fails over, we do want the resources to start automatically on the new
> primary node.

Yes, operations can be added/removed without affecting the configuration
of the resource itself.
--
Ken Gaillot
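[Editor's note] For illustration (these commands are an assumption, not taken from the thread), removing only the recurring monitor while leaving the resource defined could look like:

```shell
# Inspect the currently configured operations first
pcs resource show p_mysql_002      # "pcs resource config" on newer pcs
# Drop only the recurring monitor; the start/stop operations remain, so
# the resource is still started automatically on the surviving node
pcs resource op remove p_mysql_002 monitor
```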
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -----Original Message-----
> From: Users On Behalf Of Ken Gaillot
> Sent: Tuesday, February 19, 2019 10:31 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > > -----Original Message-----
> > > From: Users On Behalf Of Andrei Borzenkov
> > > Sent: Sunday, February 17, 2019 11:56 AM
> > > To: users@clusterlabs.org
> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > > One Fails?
> > >
> > > 17.02.2019 0:44, Eric Robinson wrote:
> > > > Thanks for the feedback, Andrei.
> > > >
> > > > I only want cluster failover to occur if the filesystem or drbd
> > > > resources fail, or if the cluster messaging layer detects a
> > > > complete node failure. Is there a way to tell Pacemaker not to
> > > > trigger a cluster failover if any of the p_mysql resources fail?
> > >
> > > Let's look at this differently. If all these applications depend on
> > > each other, you should not be able to stop an individual resource in
> > > the first place - you need to group them or define a dependency so
> > > that stopping any resource would stop everything.
> > >
> > > If these applications are independent, they should not share
> > > resources. Each MySQL application should have its own IP and own FS
> > > and own block device for this FS so that they can be moved between
> > > cluster nodes independently.
> > >
> > > Anything else will lead to troubles, as you already observed.
> >
> > FYI, the MySQL services do not depend on each other. All of them
> > depend on the floating IP, which depends on the filesystem, which
> > depends on DRBD, but they do not depend on each other. Ideally, the
> > failure of p_mysql_002 should not cause failure of other mysql
> > resources, but now I understand why it happened. Pacemaker wanted to
> > start it on the other node, so it needed to move the floating IP,
> > filesystem, and DRBD primary, which had the cascade effect of stopping
> > the other MySQL resources.
> >
> > I think I also understand why the p_vip_clust01 resource blocked.
> >
> > FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> > then Corosync+Pacemaker. The past 12 years have been relatively
> > problem free. This symptom is new for us, only within the past year.
> > Our cluster nodes have many separate instances of MySQL running, so it
> > is not practical to have that many filesystems, IPs, etc. We are
> > content with the way things are, except for this new troubling
> > behavior.
> >
> > If I understand the thread correctly, on-fail=stop will not work
> > because the cluster will still try to stop the resources that are
> > implied dependencies.
> >
> > Bottom line is, how do we configure the cluster in such a way that
> > there are no cascading circumstances when a MySQL resource fails?
> > Basically, if a MySQL resource fails, it fails. We'll deal with that
> > on an ad-hoc basis. I don't want the whole cluster to barf. What about
> > on-fail=ignore? Earlier, you suggested symmetrical=false might also do
> > the trick, but you said it comes with its own can of worms.
> > What are the downsides of on-fail=ignore or symmetrical=false?
> >
> > --Eric
>
> Even adding on-fail=ignore to the recurring monitors may not do what you
> want, because I suspect that even an ignored failure will make the node
> less preferable for all the other resources. But it's worth testing.
>
> Otherwise, your best option is to remove all the recurring monitors from
> the mysql resources, and rely on external monitoring (e.g. nagios,
> icinga, monit, ...) to detect problems.

This is probably a dumb question, but can we remove just the monitor
operation but leave the resource configured in the cluster? If a node
fails over, we do want the resources to start automatically on the new
primary node.

> --
> Ken Gaillot
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> > -----Original Message-----
> > From: Users On Behalf Of Andrei Borzenkov
> > Sent: Sunday, February 17, 2019 11:56 AM
> > To: users@clusterlabs.org
> > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> > One Fails?
> >
> > 17.02.2019 0:44, Eric Robinson wrote:
> > > Thanks for the feedback, Andrei.
> > >
> > > I only want cluster failover to occur if the filesystem or drbd
> > > resources fail, or if the cluster messaging layer detects a complete
> > > node failure. Is there a way to tell Pacemaker not to trigger a
> > > cluster failover if any of the p_mysql resources fail?
> >
> > Let's look at this differently. If all these applications depend on
> > each other, you should not be able to stop an individual resource in
> > the first place - you need to group them or define a dependency so
> > that stopping any resource would stop everything.
> >
> > If these applications are independent, they should not share
> > resources. Each MySQL application should have its own IP and own FS
> > and own block device for this FS so that they can be moved between
> > cluster nodes independently.
> >
> > Anything else will lead to troubles, as you already observed.
>
> FYI, the MySQL services do not depend on each other. All of them
> depend on the floating IP, which depends on the filesystem, which
> depends on DRBD, but they do not depend on each other. Ideally, the
> failure of p_mysql_002 should not cause failure of other mysql
> resources, but now I understand why it happened. Pacemaker wanted to
> start it on the other node, so it needed to move the floating IP,
> filesystem, and DRBD primary, which had the cascade effect of
> stopping the other MySQL resources.
>
> I think I also understand why the p_vip_clust01 resource blocked.
>
> FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
> then Corosync+Pacemaker. The past 12 years have been relatively
> problem free. This symptom is new for us, only within the past year.
> Our cluster nodes have many separate instances of MySQL running, so
> it is not practical to have that many filesystems, IPs, etc. We are
> content with the way things are, except for this new troubling
> behavior.
>
> If I understand the thread correctly, on-fail=stop will not work
> because the cluster will still try to stop the resources that are
> implied dependencies.
>
> Bottom line is, how do we configure the cluster in such a way that
> there are no cascading circumstances when a MySQL resource fails?
> Basically, if a MySQL resource fails, it fails. We'll deal with that
> on an ad-hoc basis. I don't want the whole cluster to barf. What
> about on-fail=ignore? Earlier, you suggested symmetrical=false might
> also do the trick, but you said it comes with its own can of worms.
> What are the downsides of on-fail=ignore or symmetrical=false?
>
> --Eric

Even adding on-fail=ignore to the recurring monitors may not do what you
want, because I suspect that even an ignored failure will make the node
less preferable for all the other resources. But it's worth testing.

Otherwise, your best option is to remove all the recurring monitors from
the mysql resources, and rely on external monitoring (e.g. nagios,
icinga, monit, ...) to detect problems.
--
Ken Gaillot
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -----Original Message-----
> From: Users On Behalf Of Andrei Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> 17.02.2019 0:44, Eric Robinson wrote:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or drbd
> > resources fail, or if the cluster messaging layer detects a complete
> > node failure. Is there a way to tell Pacemaker not to trigger a
> > cluster failover if any of the p_mysql resources fail?
>
> Let's look at this differently. If all these applications depend on each
> other, you should not be able to stop an individual resource in the
> first place - you need to group them or define a dependency so that
> stopping any resource would stop everything.
>
> If these applications are independent, they should not share resources.
> Each MySQL application should have its own IP and own FS and own block
> device for this FS so that they can be moved between cluster nodes
> independently.
>
> Anything else will lead to troubles, as you already observed.

FYI, the MySQL services do not depend on each other. All of them depend
on the floating IP, which depends on the filesystem, which depends on
DRBD, but they do not depend on each other. Ideally, the failure of
p_mysql_002 should not cause failure of other mysql resources, but now I
understand why it happened. Pacemaker wanted to start it on the other
node, so it needed to move the floating IP, filesystem, and DRBD
primary, which had the cascade effect of stopping the other MySQL
resources.

I think I also understand why the p_vip_clust01 resource blocked.

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but
then Corosync+Pacemaker. The past 12 years have been relatively problem
free. This symptom is new for us, only within the past year. Our cluster
nodes have many separate instances of MySQL running, so it is not
practical to have that many filesystems, IPs, etc. We are content with
the way things are, except for this new troubling behavior.

If I understand the thread correctly, on-fail=stop will not work because
the cluster will still try to stop the resources that are implied
dependencies.

Bottom line is, how do we configure the cluster in such a way that there
are no cascading circumstances when a MySQL resource fails? Basically,
if a MySQL resource fails, it fails. We'll deal with that on an ad-hoc
basis. I don't want the whole cluster to barf. What about on-fail=ignore?
Earlier, you suggested symmetrical=false might also do the trick, but
you said it comes with its own can of worms. What are the downsides of
on-fail=ignore or symmetrical=false?

--Eric
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
> 17.02.2019 0:33, Andrei Borzenkov wrote:
> > 17.02.2019 0:03, Eric Robinson wrote:
> > > Here are the relevant corosync logs.
> > >
> > > It appears that the stop action for resource p_mysql_002 failed,
> > > and that caused a cascading series of service changes. However, I
> > > don't understand why, since no other resources are dependent on
> > > p_mysql_002.
> >
> > You have mandatory colocation constraints for each SQL resource
> > with the VIP. That means that to move an SQL resource to another
> > node, pacemaker must also move the VIP to another node, which in
> > turn means it needs to move all other dependent resources as well.
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
> > ...
> > > Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
>
> There is apparently more to it. Note that the p_vip_clust01 operation
> is "blocked". That is because a mandatory order constraint is
> symmetrical by default, so to move the VIP pacemaker first needs to
> stop it on the current node; but before it can stop the VIP it needs
> to (be able to) stop p_mysql_002; and it cannot do that because, by
> default, when "stop" fails without stonith, the resource is blocked
> and no further actions are possible - i.e. no further stop of the
> resource can even be attempted.

Correct, failed stop actions are special -- an on-fail policy of "stop"
or "restart" requires a stop, so obviously they can't be applied to
failed stops. As you mentioned, without fencing, on-fail defaults to
"block" for stops, which should freeze the resource as it is.

> I still consider this rather questionable behavior. I tried to
> reproduce it and I see the same.
>
> 1. After this happens, resource p_mysql_002 has target=Stopped in the
> CIB. Why, oh why, does pacemaker try to "force away" a resource that
> is not going to be started on another node anyway?

Without having the policy engine inputs, I can't be sure, but I suspect
p_mysql_002 is not being forced away, but its failure causes that node
to be less preferred for the resources it depends on.

> 2. pacemaker knows that it cannot stop (and hence move)
> p_vip_clust01, still it happily will stop all resources that depend
> on it in preparation to move them and leave them at that because it
> cannot move

I think this is the point at which the behavior is undesirable, because
it would be relevant whether the move was related to the blocked
failure or not. Feel free to open a bug report and attach the relevant
policy engine input (or a crm_report).

> them. Resources are neither restarted on the current node, nor moved
> to another node. At this point I'd expect pacemaker to be smart
> enough not even to initiate actions that are known to be
> unsuccessful.
>
> The best we can do at this point is set symmetrical=false, which
> allows the move to actually happen, but it still means downtime for
> the resources that are moved, and it has its own can of worms in the
> normal case.
--
Ken Gaillot
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
17.02.2019 0:44, Eric Robinson wrote:
> Thanks for the feedback, Andrei.
>
> I only want cluster failover to occur if the filesystem or drbd
> resources fail, or if the cluster messaging layer detects a complete
> node failure. Is there a way to tell Pacemaker not to trigger a cluster
> failover if any of the p_mysql resources fail?

Let's look at this differently. If all these applications depend on each
other, you should not be able to stop an individual resource in the
first place - you need to group them or define a dependency so that
stopping any resource would stop everything.

If these applications are independent, they should not share resources.
Each MySQL application should have its own IP and own FS and own block
device for this FS so that they can be moved between cluster nodes
independently.

Anything else will lead to troubles, as you already observed.
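[Editor's note] As a sketch of the layout Andrei suggests, each MySQL instance would get its own DRBD device, filesystem, and IP, grouped so the whole stack moves as a unit. All resource IDs, device paths, and addresses below are hypothetical, and the pcs syntax assumes a Pacemaker 1.1-era cluster like the one in the thread; adapt before use.

```shell
# One self-contained stack per MySQL instance; each group can move
# between nodes independently of the others.
pcs resource create p_drbd_r2 ocf:linbit:drbd drbd_resource=r2 \
    op monitor interval=20s role=Slave op monitor interval=10s role=Master
pcs resource master ms_drbd_r2 p_drbd_r2 \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource create p_fs_002 ocf:heartbeat:Filesystem \
    device=/dev/drbd2 directory=/mysql/002 fstype=ext4
pcs resource create p_vip_002 ocf:heartbeat:IPaddr2 ip=10.0.0.102 cidr_netmask=24
pcs resource create p_mysql_002 lsb:mysql_002
pcs resource group add g_mysql_002 p_fs_002 p_vip_002 p_mysql_002
# Tie the group to wherever this instance's DRBD is promoted
pcs constraint colocation add g_mysql_002 with master ms_drbd_r2 INFINITY
pcs constraint order promote ms_drbd_r2 then start g_mysql_002
```

The trade-off, as Eric points out later in the thread, is that with many MySQL instances this multiplies the number of DRBD devices, filesystems, and IPs to manage.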
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
17.02.2019 0:33, Andrei Borzenkov wrote:
> 17.02.2019 0:03, Eric Robinson wrote:
>> Here are the relevant corosync logs.
>>
>> It appears that the stop action for resource p_mysql_002 failed, and
>> that caused a cascading series of service changes. However, I don't
>> understand why, since no other resources are dependent on
>> p_mysql_002.
>
> You have mandatory colocation constraints for each SQL resource with
> the VIP. That means that to move an SQL resource to another node,
> pacemaker must also move the VIP to another node, which in turn means
> it needs to move all other dependent resources as well.
> ...
>> Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
> ...
>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
> ...
>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01

There is apparently more to it. Note that the p_vip_clust01 operation is
"blocked". That is because a mandatory order constraint is symmetrical
by default, so to move the VIP pacemaker first needs to stop it on the
current node; but before it can stop the VIP it needs to (be able to)
stop p_mysql_002; and it cannot do that because, by default, when "stop"
fails without stonith, the resource is blocked and no further actions
are possible - i.e. no further stop of the resource can even be
attempted.

I still consider this rather questionable behavior. I tried to reproduce
it and I see the same.

1. After this happens, resource p_mysql_002 has target=Stopped in the
CIB. Why, oh why, does pacemaker try to "force away" a resource that is
not going to be started on another node anyway?

2. pacemaker knows that it cannot stop (and hence move) p_vip_clust01,
still it happily will stop all resources that depend on it in
preparation to move them, and leave them at that because it cannot move
them. Resources are neither restarted on the current node, nor moved to
another node. At this point I'd expect pacemaker to be smart enough not
even to initiate actions that are known to be unsuccessful.

The best we can do at this point is set symmetrical=false, which allows
the move to actually happen, but it still means downtime for the
resources that are moved, and it has its own can of worms in the normal
case.
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 10:23:17PM +0000, Eric Robinson wrote:
> I'm looking through the docs but I don't see how to set the on-fail
> value for a resource.

It is not set on the resource itself but on each of the actions
(monitor, start, stop).
--
Valentin
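[Editor's note] In CIB XML terms this looks roughly like the fragment below (IDs, class, and values are illustrative, not taken from the thread): on-fail lives on each `<op>` element, not on the primitive.

```xml
<primitive id="p_mysql_002" class="lsb" type="mysql_002">
  <operations>
    <!-- a failed monitor is ignored -->
    <op id="p_mysql_002-monitor-30s" name="monitor" interval="30s"
        timeout="20s" on-fail="ignore"/>
    <!-- a failed stop still defaults to on-fail="block" without stonith -->
    <op id="p_mysql_002-stop-0s" name="stop" interval="0s" timeout="45s"/>
  </operations>
</primitive>
```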
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
17.02.2019 0:44, Eric Robinson пишет: > Thanks for the feedback, Andrei. > > I only want cluster failover to occur if the filesystem or drbd resources > fail, or if the cluster messaging layer detects a complete node failure. Is > there a way to tell PaceMaker not to trigger a cluster failover if any of the > p_mysql resources fail? > The closest you can get is disabling monitor recurring action. In this case pacemaker will effectively ignore any resource state change. Unfortunately this also means your resource agent must now correctly handle requests in the wrong state - i.e. it must be able to stop resource that had already failed earlier without returning error to pacemaker. You may set resource to "unmanaged", but this will also prevent pacemaker from starting/stopping your resource at all. As compromise you may set "unmanaged" after resource has been started and unset before stopping it, but then you have exactly the same issue - if resource has failed, as soon as you manage it again pacemaker will trigger corresponding action. Pacemaker design is different from any other cluster resources monitor I have seen. Pacemaker is designed to maintain target resource state at any cost. Pacemaker does not have notion of "important" or "unimportant" resources at all. Even playing with scores won't help because failed resource outweighs everything else with -INFINITY score thus pushing everything dependent away from its current node. In this particular case it may be argued that pacemaker reaction is unjustified. Administrator explicitly set target state to "stop" (otherwise pacemaker would not attempt to stop it) so it is unclear why it tries to restart it on other node. >> -Original Message- >> From: Users On Behalf Of Andrei >> Borzenkov >> Sent: Saturday, February 16, 2019 1:34 PM >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One >> Fails? 
>>
>> 17.02.2019 0:03, Eric Robinson wrote:
>>> Here are the relevant corosync logs.
>>>
>>> It appears that the stop action for resource p_mysql_002 failed, and that
>>> caused a cascading series of service changes. However, I don't understand
>>> why, since no other resources are dependent on p_mysql_002.
>>
>> You have mandatory colocation constraints for each SQL resource with the VIP.
>> It means that to move an SQL resource to another node pacemaker also must
>> move the VIP to another node, which in turn means it needs to move all other
>> dependent resources as well.
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
>> ...
>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
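[For reference, the two workarounds Andrei describes can be sketched with pcs; the commands are assumed to match a pcs 0.9.x CLI, and p_mysql_002 is used purely as an example from this thread.]

```shell
# Workaround 1: drop the recurring monitor so pacemaker never notices a
# failure (the init script must then tolerate "stop" of a dead service)
pcs resource op remove p_mysql_002 monitor

# Workaround 2: mark the resource unmanaged -- pacemaker will not start,
# stop, or recover it until it is managed again
pcs resource unmanage p_mysql_002
pcs resource manage p_mysql_002   # re-enable management later; a failed
                                  # state is acted on immediately
```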
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
I'm looking through the docs but I don't see how to set the on-fail value for a resource.

> -----Original Message-----
> From: Users On Behalf Of Eric Robinson
> Sent: Saturday, February 16, 2019 1:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> > On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> > > I just noticed that. I also noticed that the lsb init script has a
> > > hard-coded stop timeout of 30 seconds. So if the init script waits
> > > longer than the cluster resource timeout of 15s, that would cause the
> >
> > Yes, you should use higher timeouts in pacemaker (45s for example).
> >
> > > resource to fail. However, I don't want cluster failover to be
> > > triggered by the failure of one of the MySQL resources. I only want
> > > cluster failover to occur if the filesystem or drbd resources fail,
> > > or if the cluster messaging layer detects a complete node failure.
> > > Is there a way to tell Pacemaker not to trigger cluster failover if
> > > any of the p_mysql resources fail?
> >
> > You can try playing with the on-fail option but I'm not sure how
> > reliably this whole setup will work without some form of fencing/stonith.
> >
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
>
> Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what I'm
> looking for, at least for the MySQL resources.
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> > I just noticed that. I also noticed that the lsb init script has a
> > hard-coded stop timeout of 30 seconds. So if the init script waits
> > longer than the cluster resource timeout of 15s, that would cause the
>
> Yes, you should use higher timeouts in pacemaker (45s for example).
>
> > resource to fail. However, I don't want cluster failover to be
> > triggered by the failure of one of the MySQL resources. I only want
> > cluster failover to occur if the filesystem or drbd resources fail, or
> > if the cluster messaging layer detects a complete node failure. Is
> > there a way to tell Pacemaker not to trigger cluster failover if any
> > of the p_mysql resources fail?
>
> You can try playing with the on-fail option but I'm not sure how reliably this
> whole setup will work without some form of fencing/stonith.
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

Thanks for the tip. It looks like on-fail=ignore or on-fail=stop may be what I'm looking for, at least for the MySQL resources.

> --
> Valentin
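[A hedged sketch of that tip: on-fail is a property of each operation rather than of the resource, so with pcs it would look roughly like the following. The pcs syntax is assumed for a 0.9.x CLI; the interval/timeout values are copied from this cluster's config, and the choice of p_mysql_002 is illustrative.]

```shell
# Ignore monitor failures for one MySQL instance: pacemaker records the
# failure but takes no recovery action
pcs resource update p_mysql_002 op monitor interval=15 timeout=15 on-fail=ignore

# Or stop the failed resource in place instead of migrating the whole stack
pcs resource update p_mysql_002 op monitor interval=15 timeout=15 on-fail=stop
```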
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Thanks for the feedback, Andrei.

I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell Pacemaker not to trigger a cluster failover if any of the p_mysql resources fail?

> -----Original Message-----
> From: Users On Behalf Of Andrei Borzenkov
> Sent: Saturday, February 16, 2019 1:34 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> 17.02.2019 0:03, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and that
> > caused a cascading series of service changes. However, I don't understand
> > why, since no other resources are dependent on p_mysql_002.
>
> You have mandatory colocation constraints for each SQL resource with VIP. It
> means that to move SQL resource to another node pacemaker also must
> move VIP to another node which in turn means it needs to move all other
> dependent resources as well.
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
> ...
> > Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote:
> I just noticed that. I also noticed that the lsb init script has a
> hard-coded stop timeout of 30 seconds. So if the init script waits
> longer than the cluster resource timeout of 15s, that would cause the

Yes, you should use higher timeouts in pacemaker (45s for example).

> resource to fail. However, I don't want cluster failover to be
> triggered by the failure of one of the MySQL resources. I only want
> cluster failover to occur if the filesystem or drbd resources fail, or
> if the cluster messaging layer detects a complete node failure. Is
> there a way to tell Pacemaker not to trigger cluster failover if any
> of the p_mysql resources fail?

You can try playing with the on-fail option but I'm not sure how reliably this whole setup will work without some form of fencing/stonith.

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html

--
Valentin
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
17.02.2019 0:03, Eric Robinson wrote:
> Here are the relevant corosync logs.
>
> It appears that the stop action for resource p_mysql_002 failed, and that
> caused a cascading series of service changes. However, I don't understand
> why, since no other resources are dependent on p_mysql_002.

You have mandatory colocation constraints for each SQL resource with the VIP. It means that to move an SQL resource to another node pacemaker also must move the VIP to another node, which in turn means it needs to move all other dependent resources as well.
...
> Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold: Forcing p_mysql_002 away from 001db01a after 100 failures (max=100)
...
> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_vip_clust01 ( 001db01a ) blocked
...
> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction: * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
> -----Original Message-----
> From: Users On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 1:28 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> > Here are the relevant corosync logs.
> >
> > It appears that the stop action for resource p_mysql_002 failed, and
> > that caused a cascading series of service changes. However, I don't
> > understand why, since no other resources are dependent on p_mysql_002.
>
> The stop failed because of a timeout (15s), so you can try to update that
> value:

I just noticed that. I also noticed that the lsb init script has a hard-coded stop timeout of 30 seconds. So if the init script waits longer than the cluster resource timeout of 15s, that would cause the resource to fail. However, I don't want cluster failover to be triggered by the failure of one of the MySQL resources. I only want cluster failover to occur if the filesystem or drbd resources fail, or if the cluster messaging layer detects a complete node failure. Is there a way to tell Pacemaker not to trigger cluster failover if any of the p_mysql resources fail?

> Result of stop operation for p_mysql_002 on 001db01a: Timed Out | call=1094 key=p_mysql_002_stop_0 timeout=15000ms
>
> After the stop failed it should have fenced that node, but you don't have
> fencing configured so it tries to move mysql_002 and all the other resources
> related to it (vip, fs, drbd) to the other node.
> Since other mysql resources depend on the same (vip, fs, drbd) they need to
> be stopped first.
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote:
> Here are the relevant corosync logs.
>
> It appears that the stop action for resource p_mysql_002 failed, and
> that caused a cascading series of service changes. However, I don't
> understand why, since no other resources are dependent on p_mysql_002.

The stop failed because of a timeout (15s), so you can try to update that value:

Result of stop operation for p_mysql_002 on 001db01a: Timed Out | call=1094 key=p_mysql_002_stop_0 timeout=15000ms

After the stop failed it should have fenced that node, but you don't have fencing configured so it tries to move mysql_002 and all the other resources related to it (vip, fs, drbd) to the other node. Since other mysql resources depend on the same (vip, fs, drbd) they need to be stopped first.

--
Valentin
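[Acting on the timeout suggestion would look roughly like this; the pcs syntax is assumed for a 0.9.x CLI, and 45s is the value proposed elsewhere in the thread.]

```shell
# The lsb script waits up to 30s on stop, so give pacemaker more headroom
# than the current 15s before it declares the stop failed
pcs resource update p_mysql_002 op stop interval=0s timeout=45s
```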
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 08:50:57PM +, Eric Robinson wrote:
> Which logs? You mean /var/log/cluster/corosync.log?

On the DC node pacemaker will be logging the actions it is trying to run (start or stop some resources).

> But even if the stop action is resulting in an error, why would the
> cluster also try to stop the other services which are not dependent?

When the resource is failed, pacemaker might still try to run stop for that resource. If the lsb script is not correct that might also stop other mysql resources. But this should all be reported in the pacemaker log.

--
Valentin
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Here are the relevant corosync logs.

It appears that the stop action for resource p_mysql_002 failed, and that caused a cascading series of service changes. However, I don't understand why, since no other resources are dependent on p_mysql_002.

[root@001db01a cluster]# cat corosync_filtered.log
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: Diff: --- 0.345.30 2
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: Diff: +++ 0.346.0 cc0da1b030418ec8b7c72db1115e2af1
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: + /cib: @epoch=346, @num_updates=0
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++ /cib/configuration/resources/primitive[@id='p_mysql_002']:
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_perform_op: ++
Feb 16 14:06:24 [3908] 001db01a cib: info: cib_process_request: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=001db01a/cibadmin/2, version=0.346.0)
Feb 16 14:06:24 [3913] 001db01a crmd: info: abort_transition_graph: Transition aborted by meta_attributes.p_mysql_002-meta_attributes 'create': Configuration change | cib=0.346.0 source=te_update_diff:456 path=/cib/configuration/resources/primitive[@id='p_mysql_002'] complete=true
Feb 16 14:06:24 [3913] 001db01a crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Feb 16 14:06:24 [3912] 001db01a pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_online_status: Node 001db01b is online
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_online_status: Node 001db01a is online
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd0:0 active in master mode on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd1:0 active on 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_004 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_005 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd0:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_drbd1:1 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_001 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: determine_op_status: Operation monitor found resource p_mysql_002 active on 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 2 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: unpack_node_loop: Node 1 is already processed
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: clone_print: Master/Slave Set: ms_drbd0 [p_drbd0]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Masters: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Slaves: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: clone_print: Master/Slave Set: ms_drbd1 [p_drbd1]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Masters: [ 001db01b ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: short_print: Slaves: [ 001db01a ]
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_fs_clust01 (ocf::heartbeat:Filesystem): Started 001db01a
Feb 16 14:06:24 [3912] 001db01a pengine: info: common_print: p_fs_clust02 (ocf::heartbeat:Filesystem): Started 001db01b
Feb 16 14:06:24 [3912] 001db01a pengine:
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Hi Valentin --

Which logs? You mean /var/log/cluster/corosync.log?

But even if the stop action is resulting in an error, why would the cluster also try to stop the other services which are not dependent?

> -----Original Message-----
> From: Users On Behalf Of Valentin Vidic
> Sent: Saturday, February 16, 2019 12:44 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
>
> On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> > Why is it that when one of the resources that start with p_mysql_*
> > goes into a FAILED state, all the other MySQL services also stop?
>
> Perhaps stop is not working correctly for these lsb services, so for example
> stopping lsb:mysql_004 also stops the other lsb:mysql_nnn.
>
> You would need to send the logs from the event to confirm this.
>
> --
> Valentin
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote:
> Why is it that when one of the resources that start with p_mysql_*
> goes into a FAILED state, all the other MySQL services also stop?

Perhaps stop is not working correctly for these lsb services, so for example stopping lsb:mysql_004 also stops the other lsb:mysql_nnn.

You would need to send the logs from the event to confirm this.

--
Valentin
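[One way to check this theory is to exercise the init script by hand and compare its exit codes against what pacemaker expects from an LSB script. A manual sketch; mysql_004 stands in for any of the instances, and the expected codes follow the LSB convention (status returns 3 for a stopped service, stop succeeds even when already stopped).]

```shell
# An LSB-compatible script must report status correctly and must succeed
# when asked to stop an already-stopped service
/etc/init.d/mysql_004 status; echo "status while running: $?"   # expect 0
/etc/init.d/mysql_004 stop;   echo "stop: $?"                   # expect 0
/etc/init.d/mysql_004 status; echo "status while stopped: $?"   # expect 3
/etc/init.d/mysql_004 stop;   echo "second stop: $?"            # expect 0
# While doing this, verify the other mysql_nnn instances stay up
```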
Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?
Resource: p_mysql_004 (class=lsb type=mysql_004)
  Operations: force-reload interval=0s timeout=15 (p_mysql_004-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_004-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_004-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_004-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_004-stop-interval-0s)
Resource: p_mysql_005 (class=lsb type=mysql_005)
  Operations: force-reload interval=0s timeout=15 (p_mysql_005-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_005-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_005-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_005-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_005-stop-interval-0s)
Resource: p_mysql_006 (class=lsb type=mysql_006)
  Operations: force-reload interval=0s timeout=15 (p_mysql_006-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_006-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_006-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_006-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_006-stop-interval-0s)
Resource: p_mysql_007 (class=lsb type=mysql_007)
  Operations: force-reload interval=0s timeout=15 (p_mysql_007-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_007-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_007-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_007-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_007-stop-interval-0s)
Resource: p_mysql_008 (class=lsb type=mysql_008)
  Operations: force-reload interval=0s timeout=15 (p_mysql_008-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_008-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_008-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_008-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_008-stop-interval-0s)
Resource: p_mysql_622 (class=lsb type=mysql_622)
  Operations: force-reload interval=0s timeout=15 (p_mysql_622-force-reload-interval-0s)
              monitor interval=15 timeout=15 (p_mysql_622-monitor-interval-15)
              restart interval=0s timeout=15 (p_mysql_622-restart-interval-0s)
              start interval=0s timeout=15 (p_mysql_622-start-interval-0s)
              stop interval=0s timeout=15 (p_mysql_622-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
  Resource: p_vip_clust02
    Enabled on: 001db01b (score:INFINITY) (role: Started) (id:cli-prefer-p_vip_clust02)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:
Alerts:
  No alerts defined
Resources Defaults:
  resource-stickiness: 100
Operations Defaults:
  No defaults set
Cluster Properties:
  cluster-infrastructure: corosync
  cluster-name: 001db01ab
  dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
  have-watchdog: false
  last-lrm-refresh: 1550347798
  maintenance-mode: false
  no-quorum-policy: ignore
  stonith-enabled: false

--Eric

From: Users On Behalf Of Eric Robinson
Sent: Saturday, February 16, 2019 12:34 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [ClusterLabs] Why Do All The Services Go
[ClusterLabs] Why Do All The Services Go Down When Just One Fails?
These are the resources on our cluster.

[root@001db01a ~]# pcs status
Cluster name: 001db01ab
Stack: corosync
Current DC: 001db01a (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Feb 16 15:24:55 2019
Last change: Sat Feb 16 15:10:21 2019 by root via cibadmin on 001db01b

2 nodes configured
18 resources configured

Online: [ 001db01a 001db01b ]

Full list of resources:

 p_vip_clust01 (ocf::heartbeat:IPaddr2): Started 001db01a
 Master/Slave Set: ms_drbd0 [p_drbd0]
     Masters: [ 001db01a ]
     Slaves: [ 001db01b ]
 Master/Slave Set: ms_drbd1 [p_drbd1]
     Masters: [ 001db01b ]
     Slaves: [ 001db01a ]
 p_fs_clust01 (ocf::heartbeat:Filesystem): Started 001db01a
 p_fs_clust02 (ocf::heartbeat:Filesystem): Started 001db01b
 p_vip_clust02 (ocf::heartbeat:IPaddr2): Started 001db01b
 p_mysql_001 (lsb:mysql_001): Started 001db01a
 p_mysql_000 (lsb:mysql_000): Started 001db01a
 p_mysql_002 (lsb:mysql_002): Started 001db01a
 p_mysql_003 (lsb:mysql_003): Started 001db01a
 p_mysql_004 (lsb:mysql_004): Started 001db01a
 p_mysql_005 (lsb:mysql_005): Started 001db01a
 p_mysql_006 (lsb:mysql_006): Started 001db01b
 p_mysql_007 (lsb:mysql_007): Started 001db01b
 p_mysql_008 (lsb:mysql_008): Started 001db01b
 p_mysql_622 (lsb:mysql_622): Started 001db01a

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

Why is it that when one of the resources that start with p_mysql_* goes into a FAILED state, all the other MySQL services also stop?
[root@001db01a ~]# pcs constraint
Location Constraints:
  Resource: p_vip_clust02
    Enabled on: 001db01b (score:INFINITY) (role: Started)
Ordering Constraints:
  promote ms_drbd0 then start p_fs_clust01 (kind:Mandatory)
  promote ms_drbd1 then start p_fs_clust02 (kind:Mandatory)
  start p_fs_clust01 then start p_vip_clust01 (kind:Mandatory)
  start p_fs_clust02 then start p_vip_clust02 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_001 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_002 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_003 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_004 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_005 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_006 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_007 (kind:Mandatory)
  start p_vip_clust02 then start p_mysql_008 (kind:Mandatory)
  start p_vip_clust01 then start p_mysql_622 (kind:Mandatory)
Colocation Constraints:
  p_fs_clust01 with ms_drbd0 (score:INFINITY) (with-rsc-role:Master)
  p_fs_clust02 with ms_drbd1 (score:INFINITY) (with-rsc-role:Master)
  p_vip_clust01 with p_fs_clust01 (score:INFINITY)
  p_vip_clust02 with p_fs_clust02 (score:INFINITY)
  p_mysql_001 with p_vip_clust01 (score:INFINITY)
  p_mysql_000 with p_vip_clust01 (score:INFINITY)
  p_mysql_002 with p_vip_clust01 (score:INFINITY)
  p_mysql_003 with p_vip_clust01 (score:INFINITY)
  p_mysql_004 with p_vip_clust01 (score:INFINITY)
  p_mysql_005 with p_vip_clust01 (score:INFINITY)
  p_mysql_006 with p_vip_clust02 (score:INFINITY)
  p_mysql_007 with p_vip_clust02 (score:INFINITY)
  p_mysql_008 with p_vip_clust02 (score:INFINITY)
  p_mysql_622 with p_vip_clust01 (score:INFINITY)
Ticket Constraints:

--Eric
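[Editorial note, not from the thread: given the INFINITY colocations shown above, one direction sometimes discussed for this symptom is replacing a mandatory colocation with a finite score, so that a failed instance (whose node receives -INFINITY for that resource) cannot force the VIP away. A hedged sketch; the constraint id placeholder must be taken from `pcs constraint --full`, the score of 1000 is an arbitrary illustrative value, and pcs 0.9.x syntax is assumed.]

```shell
# Find the id of the existing INFINITY colocation for p_mysql_002
pcs constraint --full

# Replace it with an advisory (finite-score) colocation
pcs constraint remove <id-of-p_mysql_002-colocation>
pcs constraint colocation add p_mysql_002 with p_vip_clust01 1000
```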