Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Eric Robinson Tue, 19 Feb 2019 09:41:14 -0800

> -----Original Message-----
> From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei
> Borzenkov
> Sent: Sunday, February 17, 2019 11:56 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just One
> Fails?
> 
> 17.02.2019 0:44, Eric Robinson пишет:
> > Thanks for the feedback, Andrei.
> >
> > I only want cluster failover to occur if the filesystem or drbd resources 
> > fail,
> or if the cluster messaging layer detects a complete node failure. Is there a
> way to tell PaceMaker not to trigger a cluster failover if any of the p_mysql
> resources fail?
> >
> 
> Let's look at this differently. If all these applications depend on each 
> other,
> you should not be able to stop individual resource in the first place - you
> need to group them or define dependency so that stopping any resource
> would stop everything.
> 
> If these applications are independent, they should not share resources.
> Each MySQL application should have own IP and own FS and own block
> device for this FS so that they can be moved between cluster nodes
> independently.
> 
> Anything else will lead to troubles as you already observed.


FYI, the MySQL services do not depend on each other. All of them depend on the 
floating IP, which depends on the filesystem, which depends on DRBD, but they 
do not depend on each other. Ideally, the failure of p_mysql_002 should not 
cause failure of other mysql resources, but now I understand why it happened. 
Pacemaker wanted to start it on the other node, so it needed to move the 
floating IP, filesystem, and DRBD primary, which had the cascade effect of 
stopping the other MySQL resources.

I think I also understand why the p_vip_clust01 resource blocked. 

FWIW, we've been using Linux HA since 2006, originally Heartbeat, but then 
Corosync+Pacemaker. The past 12 years have been relatively problem free. This 
symptom is new for us, only within the past year. Our cluster nodes have many 
separate instances of MySQL running, so it is not practical to have that many 
filesystems, IPs, etc. We are content with the way things are, except for this 
new troubling behavior.

If I understand the thread correctly, op-fail=stop will not work because the 
cluster will still try to stop the resources that are implied dependencies.

Bottom line is, how do we configure the cluster in such a way that there are no 
cascading circumstances when a MySQL resource fails? Basically, if a MySQL 
resource fails, it fails. We'll deal with that on an ad-hoc basis. I don't want 
the whole cluster to barf. What about op-fail=ignore? Earlier, you suggested 
symmetrical=false might also do the trick, but you said it comes with its own 
can or worms. What are the downsides with op-fail=ignore or asymmetrical=false?

--Eric






> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

Reply via email to