> -----Original Message-----
> From: Users <users-boun...@clusterlabs.org> On Behalf Of Ulrich Windl
> Sent: Tuesday, February 19, 2019 11:35 PM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Why Do All The Services Go Down When
> Just One Fails?
>
> >>> Eric Robinson <eric.robin...@psmnv.com> wrote on 19.02.2019 at 21:06
> >>> in message
> <MN2PR03MB4845BE22FADA30B472174B79FA7C0@MN2PR03MB4845.namprd03.prod.outlook.com>:
>
> >> -----Original Message-----
> >> From: Users <users-boun...@clusterlabs.org> On Behalf Of Ken Gaillot
> >> Sent: Tuesday, February 19, 2019 10:31 AM
> >> To: Cluster Labs - All topics related to open-source clustering
> >> welcomed <users@clusterlabs.org>
> >> Subject: Re: [ClusterLabs] Why Do All The Services Go Down When Just
> >> One Fails?
> >>
> >> On Tue, 2019-02-19 at 17:40 +0000, Eric Robinson wrote:
> >> > > -----Original Message-----
> >> > > From: Users <users-boun...@clusterlabs.org> On Behalf Of Andrei
> >> > > Borzenkov
> >> > > Sent: Sunday, February 17, 2019 11:56 AM
> >> > > To: users@clusterlabs.org
> >> > > Subject: Re: [ClusterLabs] Why Do All The Services Go Down When
> >> > > Just One Fails?
> >> > >
> >> > > On 17.02.2019 0:44, Eric Robinson wrote:
> >> > > > Thanks for the feedback, Andrei.
> >> > > >
> >> > > > I only want cluster failover to occur if the filesystem or drbd
> >> > > > resources fail, or if the cluster messaging layer detects a
> >> > > > complete node failure. Is there a way to tell Pacemaker not to
> >> > > > trigger a cluster failover if any of the p_mysql resources fail?
> >> > >
> >> > > Let's look at this differently. If all these applications depend
> >> > > on each other, you should not be able to stop an individual
> >> > > resource in the first place - you need to group them or define
> >> > > dependencies so that stopping any resource would stop everything.
> >> > >
> >> > > If these applications are independent, they should not share
> >> > > resources. Each MySQL application should have its own IP, its own
> >> > > FS, and its own block device for that FS so that they can be moved
> >> > > between cluster nodes independently.
> >> > >
> >> > > Anything else will lead to trouble, as you already observed.
> >> >
> >> > FYI, the MySQL services do not depend on each other. All of them
> >> > depend on the floating IP, which depends on the filesystem, which
> >> > depends on DRBD, but they do not depend on each other. Ideally, the
> >> > failure of p_mysql_002 should not cause failure of other mysql
> >> > resources, but now I understand why it happened. Pacemaker wanted
> >> > to start it on the other node, so it needed to move the floating
> >> > IP, filesystem, and DRBD primary, which had the cascade effect of
> >> > stopping the other MySQL resources.
> >> >
> >> > I think I also understand why the p_vip_clust01 resource blocked.
> >> >
> >> > FWIW, we've been using Linux HA since 2006, originally Heartbeat,
> >> > but then Corosync+Pacemaker. The past 12 years have been relatively
> >> > problem free. This symptom is new for us, only within the past year.
> >> > Our cluster nodes have many separate instances of MySQL running, so
> >> > it is not practical to have that many filesystems, IPs, etc. We are
> >> > content with the way things are, except for this new troubling
> >> > behavior.
> >> >
> >> > If I understand the thread correctly, on-fail=stop will not work
> >> > because the cluster will still try to stop the resources that are
> >> > implied dependencies.
> >> >
> >> > Bottom line is, how do we configure the cluster in such a way that
> >> > there are no cascading effects when a MySQL resource fails?
> >> > Basically, if a MySQL resource fails, it fails. We'll deal with
> >> > that on an ad-hoc basis. I don't want the whole cluster to barf.
> >> > What about on-fail=ignore? Earlier, you suggested symmetrical=false
> >> > might also do the trick, but you said it comes with its own can of
> >> > worms. What are the downsides of on-fail=ignore or symmetrical=false?
> >> >
> >> > --Eric
> >>
> >> Even adding on-fail=ignore to the recurring monitors may not do what
> >> you want, because I suspect that even an ignored failure will make
> >> the node less preferable for all the other resources. But it's worth
> >> testing.
> >>
> >> Otherwise, your best option is to remove all the recurring monitors
> >> from the mysql resources and rely on external monitoring (e.g. nagios,
> >> icinga, monit, ...) to detect problems.
> >
> > This is probably a dumb question, but can we remove just the monitor
> > operation but leave the resource configured in the cluster? If a node
> > fails over, we do want the resources to start automatically on the new
> > primary node.
>
> Actually I wonder whether this makes sense at all: IMHO a cluster ensures
> that the phone does not ring at night to make me perform some recovery
> operations after a failure. Once you move to manual start and stop of
> resources, I fail to see the reason for a cluster.
>
> When done well, independent resources should be configured (and managed)
> independently; otherwise they are dependent. There is no "middle way".
>
> Regards,
> Ulrich

The following should display correctly in a fixed-width font like Consolas.
This setup is supposed to be possible, and is even referenced in the
ClusterLabs documentation.

+--------------+
|   mysql001   +--+
+--------------+  |
+--------------+  |
|   mysql002   +--+
+--------------+  |
+--------------+  |   +-------------+   +------------+   +----------+
|   mysql003   +----->+ floating ip +-->+ filesystem +-->+ blockdev |
+--------------+  |   +-------------+   +------------+   +----------+
+--------------+  |
|   mysql004   +--+
+--------------+  |
+--------------+  |
|   mysql005   +--+
+--------------+

In the layout above, the MySQL instances depend on the same underlying
service stack, but they do not depend on each other. Therefore, as I
understand it, the failure of one MySQL instance should not cause the
failure of the other MySQL instances if on-fail=ignore or on-fail=stop is
set. At least, that's the way it seems to me, but based on this thread, I
guess it does not behave that way.
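For anyone who wants to experiment, here is roughly what Ken's two
suggestions would look like as pcs commands. This is an untested sketch: it
assumes pcs is in use (crmsh has equivalents) and that the existing monitor
runs at interval=30s -- adjust the resource names and intervals to match
the actual CIB.

    # Option 1: keep the monitor but ignore its failures, so a failed
    # instance no longer triggers recovery. As Ken notes, an ignored
    # failure may still lower the node's score for the other resources,
    # so test this before relying on it.
    pcs resource update p_mysql_002 op monitor interval=30s on-fail=ignore

    # Option 2: remove the recurring monitor entirely and detect problems
    # with external monitoring (nagios, icinga, monit, ...). The resource
    # stays configured in the cluster, so it is still started automatically
    # after a failover; Pacemaker simply stops polling its health.
    # (The operation properties given here must match the configured op.)
    pcs resource op remove p_mysql_002 monitor interval=30s

For reference, the dependency chain in the diagram above would normally be
expressed as a colocation plus an ordering constraint, one pair per MySQL
instance, e.g. (resource names taken from earlier in the thread):

    pcs constraint colocation add p_mysql_003 with p_vip_clust01 INFINITY
    pcs constraint order p_vip_clust01 then p_mysql_003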
> >> --
> >> Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org