On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> So I think I found the problem. The two resources are named forwarder
> and bgpforwarder. It doesn't matter whether bgpforwarder exists: when
> I set the failcount to INFINITY for a resource named bgpforwarder
> (crm_failcount -r bgpforwarder -v INFINITY), it directly affects the
> forwarder resource.
>
> If I change the name to forwarderbgp, the problem disappears. So it
> seems that Pacemaker mixes up the bgpforwarder and forwarder names.
> Is it a bug?
>
> Gerard
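For anyone trying to reproduce or debug this, the fail counts involved can be set, inspected, and reset directly. A minimal sketch, assuming the resource names from this thread and a hypothetical node name node1 (exact attribute naming varies by Pacemaker version):

```shell
# Set the fail count, as in the report above:
crm_failcount -r bgpforwarder -v INFINITY

# One-shot cluster status, including fail counts:
crm_mon -1 --failcounts

# Fail counts are transient node attributes; each resource should have
# its own. (Attribute naming changed around Pacemaker 1.1.17, where
# fail counts became per-operation, e.g. fail-count-rsc#monitor_10000.)
crm_attribute -t status -N node1 -n fail-count-forwarder -G
crm_attribute -t status -N node1 -n fail-count-bgpforwarder -G

# Clear the fail count and failed-operation history for one resource:
crm_resource --cleanup -r bgpforwarder
```

If the two `crm_attribute` queries above show separate attributes for the two resources, the CIB at least is keeping them apart, which would point the finger at how the values are read back rather than how they are stored.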
That's really surprising. What version of pacemaker are you using?
There were a lot of changes in fail count handling in the last few
releases.

> On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <ger...@talaia.io>
> wrote:
> > That makes sense. I've tried copying the anything resource and
> > changing its name and id (which I guess should be enough to make
> > pacemaker think they are different), but I still have the same
> > problem.
> >
> > After more debugging I have reduced the problem to this:
> > * First cloned resource running fine.
> > * Second cloned resource running fine.
> > * Manually set the failcount of the second cloned resource to
> >   INFINITY.
> > * Pacemaker triggers a stop operation (without the monitor
> >   operation failing) for both resources on the node where the
> >   failcount has been set to INFINITY.
> > * Resetting the failcount starts the two resources again.
> >
> > Weirdly enough, the second resource doesn't stop if I set the
> > first resource's failcount to INFINITY (not even the first
> > resource stops...).
> >
> > But:
> > * If I set the first resource as globally-unique=true it does not
> >   stop, so somehow this breaks the relation.
> > * If I manually set the failcount of the first resource to 0, that
> >   also breaks the relation, so it does not stop either. It seems
> >   like the failcount value is being inherited from the second
> >   resource when the first does not have any value.
> >
> > I must have misconfigured something, but I can't really see why
> > this relationship exists...
> >
> > Gerard
> >
> > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgail...@redhat.com>
> > wrote:
> > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > Thanks Ken. Yes, inspecting the logs, it seems that the
> > > > failcount of the correctly running resource reaches the
> > > > maximum number of allowed failures and it gets banned on all
> > > > nodes.
> > > > What is weird is that I just see how the failcount for the
> > > > first resource gets updated; it is like the failcounts are
> > > > being mixed. In fact, when the two resources get banned, the
> > > > only way I have to make the first one start is to disable the
> > > > failing one and clean the failcount of both resources (it is
> > > > not enough to only clean the failcount of the first resource).
> > > > Does it make sense?
> > > >
> > > > Gerard
> > >
> > > My suspicion is that you have two instances of the same service,
> > > and the resource agent monitor is only checking the general
> > > service, rather than a specific instance of it, so the monitors
> > > on both of them return failure if either one is failing.
> > >
> > > That would explain why you have to disable the failing resource,
> > > so its monitor stops running. I can't think of why you'd have to
> > > clean its failcount for the other one to start, though.
> > >
> > > The "anything" agent very often causes more problems than it
> > > solves... I'd recommend writing your own OCF agent tailored to
> > > your service. It's not much more complicated than an init script.
> > >
> > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot
> > > > <kgaillot@redhat.com> wrote:
> > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a cluster with two ocf:heartbeat:anything resources,
> > > > > > each one running as a clone on all nodes of the cluster.
> > > > > > For some reason, when one of them fails to start, the other
> > > > > > one stops. There is not any constraint configured or any
> > > > > > kind of relation between them.
> > > > > >
> > > > > > Is it possible that there is some kind of implicit relation
> > > > > > that I'm not aware of (for example because they are the
> > > > > > same type)?
> > > > > > Thanks,
> > > > > >
> > > > > > Gerard
> > > > >
> > > > > There is no implicit relation on the Pacemaker side. However,
> > > > > if the agent returns "failed" for both resources when either
> > > > > one fails, you could see something like that. I'd look at the
> > > > > logs on the DC and see why it decided to restart the second
> > > > > resource.
> > > > > -- 
> > > > > Ken Gaillot <kgail...@redhat.com>
> > > > >
> > > > > _______________________________________________
> > > > > Users mailing list: Users@clusterlabs.org
> > > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > > >
> > > > > Project Home: http://www.clusterlabs.org
> > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > Bugs: http://bugs.clusterlabs.org
> > > -- 
> > > Ken Gaillot <kgail...@redhat.com>
-- 
Ken Gaillot <kgail...@redhat.com>
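Ken's suggestion of writing a tailored OCF agent rather than using ocf:heartbeat:anything can be sketched as below. This is an illustration only, not a drop-in agent: the "binary" parameter and its default path are made up, and real agents need fuller metadata and error handling. The relevant point for this thread is that monitor checks a pidfile keyed on OCF_RESOURCE_INSTANCE, i.e. this specific instance, so one resource's failure cannot show up in another resource's monitor:

```shell
#!/bin/sh
# Minimal OCF-style agent sketch. OCF_RESKEY_binary and its default
# path are hypothetical; OCF_RESOURCE_INSTANCE is set by Pacemaker.

: "${OCF_RESKEY_binary:=/usr/local/bin/mydaemon}"
PIDFILE="${TMPDIR:-/tmp}/${OCF_RESOURCE_INSTANCE:-mydaemon}.pid"

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

monitor() {
    # Check this instance's own pidfile, not the service in general,
    # so two similarly named resources never share a failure state.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null || return "$OCF_NOT_RUNNING"
    return "$OCF_SUCCESS"
}

start() {
    monitor && return "$OCF_SUCCESS"      # already running: idempotent
    "$OCF_RESKEY_binary" &
    echo $! > "$PIDFILE"
    monitor
}

stop() {
    monitor && kill "$(cat "$PIDFILE")"
    rm -f "$PIDFILE"
    return "$OCF_SUCCESS"                 # stop succeeds if already stopped
}

meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="mydaemon" version="0.1">
  <version>1.0</version>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

case "${1:-}" in
    start)     start ;;
    stop)      stop ;;
    monitor)   monitor ;;
    meta-data) meta_data ;;
    "")        : ;;          # no action given: do nothing (allows sourcing)
    *)         exit 3 ;;     # OCF_ERR_UNIMPLEMENTED
esac
```

With an agent shaped like this, setting the fail count on one clone should have no way to influence the other's monitor result, which is exactly the isolation the "anything" agent was failing to provide in this thread.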