On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote:
> I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, the
> latest stable version in CentOS 7.3.
>
> Gerard
Interesting ... this was an undetected bug that was coincidentally
fixed by the recent fail-count work released in 1.1.17. The bug only
affected cloned resources where one clone's name ended with the
other's. FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.
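To see why the two names interact at all: fail counts are stored as
per-node attributes named after the resource (e.g.
fail-count-forwarder), and a lookup that matches on the name's suffix
conflates two resources when one name ends with the other's. Here is a
hypothetical sketch of the shape of the problem (this is not
Pacemaker's actual code):

#!/bin/sh
# Attribute written by: crm_failcount -r bgpforwarder -v INFINITY
attr="fail-count-bgpforwarder"

# Buggy lookup: a suffix match on the resource name also hits the
# unrelated clone, because "fail-count-bgpforwarder" happens to end
# in "forwarder".
case "$attr" in
    *forwarder) echo "suffix match: counted against forwarder too (wrong)" ;;
esac

# Fixed lookup: require the full, exact attribute name.
if [ "$attr" = "fail-count-forwarder" ]; then
    echo "exact match: counted against forwarder"
else
    echo "exact match: no hit for forwarder (correct)"
fi

That also fits your renaming test: forwarderbgp no longer ends in
forwarder, so the false match can't fire. Querying both counters
(crm_failcount -G -r forwarder and crm_failcount -G -r bgpforwarder)
right after setting one should confirm whether the stored attributes
themselves are distinct.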
> On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot <kgail...@redhat.com>
> wrote:
> > On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> > > So I think I found the problem. The two resources are named
> > > forwarder and bgpforwarder. It doesn't matter if bgpforwarder
> > > exists. It is just that when I set the failcount to INFINITY for
> > > a resource named bgpforwarder (crm_failcount -r bgpforwarder -v
> > > INFINITY), it directly affects the forwarder resource.
> > >
> > > If I change the name to forwarderbgp, the problem disappears. So
> > > it seems that the problem is that Pacemaker mixes up the
> > > bgpforwarder and forwarder names. Is it a bug?
> > >
> > > Gerard
> >
> > That's really surprising. What version of Pacemaker are you using?
> > There were a lot of changes in fail-count handling in the last few
> > releases.
> >
> > > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <ger...@talaia.io>
> > > wrote:
> > > > That makes sense. I've tried copying the anything resource and
> > > > changing its name and id (which I guess should be enough to
> > > > make Pacemaker think they are different), but I still have the
> > > > same problem.
> > > >
> > > > After more debugging I have reduced the problem to this:
> > > > * First cloned resource running fine.
> > > > * Second cloned resource running fine.
> > > > * Manually set the failcount to INFINITY on the second cloned
> > > >   resource.
> > > > * Pacemaker triggers a stop operation (without the monitor
> > > >   operation failing) for both resources on the node where the
> > > >   failcount has been set to INFINITY.
> > > > * Resetting the failcount starts the two resources again.
> > > >
> > > > Weirdly enough, the second resource doesn't stop if I set the
> > > > first resource's failcount to INFINITY (not even the first
> > > > resource stops...).
> > > >
> > > > But:
> > > > * If I set the first resource as globally-unique=true, it does
> > > >   not stop, so somehow this breaks the relation.
> > > > * If I manually set the failcount to 0 on the first resource,
> > > >   that also breaks the relation, so it does not stop either. It
> > > >   seems like the failcount value is being inherited from the
> > > >   second resource when it does not have any value.
> > > >
> > > > I must have something wrongly configured, but I can't really
> > > > see why there is this relationship...
> > > >
> > > > Gerard
> > > >
> > > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot
> > > > <kgaillot@redhat.com> wrote:
> > > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > > > Thanks Ken. Yes, inspecting the logs, it seems that the
> > > > > > failcount of the correctly running resource reaches the
> > > > > > maximum number of allowed failures and it gets banned on
> > > > > > all nodes.
> > > > > >
> > > > > > What is weird is that I only see the failcount for the
> > > > > > first resource getting updated; it is like the failcounts
> > > > > > are being mixed. In fact, when the two resources get
> > > > > > banned, the only way I have to make the first one start is
> > > > > > to disable the failing one and clean the failcount of both
> > > > > > resources (it is not enough to only clean the failcount of
> > > > > > the first resource). Does it make sense?
> > > > > >
> > > > > > Gerard
> > > > >
> > > > > My suspicion is that you have two instances of the same
> > > > > service, and the resource agent monitor is only checking the
> > > > > general service, rather than a specific instance of it, so
> > > > > the monitors on both of them return failure if either one is
> > > > > failing.
> > > > >
> > > > > That would explain why you have to disable the failing
> > > > > resource, so its monitor stops running. I can't think of why
> > > > > you'd have to clean its failcount for the other one to
> > > > > start, though.
> > > > >
> > > > > The "anything" agent very often causes more problems than it
> > > > > solves ... I'd recommend writing your own OCF agent tailored
> > > > > to your service. It's not much more complicated than an init
> > > > > script.
> > > > >
> > > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot
> > > > > > <kgaillot@redhat.com> wrote:
> > > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have a cluster with two ocf:heartbeat:anything
> > > > > > > > resources, each one running as a clone on all nodes of
> > > > > > > > the cluster. For some reason, when one of them fails
> > > > > > > > to start, the other one stops. There is not any
> > > > > > > > constraint configured or any kind of relation between
> > > > > > > > them.
> > > > > > > >
> > > > > > > > Is it possible that there is some kind of implicit
> > > > > > > > relation that I'm not aware of (for example because
> > > > > > > > they are the same type)?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Gerard
> > > > > > >
> > > > > > > There is no implicit relation on the Pacemaker side.
> > > > > > > However, if the agent returns "failed" for both
> > > > > > > resources when either one fails, you could see something
> > > > > > > like that. I'd look at the logs on the DC and see why it
> > > > > > > decided to restart the second resource.
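On the side topic of replacing ocf:heartbeat:anything: a bare-bones
OCF agent really is small. Below is a minimal sketch for a
hypothetical "forwarder" daemon (the /usr/local/bin/forwarder path,
its --pid-file option, and the pid file location are all assumed
placeholders; the Dummy agent shipped with resource-agents is a
better starting template):

#!/bin/sh
# Sketch of a minimal OCF agent for a hypothetical "forwarder" daemon.

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

# One pid file per resource instance, so each clone's monitor checks
# only its own daemon rather than "any forwarder-like process".
PIDFILE="/var/run/${OCF_RESOURCE_INSTANCE}.pid"

meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<resource-agent name="forwarder" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Example agent managing one forwarder instance.</longdesc>
  <shortdesc lang="en">forwarder</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

forwarder_monitor() {
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

forwarder_start() {
    forwarder_monitor && return $OCF_SUCCESS
    /usr/local/bin/forwarder --pid-file "$PIDFILE" || return $OCF_ERR_GENERIC
    forwarder_monitor
}

forwarder_stop() {
    if forwarder_monitor; then
        kill "$(cat "$PIDFILE")"
        rm -f "$PIDFILE"
    fi
    return $OCF_SUCCESS   # stop must report success when already stopped
}

case "$1" in
    meta-data) meta_data; exit $OCF_SUCCESS ;;
    start)     forwarder_start ;;
    stop)      forwarder_stop ;;
    monitor)   forwarder_monitor ;;
    *)         exit $OCF_ERR_UNIMPLEMENTED ;;
esac

Because the monitor checks a specific instance's pid file, a failure
of one clone cannot be misreported against the other, which is the
trap the "anything" agent can fall into.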
--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org