So I think I found the problem. The two resources are named forwarder and bgpforwarder. It doesn't even matter whether bgpforwarder actually exists: just setting the fail count to INFINITY for a resource named bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) directly affects the forwarder resource.
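The name overlap above (bgpforwarder ends in "forwarder") would be consistent with fail-count attribute names being matched by an unanchored or suffix-style pattern instead of exactly. That is only a hypothesis, but it is easy to see which names such a match would conflate. The attribute names below are illustrative, modeled on the fail-count-<resource> naming in the CIB status section:

```shell
#!/bin/sh
# Illustrative fail-count attribute names as they might appear in the CIB
# status section (hypothetical; a globally-unique clone instance would get
# an extra :N suffix, e.g. fail-count-forwarder:0).
attrs="fail-count-forwarder
fail-count-bgpforwarder
fail-count-forwarderbgp"

# A suffix match for resource "forwarder" also picks up bgpforwarder's
# attribute (its name ends in "forwarder") but not forwarderbgp's:
printf '%s\n' "$attrs" | grep 'forwarder$'
# prints:
#   fail-count-forwarder
#   fail-count-bgpforwarder
```

If something like that is happening inside Pacemaker, it would also line up with globally-unique=true making the problem disappear, since the :N instance suffix would stop the attribute name from ending in the plain resource name.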
If I change the name to forwarderbgp, the problem disappears. So it seems that Pacemaker mixes up the bgpforwarder and forwarder names. Is it a bug?

Gerard

On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia <ger...@talaia.io> wrote:
> That makes sense. I've tried copying the anything resource and changing its
> name and id (which I guess should be enough to make Pacemaker treat them as
> different) but I still have the same problem.
>
> After more debugging I have reduced the problem to this:
> * First cloned resource running fine.
> * Second cloned resource running fine.
> * Manually set the fail count of the second cloned resource to INFINITY.
> * Pacemaker triggers a stop operation (without any monitor operation
> failing) for both resources on the node where the fail count has been set
> to INFINITY.
> * Resetting the fail count starts both resources again.
>
> Oddly enough, the second resource doesn't stop if I set the first
> resource's fail count to INFINITY (not even the first resource stops...).
>
> But:
> * If I set globally-unique=true on the first resource it does not stop, so
> somehow that breaks the relation.
> * If I manually set the fail count of the first resource to 0, that also
> breaks the relation, so it does not stop either. It seems like the fail
> count value is inherited from the second resource when the first does not
> have any value of its own.
>
> I must have something configured wrongly, but I can't really see why this
> relationship exists...
>
> Gerard
>
> On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot <kgail...@redhat.com> wrote:
>> On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
>>> Thanks Ken. Yes, inspecting the logs, it seems that the fail count of
>>> the correctly running resource reaches the maximum number of allowed
>>> failures and the resource gets banned on all nodes.
>>>
>>> What is weird is that I only see the fail count of the first resource
>>> being updated; it is as if the fail counts were being mixed.
>>> In fact, when the two resources get banned, the only way I have to make
>>> the first one start again is to disable the failing one and clean the
>>> fail count of both resources (it is not enough to clean only the fail
>>> count of the first resource). Does that make sense?
>>>
>>> Gerard
>>
>> My suspicion is that you have two instances of the same service, and the
>> resource agent's monitor is only checking the general service, rather
>> than a specific instance of it, so the monitors on both of them return
>> failure if either one is failing.
>>
>> That would explain why you have to disable the failing resource, so that
>> its monitor stops running. I can't think of why you'd have to clean its
>> fail count for the other one to start, though.
>>
>> The "anything" agent very often causes more problems than it solves ...
>> I'd recommend writing your own OCF agent tailored to your service. It's
>> not much more complicated than an init script.
>>
>>> On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot <kgail...@redhat.com>
>>> wrote:
>>>> On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
>>>>> Hi,
>>>>>
>>>>> I have a cluster with two ocf:heartbeat:anything resources, each one
>>>>> running as a clone on all nodes of the cluster. For some reason, when
>>>>> one of them fails to start, the other one stops. There is no
>>>>> constraint configured or any other kind of relation between them.
>>>>>
>>>>> Is it possible that there is some kind of implicit relation that I'm
>>>>> not aware of (for example because they are of the same type)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Gerard
>>>>
>>>> There is no implicit relation on the Pacemaker side. However, if the
>>>> agent returns "failed" for both resources when either one fails, you
>>>> could see something like that. I'd look at the logs on the DC and see
>>>> why it decided to restart the second resource.
>>>> --
>>>> Ken Gaillot <kgail...@redhat.com>
>>
>> --
>> Ken Gaillot <kgail...@redhat.com>
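For anyone landing on this thread later: the per-node fail counts and the transient attributes behind them can be inspected directly, which makes this kind of name mix-up visible. A few diagnostic sketches using the resource names from this thread, to be run on a cluster node (if any option spellings differ on your Pacemaker version, check the man pages):

```shell
# Query the fail count recorded for each resource on this node:
crm_failcount -G -r forwarder
crm_failcount -G -r bgpforwarder

# Ask attrd for the underlying transient node attribute directly:
attrd_updater -Q -n fail-count-forwarder

# Dump every fail-count-* nvpair from the CIB status section:
cibadmin -Q --xpath "//nvpair[starts-with(@name, 'fail-count-')]"
```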
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org