Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
I'm so lucky :) Thanks for your help!

Gerard

On Thu, Oct 19, 2017 at 12:04 AM, Ken Gaillot wrote:
> Interesting ... this was an undetected bug that was coincidentally
> fixed by the recent fail-count work released in 1.1.17. The bug only
> affected cloned resources where one clone's name ended with the
> other's.
>
> FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote:
> I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that's
> the latest stable version in CentOS 7.3.
>
> Gerard

Interesting ... this was an undetected bug that was coincidentally
fixed by the recent fail-count work released in 1.1.17. The bug only
affected cloned resources where one clone's name ended with the
other's.

FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.
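[Editor's note] The failure mode Ken describes can be pictured with a toy sketch: Pacemaker tracks failures in per-resource attributes (fail-count-&lt;name&gt;), and a lookup that merely checks whether an attribute name *ends with* the resource name, instead of comparing the full name exactly, makes bgpforwarder's attribute count against forwarder too. This is an illustration of the bug class only, not Pacemaker's actual code:

```shell
# Toy illustration (NOT Pacemaker's real code) of why a clone name that
# ends with another clone's name can mix up their fail counts.

# Buggy lookup: treat any attribute that merely ENDS with the resource
# name as belonging to that resource.
buggy_match() {
    case "$2" in
        *"$1") return 0 ;;   # "fail-count-bgpforwarder" ends with "forwarder"
    esac
    return 1
}

# Fixed lookup: require the exact attribute name.
exact_match() {
    [ "$2" = "fail-count-$1" ]
}

attr="fail-count-bgpforwarder"   # attribute written for bgpforwarder

buggy_match forwarder "$attr" &&
    echo "buggy: bgpforwarder's failures counted against forwarder"
exact_match forwarder "$attr" ||
    echo "exact: no collision"
```

This also shows why renaming to forwarderbgp made the problem disappear: "fail-count-forwarderbgp" no longer ends with "forwarder".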
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that's
the latest stable version in CentOS 7.3.

Gerard

On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot wrote:
> That's really surprising. What version of Pacemaker are you using?
> There were a lot of changes in fail-count handling in the last few
> releases.
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> So I think I found the problem. The two resources are named forwarder
> and bgpforwarder. It doesn't matter whether bgpforwarder even exists:
> just setting the failcount to INFINITY for a resource named
> bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) directly
> affects the forwarder resource.
>
> If I change the name to forwarderbgp, the problem disappears. So it
> seems that Pacemaker mixes up the bgpforwarder and forwarder names.
> Is it a bug?
>
> Gerard

That's really surprising. What version of Pacemaker are you using?
There were a lot of changes in fail-count handling in the last few
releases.
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
So I think I found the problem. The two resources are named forwarder
and bgpforwarder. It doesn't matter whether bgpforwarder even exists:
just setting the failcount to INFINITY for a resource named
bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) directly
affects the forwarder resource.

If I change the name to forwarderbgp, the problem disappears. So it
seems that Pacemaker mixes up the bgpforwarder and forwarder names. Is
it a bug?

Gerard

On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia wrote:
> That makes sense. I've tried copying the anything resource and
> changed its name and id (which I guess should be enough to make
> Pacemaker treat them as different) but I still have the same problem.
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
That makes sense. I've tried copying the anything resource and changing
its name and id (which I guess should be enough to make Pacemaker treat
them as different), but I still have the same problem.

After more debugging I have reduced the problem to this:
* First cloned resource: running fine
* Second cloned resource: running fine
* Manually set the failcount to INFINITY on the second cloned resource
* Pacemaker triggers a stop operation (without any monitor operation
  failing) for both resources on the node where the failcount was set
  to INFINITY
* Resetting the failcount starts both resources again

Weirdly enough, the second resource doesn't stop if I set the first
resource's failcount to INFINITY (not even the first resource
stops...).

But:
* If I set the first resource as globally-unique=true it does not stop,
  so somehow this breaks the relation.
* If I manually set the failcount to 0 on the first resource, that also
  breaks the relation, so it does not stop either. It seems as if the
  failcount value is inherited from the second resource when the first
  has no value of its own.

I must have something wrongly configured, but I can't really see why
this relationship exists...

Gerard

On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot wrote:
> My suspicion is that you have two instances of the same service, and
> the resource agent monitor is only checking the general service,
> rather than a specific instance of it, so the monitors on both of
> them return failure if either one is failing.
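[Editor's note] The reduction above maps directly onto the CLI. A sketch of the sequence against a throwaway test cluster, reusing the resource names from this thread; only `crm_failcount -r ... -v INFINITY` appears in the thread itself, so the `-G` query option and `crm_resource --cleanup` reset are assumed here as the standard companions:

```shell
# Reproduction sketch -- run only against a disposable test cluster
# with two cloned resources named forwarder and bgpforwarder.

# Push bgpforwarder's fail count to INFINITY on the local node:
crm_failcount -r bgpforwarder -v INFINITY

# On the affected versions, forwarder is now stopped on this node too,
# even though no monitor failed; inspect both fail counts:
crm_failcount -r forwarder -G
crm_failcount -r bgpforwarder -G

# Clearing the fail counts lets both clones start again:
crm_resource --cleanup -r bgpforwarder
crm_resource --cleanup -r forwarder
```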
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> Thanks Ken. Yes, inspecting the logs, it seems that the failcount of
> the correctly running resource reaches the maximum number of allowed
> failures and it gets banned on all nodes.
>
> What is weird is that I only see the failcount for the first resource
> getting updated; it's as if the failcounts are being mixed. In fact,
> when the two resources get banned, the only way I can make the first
> one start again is to disable the failing one and clean the failcount
> of both resources (it is not enough to clean only the failcount of
> the first resource). Does that make sense?
>
> Gerard

My suspicion is that you have two instances of the same service, and
the resource agent's monitor is only checking the general service,
rather than a specific instance of it, so the monitors on both of them
return failure if either one is failing.

That would explain why you have to disable the failing resource, so
its monitor stops running. I can't think of why you'd have to clean
its failcount for the other one to start, though.

The "anything" agent very often causes more problems than it solves
... I'd recommend writing your own OCF agent tailored to your service.
It's not much more complicated than an init script.
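[Editor's note] Ken's suggestion of a tailored OCF agent mostly amounts to implementing start/stop/monitor with the right exit codes. Below is a minimal sketch: the daemon, its arguments, and the pidfile path are placeholders, and the mandatory meta-data action (the XML self-description) is omitted. The key design point for this thread: monitor checks THIS instance's own pidfile, never the service name in general, so two similar resources cannot report each other's failures.

```shell
#!/bin/sh
# Minimal OCF-style agent skeleton (sketch; adapt names and paths).

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/tmp/myservice.pid}"   # one pidfile per resource
DAEMON="${DAEMON:-/bin/sleep}"             # placeholder daemon
DAEMON_ARGS="${DAEMON_ARGS:-3600}"

agent_monitor() {
    # Only this instance's pid decides the result -- never a generic
    # "is the service name running somewhere" check.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null || return "$OCF_NOT_RUNNING"
    return "$OCF_SUCCESS"
}

agent_start() {
    agent_monitor && return "$OCF_SUCCESS"   # idempotent start
    $DAEMON $DAEMON_ARGS &
    echo $! > "$PIDFILE"
    agent_monitor
}

agent_stop() {
    if agent_monitor; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
    fi
    rm -f "$PIDFILE"
    return "$OCF_SUCCESS"    # stopping an already-stopped service succeeds
}

case "${1:-}" in
    start)   agent_start ;;
    stop)    agent_stop ;;
    monitor) agent_monitor ;;
    *)       : ;;            # meta-data, validate-all etc. omitted here
esac
```

Monitoring a per-instance pidfile is what the generic "anything" agent makes hard to get right when two copies of the same binary run side by side.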
Thanks Ken. Yes, inspecting the logs, it seems that the failcount of the
correctly running resource reaches the maximum number of allowed failures
and the resource gets banned on all nodes.

What is weird is that I only see the failcount for the first resource
being updated; it is as if the failcounts are being mixed. In fact, when
the two resources get banned, the only way I can make the first one start
is to disable the failing one and clean the failcounts of both resources
(it is not enough to clean only the failcount of the first resource).
Does that make sense?

Gerard

On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot wrote:
> On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > Hi,
> >
> > I have a cluster with two ocf:heartbeat:anything resources, each one
> > running as a clone on all nodes of the cluster. For some reason, when
> > one of them fails to start, the other one stops. There is no
> > constraint configured or any kind of relation between them.
> >
> > Is it possible that there is some kind of implicit relation that I'm
> > not aware of (for example, because they are of the same type)?
> >
> > Thanks,
> >
> > Gerard
>
> There is no implicit relation on the Pacemaker side. However, if the
> agent returns "failed" for both resources when either one fails, you
> could see something like that. I'd look at the logs on the DC and see
> why it decided to restart the second resource.
> --
> Ken Gaillot
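Later in the thread, this "mixing" was traced to a Pacemaker bug, fixed as a side effect of the 1.1.17 fail-count rework, that affected clone pairs where one resource's name ends with the other's (here, forwarder and bgpforwarder). As a rough sketch of how that class of bug can arise (an illustration only, not Pacemaker's actual code), consider a fail-count lookup whose pattern is not anchored to the exact resource name:

```shell
#!/bin/sh
# Sketch (NOT Pacemaker's actual implementation) of how matching
# fail-count attributes without anchoring the full resource name
# conflates "forwarder" with "bgpforwarder".

# Simulated per-node attributes, one "name=value" per line:
ATTRS='fail-count-forwarder=2
fail-count-bgpforwarder=INFINITY'

# Buggy lookup: any attribute whose name merely *ends* with the
# resource name matches, so bgpforwarder's value leaks in.
buggy_failcount() {
    printf '%s\n' "$ATTRS" | sed -n "s/^fail-count-.*$1=\(.*\)$/\1/p" | tail -n 1
}

# Correct lookup: the attribute name must match exactly.
exact_failcount() {
    printf '%s\n' "$ATTRS" | sed -n "s/^fail-count-$1=\(.*\)$/\1/p"
}

echo "buggy lookup for forwarder: $(buggy_failcount forwarder)"
echo "exact lookup for forwarder: $(exact_failcount forwarder)"
```

In a real cluster, failcounts are manipulated per resource with crm_failcount (used with -v INFINITY earlier in the thread) or cleared along with the failed operation history via crm_resource --cleanup; as the thread concludes, upgrading to a release with the 1.1.17 fail-count changes removes the name-suffix confusion itself.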
On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> Hi,
>
> I have a cluster with two ocf:heartbeat:anything resources, each one
> running as a clone on all nodes of the cluster. For some reason, when
> one of them fails to start, the other one stops. There is no
> constraint configured or any kind of relation between them.
>
> Is it possible that there is some kind of implicit relation that I'm
> not aware of (for example, because they are of the same type)?
>
> Thanks,
>
> Gerard

There is no implicit relation on the Pacemaker side. However, if the
agent returns "failed" for both resources when either one fails, you
could see something like that. I'd look at the logs on the DC and see
why it decided to restart the second resource.

--
Ken Gaillot
[ClusterLabs] When resource fails to start it stops an apparently unrelated resource
Hi,

I have a cluster with two ocf:heartbeat:anything resources, each one
running as a clone on all nodes of the cluster. For some reason, when one
of them fails to start, the other one stops. There is no constraint
configured or any kind of relation between them.

Is it possible that there is some kind of implicit relation that I'm not
aware of (for example, because they are of the same type)?

Thanks,

Gerard
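For context, a setup like the one described might be created roughly as follows. This is a hypothetical reconstruction, not the poster's actual configuration: the binfile paths are assumptions, the resource names come from later in the thread, and the pcs syntax shown is the 0.9-era form:

```shell
# Hypothetical reconstruction (binfile paths are assumptions).
# Two independent ocf:heartbeat:anything resources, each cloned so
# one instance runs on every node of the cluster:
pcs resource create forwarder ocf:heartbeat:anything \
    binfile=/usr/local/bin/forwarder --clone
pcs resource create bgpforwarder ocf:heartbeat:anything \
    binfile=/usr/local/bin/bgpforwarder --clone
```

Nothing in such a configuration relates the two clones; any coupling between them has to come from the agent's behavior or, as it turned out here, from a fail-count handling bug.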