Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-19 Thread Gerard Garcia
I'm so lucky :) Thanks for your help!

Gerard


Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-18 Thread Ken Gaillot
On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote:
> I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that is
> the latest stable version in CentOS 7.3.
> 
> Gerard

Interesting ... this was an undetected bug that was coincidentally
fixed by the recent fail-count work released in 1.1.17. The bug only
affected cloned resources where one clone's name ended with the
other's.

FYI, CentOS 7.4 has 1.1.16, but that won't help this issue.
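
Fail counts are kept as transient node attributes named
"fail-count-<resource>", so a match that only checks the end of the
attribute name collides in exactly this way. A rough illustration of that
kind of mistake (illustration only, not the actual Pacemaker source):

    # "fail-count-bgpforwarder" would wrongly be counted against
    # "forwarder" if the code only checked that the attribute name
    # ends with the resource name.
    attr="fail-count-bgpforwarder"
    rsc="forwarder"
    case "$attr" in
        *"$rsc") echo "$attr treated as a fail count for $rsc" ;;
    esac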


Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-18 Thread Gerard Garcia
I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know, that is the
latest stable version in CentOS 7.3.

Gerard


Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-18 Thread Ken Gaillot
On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> So I think I found the problem. The two resources are named forwarder
> and bgpforwarder. It doesn't even matter whether bgpforwarder exists: it
> is just that when I set the failcount to INFINITY for a resource named
> bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) it directly
> affects the forwarder resource.
> 
> If I change the name to forwarderbgp, the problem disappears. So it
> seems that Pacemaker is mixing up the bgpforwarder and forwarder names.
> Is it a bug?
> 
> Gerard

That's really surprising. What version of pacemaker are you using?
There were a lot of changes in fail count handling in the last few
releases.


Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-18 Thread Gerard Garcia
So I think I found the problem. The two resources are named forwarder and
bgpforwarder. It doesn't even matter whether bgpforwarder exists: it is just
that when I set the failcount to INFINITY for a resource named bgpforwarder
(crm_failcount -r bgpforwarder -v INFINITY) it directly affects the
forwarder resource.
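
(For reference, the related fail-count commands as found in the 1.1.x
tools; add -N <node> to target a specific node:)

    crm_failcount -r bgpforwarder -v INFINITY  # set the fail count, as above
    crm_failcount -r bgpforwarder -G           # query the current value
    crm_failcount -r bgpforwarder -D           # delete (reset) it
    crm_resource --cleanup -r bgpforwarder     # clear fail count and failed ops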

If I change the name to forwarderbgp, the problem disappears. So it seems
that Pacemaker is mixing up the bgpforwarder and forwarder names. Is it a
bug?

Gerard


Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-17 Thread Gerard Garcia
That makes sense. I've tried copying the anything resource and changing its
name and id (which I guess should be enough to make Pacemaker think they
are different), but I still have the same problem.

After more debugging I have reduced the problem to this:
* First cloned resource running fine
* Second cloned resource running fine
* Manually set the failcount to INFINITY on the second cloned resource
* Pacemaker triggers a stop operation (without any monitor operation
failing) for the two resources on the node where the failcount has been set
to INFINITY.
* Resetting the failcount starts the two resources again

Weirdly enough, the second resource doesn't stop if I set the first
resource's failcount to INFINITY (not even the first resource stops...).

But:
* If I set globally-unique=true on the first resource, it does not stop, so
somehow this breaks the relation (one way to set it is sketched below).
* If I manually set the failcount to 0 on the first resource, that also
breaks the relation, so it does not stop either. It seems like the failcount
value is inherited from the second resource when the first one has no value
of its own.
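
One way to set the clone meta-attribute mentioned above (the clone id
"forwarder-clone" is a guess; substitute the actual clone's id):

    crm_resource -r forwarder-clone --meta -p globally-unique -v true

With globally-unique=true the clone instances are tracked under
per-instance names such as forwarder:0, which also changes the names the
fail counts are recorded under; that may be why it sidesteps the problem.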

I must have something wrongly configured, but I can't really see why this
relationship exists...

Gerard



Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-17 Thread Ken Gaillot
On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> Thanks Ken. Yes, inspecting the logs it seems that the failcount of the
> correctly running resource reaches the maximum number of allowed
> failures and the resource gets banned on all nodes.
> 
> What is weird is that I only see the failcount for the first
> resource being updated; it is as if the failcounts were being mixed. In
> fact, when the two resources get banned, the only way I have to make
> the first one start is to disable the failing one and clean the
> failcount of both resources (it is not enough to clean only the
> failcount of the first resource). Does that make sense?
> 
> Gerard

My suspicion is that you have two instances of the same service, and
the resource agent monitor is only checking the general service, rather
than a specific instance of it, so the monitors on both of them return
failure if either one is failing.

That would explain why you have to disable the failing resource, so that
its monitor stops running. I can't think of why you'd have to clean its
failcount for the other one to start, though.

The "anything" agent very often causes more problems than it solves ...
I'd recommend writing your own OCF agent tailored to your service.
It's not much more complicated than an init script.
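
A minimal sketch of such an agent, assuming a daemon that writes a pid file
(all names, paths, and the daemon's --pidfile flag are placeholders):

    #!/bin/sh
    # Minimal OCF resource agent sketch; adapt names, paths and metadata.
    : ${OCF_ROOT:=/usr/lib/ocf}
    . ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

    PIDFILE="/var/run/myservice.pid"      # placeholder
    DAEMON="/usr/local/bin/myservice"     # placeholder

    myservice_monitor() {
        # Check this specific instance via its pid file, rather than the
        # service in general, so two resources never report each other's
        # state.
        [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING
        kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }

    myservice_start() {
        myservice_monitor && return $OCF_SUCCESS
        "$DAEMON" --pidfile "$PIDFILE" || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }

    myservice_stop() {
        # A production agent should wait until the process is really gone
        # before returning.
        if myservice_monitor; then
            kill "$(cat "$PIDFILE")" || return $OCF_ERR_GENERIC
            rm -f "$PIDFILE"
        fi
        return $OCF_SUCCESS   # stop must succeed when already stopped
    }

    case "$1" in
    start)     myservice_start ;;
    stop)      myservice_stop ;;
    monitor)   myservice_monitor ;;
    meta-data) cat <<EOF
    <?xml version="1.0"?>
    <resource-agent name="myservice" version="0.1">
      <version>1.0</version>
      <shortdesc lang="en">Placeholder example agent</shortdesc>
      <longdesc lang="en">Minimal sketch, not for production use.</longdesc>
      <parameters/>
      <actions>
        <action name="start" timeout="20s"/>
        <action name="stop" timeout="20s"/>
        <action name="monitor" timeout="20s" interval="10s"/>
        <action name="meta-data" timeout="5s"/>
      </actions>
    </resource-agent>
    EOF
    ;;
    *) exit $OCF_ERR_UNIMPLEMENTED ;;
    esac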

-- 
Ken Gaillot 



Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-17 Thread Gerard Garcia
Thanks Ken. Yes, inspecting the logs it seems that the failcount of the
correctly running resource reaches the maximum number of allowed failures
and the resource gets banned on all nodes.

What is weird is that I only see the failcount for the first resource being
updated; it is as if the failcounts were being mixed. In fact, when the two
resources get banned, the only way I have to make the first one start is to
disable the failing one and clean the failcount of both resources (it is not
enough to clean only the failcount of the first resource). Does that make
sense?
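
(With the pcs tooling on CentOS 7, the recovery described above looks
roughly like this, using the resource names that come up elsewhere in the
thread; the clone id is a guess:)

    pcs resource disable bgpforwarder-clone   # stop the failing clone
    pcs resource cleanup bgpforwarder         # clear its failed ops/fail count
    pcs resource cleanup forwarder            # the other resource needs it too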

Gerard



Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-16 Thread Ken Gaillot
On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> Hi,
> 
> I have a cluster with two ocf:heartbeat:anything resources, each one
> running as a clone on all nodes of the cluster. For some reason, when
> one of them fails to start, the other one stops. There is no
> constraint configured or any other kind of relation between them.
> 
> Is it possible that there is some kind of implicit relation that I'm
> not aware of (for example, because they are the same type)?
> 
> Thanks,
> 
> Gerard

There is no implicit relation on the Pacemaker side. However, if the
agent returns "failed" for both resources when either one fails, you
could see something like that. I'd look at the logs on the DC and see
why it decided to restart the second resource.
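
For example (the exact log location varies by distribution):

    crm_mon -1 | grep "Current DC"    # find which node is the DC
    # then, on that node (CentOS 7 paths assumed):
    grep -E "pengine|crmd" /var/log/messages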
-- 
Ken Gaillot 



[ClusterLabs] When resource fails to start it stops an apparently unrelated resource

2017-10-16 Thread Gerard Garcia
Hi,

I have a cluster with two ocf:heartbeat:anything resources, each one running
as a clone on all nodes of the cluster. For some reason, when one of them
fails to start, the other one stops. There is no constraint configured or
any other kind of relation between them.

Is it possible that there is some kind of implicit relation that I'm not
aware of (for example, because they are the same type)?

Thanks,

Gerard
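
(A configuration along these lines matches the description; the resource
names are the ones that come up elsewhere in the thread, and the binfile
paths are placeholders:)

    pcs resource create forwarder ocf:heartbeat:anything \
        binfile=/usr/local/bin/forwarder --clone
    pcs resource create bgpforwarder ocf:heartbeat:anything \
        binfile=/usr/local/bin/bgpforwarder --clone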
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org