Re: [ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 09:09 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  wrote on 27.06.2018 at
> > > > 16:18 in message
> 
> <1530109097.6452.1.ca...@redhat.com>:
> > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > > > Ken Gaillot  wrote on 26.06.2018 at
> > > > > > 18:22 in message
> > > 
> > > <1530030128.5202.5.ca...@redhat.com>:
> > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > Hi!
> > > > > > 
> > > > > > We just observed a strange effect we cannot explain in
> > > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
> > > > > > We run about a dozen Xen PVMs on a three-node cluster (plus
> > > > > > some infrastructure and monitoring stuff). It has all worked
> > > > > > well so far, and there was no significant change recently.
> > > > > > However, when a colleague stopped one VM for maintenance via
> > > > > > a cluster command, the cluster did not notice when the PVM
> > > > > > was actually running again (it had been started without
> > > > > > using the cluster; a bad idea, I know).
> > > > > 
> > > > > To be on the safe side in such cases, you'd probably want to
> > > > > enable an additional monitor for the "Stopped" role. The
> > > > > default one covers only the "Started" role. The same applies
> > > > > to multistate resources, where you need several monitor ops,
> > > > > for the "Started/Slave" and "Master" roles. But this will
> > > > > increase the load.
> > > > > And I believe the cluster should reprobe a resource on all
> > > > > nodes once you change target-role back to "Started".
> > > > 
> > > > Which raises the question, how did you stop the VM initially?
> > > 
> > > I thought "(...) stopped one VM for maintenance via cluster
> > > command"
> > > is obvious. It was something like "crm resource stop ...".
> > > 
> > > > 
> > > > If you stopped it by setting target-role to Stopped, the
> > > > cluster most likely still thinks it's stopped, and you need to
> > > > set it to Started again. If you instead set maintenance mode or
> > > > unmanaged the resource and then stopped the VM manually, it is
> > > > most likely still in that mode and needs to be taken out of it.
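
As a rough sketch (crmsh syntax, with a hypothetical resource name
rsc_vm1; adjust to your configuration), the two cases could be checked
and reverted like this:

    # Case 1: stopped via target-role (e.g. "crm resource stop"):
    # clear it by starting the resource through the cluster again
    crm resource start rsc_vm1

    # Case 2: maintenance mode or an unmanaged resource:
    # check for the relevant flags, then remove them
    crm configure show | grep -E 'maintenance-mode|is-managed'
    crm configure property maintenance-mode=false
    crm resource manage rsc_vm1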
> > > 
> > > The point was that when the command to start the resource was
> > > given, the cluster completely ignored the fact that it was
> > > already running and proceeded to start the VM on a second node
> > > (which could be disastrous). But that's leading away from the
> > > main question...
> > 
> > Ah, this is expected behavior when you start a resource manually
> > and there are no monitors with target-role=Stopped. If the node
> > where you manually started the VM isn't the same node the cluster
> > happens to choose, then you can get multiple active instances.
> > 
> > By default, the cluster assumes that where a probe found a resource
> > to be not running, that resource will stay not running unless
> > started by the cluster. (It will re-probe if the node goes away and
> > comes back.)
> 
> But didn't this behavior change? I thought it was different maybe a
> year ago or so.

Not that I know of. We have fixed some issues around probes, especially
around probing Pacemaker Remote connections and the resources running
on those nodes, and around ordering of various actions with probes.

> > If you wish to guard against resources being started outside
> > cluster control, configure a recurring monitor with
> > target-role=Stopped, and the cluster will run that on all nodes
> > where it thinks the resource is not supposed to be running. Of
> > course, since it has to poll at intervals, it can take up to that
> > much time to detect a manually started instance.
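
For example, in crmsh such a monitor could look roughly like this (a
sketch with hypothetical names, intervals and timeouts; note that in
the op definition the attribute is role="Stopped", while target-role
is the resource meta-attribute):

    # hypothetical Xen resource with an extra monitor for the Stopped
    # role; the two monitor ops must use different intervals
    crm configure primitive rsc_vm1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm1" \
        op monitor interval="600s" timeout="60s" \
        op monitor interval="630s" timeout="60s" role="Stopped" \
        meta target-role="Started"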
> 
> Did monitor roles exist always, or were those added some time ago?

They've always been around. Stopped is not commonly used, but separate
monitors for Master and Slave roles are commonplace.
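
For a master/slave resource the usual pattern looks roughly like this
(crmsh sketch with a hypothetical DRBD resource; the point is simply
that each role gets its own monitor op with a distinct interval):

    crm configure primitive rsc_drbd0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
    crm configure ms ms_drbd0 rsc_drbd0 \
        meta master-max="1" clone-max="2" notify="true"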

> > 
> > > > > > Examining the logs, it seems that the recheck timer popped
> > > > > > periodically, but no monitor action was run for the VM (the
> > > > > > action
> > > > > > is configured to run every 10 minutes).
> > 
> > Recurring monitors are only recorded in the log if their return
> > value
> > changed. If there are 10 successful monitors in a row and then a
> > failure, only the first success and the failure are logged.
> 
> OK, didn't know that.
> 
> 
> Thanks a lot for the explanations!
> 
> Regards,
> Ulrich
> > 
> > > > > > 
> > > > > > Actually the only monitor operations found were:
> > > > > > May 23 08:04:13
> > > > > > Jun 13 08:13:03
> > > > > > Jun 25 09:29:04
> > > > > > Then a manual "reprobe" was done, and several monitor
> > > > > > operations
> > > > > > were run.
> > > > > > Then again I see no more monitor actions in syslog.
> > > > > 

Re: [ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 09:13 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  wrote on 27.06.2018 at
> > > > 16:32 in message
> 
> <1530109926.6452.3.ca...@redhat.com>:
> > On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
> > > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > > > > Ken Gaillot  wrote on 26.06.2018 at
> > > > > > > 18:22 in message
> > > > 
> > > > <1530030128.5202.5.ca...@redhat.com>:
> > > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > > Hi!
> > > > > > > 
> > > > > > > We just observed a strange effect we cannot explain in
> > > > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
> > > > > > > We run about a dozen Xen PVMs on a three-node cluster
> > > > > > > (plus some infrastructure and monitoring stuff). It has
> > > > > > > all worked well so far, and there was no significant
> > > > > > > change recently.
> > > > > > > However, when a colleague stopped one VM for maintenance
> > > > > > > via a cluster command, the cluster did not notice when
> > > > > > > the PVM was actually running again (it had been started
> > > > > > > without using the cluster; a bad idea, I know).
> > > > > > 
> > > > > > To be on the safe side in such cases, you'd probably want
> > > > > > to enable an additional monitor for the "Stopped" role. The
> > > > > > default one covers only the "Started" role. The same
> > > > > > applies to multistate resources, where you need several
> > > > > > monitor ops, for the "Started/Slave" and "Master" roles.
> > > > > > But this will increase the load.
> > > > > > And I believe the cluster should reprobe a resource on all
> > > > > > nodes once you change target-role back to "Started".
> > > > > 
> > > > > Which raises the question, how did you stop the VM initially?
> > > > 
> > > > I thought "(...) stopped one VM for maintenance via cluster
> > > > command"
> > > > is obvious. It was something like "crm resource stop ...".
> > > > 
> > > > > 
> > > > > If you stopped it by setting target-role to Stopped, likely
> > > > > the
> > > > > cluster
> > > > > still thinks it's stopped, and you need to set it to Started
> > > > > again.
> > > > > If
> > > > > instead you set maintenance mode or unmanaged the resource,
> > > > > then
> > > > > stopped the VM manually, then most likely it's still in that
> > > > > mode
> > > > > and
> > > > > needs to be taken out of it.
> > > > 
> > > > The point was that when the command to start the resource was
> > > > given, the cluster completely ignored the fact that it was
> > > > already running and proceeded to start the VM on a second node
> > > > (which could be disastrous). But that's leading away from the
> > > > main question...
> > > 
> > > Ah, this is expected behavior when you start a resource manually,
> > > and
> > > there are no monitors with target-role=Stopped. If the node where
> > > you
> > > manually started the VM isn't the same node the cluster happens
> > > to
> > > choose, then you can get multiple active instances.
> > > 
> > > By default, the cluster assumes that where a probe found a
> > > resource
> > > to
> > > be not running, that resource will stay not running unless
> > > started by
> > > the cluster. (It will re-probe if the node goes away and comes
> > > back.)
> > > 
> > > If you wish to guard against resources being started outside
> > > cluster
> > > control, configure a recurring monitor with target-role=Stopped,
> > > and
> > > the cluster will run that on all nodes where it thinks the
> > > resource
> > > is
> > > not supposed to be running. Of course since it has to poll at
> > > intervals, it can take up to that much time to detect a manually
> > > started instance.
> > 
> > Alternatively, if you don't want the overhead of a recurring
> > monitor
> > but want to be able to address known manual starts yourself, you
> > can
> > force a full reprobe of the resource with "crm_resource -r
> > <resource id> --refresh".
> > 
> > If you do it before starting the resource via crm, the cluster will
> > stop the manually started instance, and then you can start it via
> > the
> > crm; if you do it after starting the resource via crm, there will
> > still
> > likely be two active instances, and the cluster will stop both and
> > start one again.
> > 
> > A way around that would be to unmanage the resource, start the
> > resource
> > via crm (which won't actually start anything due to being
> > unmanaged,
> > but will tell the cluster it's supposed to be started), force a
> > reprobe, then manage the resource again -- that should prevent
> > multiple active instances. However, if the cluster prefers a
> > different node, it may
> > still
> > stop the resource and start it in its preferred location.
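
Spelled out as commands, that sequence would look something like this
(a sketch with a hypothetical resource name rsc_vm1; crm_resource as
above, crmsh for the rest):

    crm resource unmanage rsc_vm1      # cluster stops acting on the resource
    crm resource start rsc_vm1         # sets target-role=Started; nothing starts yet
    crm_resource -r rsc_vm1 --refresh  # forget old probe results, reprobe all nodes
    crm resource manage rsc_vm1        # hand control back to the cluster

As noted elsewhere in the thread, some resource stickiness helps keep
the instance on the node where it was found rather than having it
moved to a preferred node.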
> > 

[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 27.06.2018 at 16:32 in
>>> message
<1530109926.6452.3.ca...@redhat.com>:
> On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
>> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > > Ken Gaillot  wrote on 26.06.2018 at
>> > > > > 18:22 in message
>> > 
>> > <1530030128.5202.5.ca...@redhat.com>:
>> > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > > Hi!
>> > > > > 
>> > > > > We just observed a strange effect we cannot explain in
>> > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
>> > > > > We run about a dozen Xen PVMs on a three-node cluster (plus
>> > > > > some infrastructure and monitoring stuff). It has all worked
>> > > > > well so far, and there was no significant change recently.
>> > > > > However, when a colleague stopped one VM for maintenance via
>> > > > > a cluster command, the cluster did not notice when the PVM
>> > > > > was actually running again (it had been started without
>> > > > > using the cluster; a bad idea, I know).
>> > > > 
>> > > > To be on the safe side in such cases, you'd probably want to
>> > > > enable an additional monitor for the "Stopped" role. The
>> > > > default one covers only the "Started" role. The same applies
>> > > > to multistate resources, where you need several monitor ops,
>> > > > for the "Started/Slave" and "Master" roles. But this will
>> > > > increase the load.
>> > > > And I believe the cluster should reprobe a resource on all
>> > > > nodes once you change target-role back to "Started".
>> > > 
>> > > Which raises the question, how did you stop the VM initially?
>> > 
>> > I thought "(...) stopped one VM for maintenance via cluster
>> > command"
>> > is obvious. It was something like "crm resource stop ...".
>> > 
>> > > 
>> > > If you stopped it by setting target-role to Stopped, likely the
>> > > cluster
>> > > still thinks it's stopped, and you need to set it to Started
>> > > again.
>> > > If
>> > > instead you set maintenance mode or unmanaged the resource, then
>> > > stopped the VM manually, then most likely it's still in that mode
>> > > and
>> > > needs to be taken out of it.
>> > 
>> > The point was that when the command to start the resource was
>> > given, the cluster completely ignored the fact that it was
>> > already running and proceeded to start the VM on a second node
>> > (which could be disastrous). But that's leading away from the
>> > main question...
>> 
>> Ah, this is expected behavior when you start a resource manually, and
>> there are no monitors with target-role=Stopped. If the node where you
>> manually started the VM isn't the same node the cluster happens to
>> choose, then you can get multiple active instances.
>> 
>> By default, the cluster assumes that where a probe found a resource
>> to
>> be not running, that resource will stay not running unless started by
>> the cluster. (It will re-probe if the node goes away and comes back.)
>> 
>> If you wish to guard against resources being started outside cluster
>> control, configure a recurring monitor with target-role=Stopped, and
>> the cluster will run that on all nodes where it thinks the resource
>> is
>> not supposed to be running. Of course since it has to poll at
>> intervals, it can take up to that much time to detect a manually
>> started instance.
> 
> Alternatively, if you don't want the overhead of a recurring monitor
> but want to be able to address known manual starts yourself, you can
> force a full reprobe of the resource with
> "crm_resource -r <resource id> --refresh".
> 
> If you do it before starting the resource via crm, the cluster will
> stop the manually started instance, and then you can start it via the
> crm; if you do it after starting the resource via crm, there will still
> likely be two active instances, and the cluster will stop both and
> start one again.
> 
> A way around that would be to unmanage the resource, start the resource
> via crm (which won't actually start anything due to being unmanaged,
> but will tell the cluster it's supposed to be started), force a
> reprobe, then manage the resource again -- that should prevent
> multiple active instances. However, if the cluster prefers a
> different node, it may still
> stop the resource and start it in its preferred location. (Stickiness
> could get around that.)

Hi!

Thanks again for that. One question comes to my mind: what is the
purpose of the cluster recheck interval? I thought it was exactly
that: finding resources that are not in the state they should be.
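
(For reference, I assume the setting in question is the
cluster-recheck-interval property; as a sketch, it can be inspected or
changed with something like:

    crm_attribute --type crm_config --name cluster-recheck-interval --query
    crm configure property cluster-recheck-interval="10min"
)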

Regards,
Ulrich


> 
>> 
>> > > > > Examining the logs, it seems that the recheck timer popped
>> > > > > periodically, but no monitor action was run for the VM (the
>> > > > > action
>> > > > > is configured to run every 10 minutes).
>> 
>> Recurring monitors are only recorded in the log if their return
>> value changed.
>> 

[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 27.06.2018 at 16:18 in
>>> message
<1530109097.6452.1.ca...@redhat.com>:
> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > Ken Gaillot  wrote on 26.06.2018 at
>> > > > 18:22 in message
>> 
>> <1530030128.5202.5.ca...@redhat.com>:
>> > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > Hi!
>> > > > 
>> > > > We just observed a strange effect we cannot explain in
>> > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
>> > > > We run about a dozen Xen PVMs on a three-node cluster (plus
>> > > > some infrastructure and monitoring stuff). It has all worked
>> > > > well so far, and there was no significant change recently.
>> > > > However, when a colleague stopped one VM for maintenance via
>> > > > a cluster command, the cluster did not notice when the PVM
>> > > > was actually running again (it had been started without using
>> > > > the cluster; a bad idea, I know).
>> > > 
>> > > To be on the safe side in such cases, you'd probably want to
>> > > enable an additional monitor for the "Stopped" role. The default
>> > > one covers only the "Started" role. The same applies to
>> > > multistate resources, where you need several monitor ops, for
>> > > the "Started/Slave" and "Master" roles. But this will increase
>> > > the load.
>> > > And I believe the cluster should reprobe a resource on all nodes
>> > > once you change target-role back to "Started".
>> > 
>> > Which raises the question, how did you stop the VM initially?
>> 
>> I thought "(...) stopped one VM for maintenance via cluster command"
>> is obvious. It was something like "crm resource stop ...".
>> 
>> > 
>> > If you stopped it by setting target-role to Stopped, likely the
>> > cluster
>> > still thinks it's stopped, and you need to set it to Started again.
>> > If
>> > instead you set maintenance mode or unmanaged the resource, then
>> > stopped the VM manually, then most likely it's still in that mode
>> > and
>> > needs to be taken out of it.
>> 
>> The point was that when the command to start the resource was
>> given, the cluster completely ignored the fact that it was already
>> running and proceeded to start the VM on a second node (which could
>> be disastrous). But that's leading away from the main question...
> 
> Ah, this is expected behavior when you start a resource manually, and
> there are no monitors with target-role=Stopped. If the node where you
> manually started the VM isn't the same node the cluster happens to
> choose, then you can get multiple active instances.
> 
> By default, the cluster assumes that where a probe found a resource to
> be not running, that resource will stay not running unless started by
> the cluster. (It will re-probe if the node goes away and comes back.)

But didn't this behavior change? I thought it was different maybe a
year ago or so.

> 
> If you wish to guard against resources being started outside cluster
> control, configure a recurring monitor with target-role=Stopped, and
> the cluster will run that on all nodes where it thinks the resource is
> not supposed to be running. Of course since it has to poll at
> intervals, it can take up to that much time to detect a manually
> started instance.

Did monitor roles exist always, or were those added some time ago?

> 
>> > > > Examining the logs, it seems that the recheck timer popped
>> > > > periodically, but no monitor action was run for the VM (the
>> > > > action
>> > > > is configured to run every 10 minutes).
> 
> Recurring monitors are only recorded in the log if their return value
> changed. If there are 10 successful monitors in a row and then a
> failure, only the first success and the failure are logged.

OK, didn't know that.


Thanks a lot for the explanations!

Regards,
Ulrich
> 
>> > > > 
>> > > > Actually the only monitor operations found were:
>> > > > May 23 08:04:13
>> > > > Jun 13 08:13:03
>> > > > Jun 25 09:29:04
>> > > > Then a manual "reprobe" was done, and several monitor
>> > > > operations
>> > > > were run.
>> > > > Then again I see no more monitor actions in syslog.
>> > > > 
>> > > > What could be the reasons for this? Too many operations
>> > > > defined?
>> > > > 
>> > > > The other message I don't understand is like ":
>> > > > Rolling back scores from "
>> > > > 
>> > > > Could it be a new bug introduced in pacemaker, or could it be
>> > > > some
>> > > > configuration problem (The status is completely clean however)?
>> > > > 
>> > > > According to the package changelog, there was no change since
>> > > > Nov 2016...
>> > > > 
>> > > > Regards,
>> > > > Ulrich
> -- 
> Ken Gaillot 