[ClusterLabs] [questionnaire] Do you overload pacemaker's meta-attributes to track your own data?

2018-06-28 Thread Jan Pokorný
Hello, and since it has been a month since the previous attempt to gather
some feedback, welcome to yet another simple set of questions that
I will be glad to have answered by as many of you as possible,
as an auxiliary indicator of what is and is not generally acceptable
within the user base.

This time I need to introduce the context of the questions, since that's
important, and I am sorry it's rather long (feel free to skip ahead
to the same original indentation level if you are pressed for time):

  As you've surely heard when in touch with pacemaker, there's
  a level of declarative annotations for resources (whether primitive
  or otherwise), their operations and a few other entities.  You'll
  find which ones (which identifiers in variable assignments emulated
  with identifier + value pairs) can effectively be applied in which
  context in the documentation[1] -- these are understood by
  pacemaker and fed into the resource allocation equations.

  Perhaps less known is the fact that these sets are open to possibly
  foreign, user-defined assignments that may effectively overload the
  primary role of meta-attributes, dragging user-defined semantics
  into them.  There may be warnings about doing so from the high-level
  management tools, but pacemaker won't protest by design, as this
  is also what allows for smooth configuration reuse across various
  point releases that may assign new meanings to new identifiers.

  This possibility of free-form consumer extensibility doesn't appear
  to be advertised anywhere (perhaps to prevent people from confusing
  CIB, the configuration hierarchy, with a generic key-value store,
  which it rather is not), and within the pacemaker configuration
  realm it wasn't useful until it became an optional point of interest
  in location constraints, thanks to the ability to refer to
  meta-attributes in the respective rules via the "value-source"
  indirection[2], which arrived with pacemaker 1.1.17.
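
  To illustrate, here is a minimal, made-up sketch of that indirection --
  the resource, attribute and ID names are purely hypothetical, not taken
  from any real configuration.  A user-defined meta-attribute is attached
  to a resource, and a location constraint rule dereferences it via
  value-source="meta", matching nodes whose "tier" node attribute equals
  that meta-attribute's value:

    <primitive id="rscA" class="ocf" provider="pacemaker" type="Dummy">
      <meta_attributes id="rscA-meta">
        <!-- user-defined meta-attribute, not interpreted by pacemaker -->
        <nvpair id="rscA-meta-app-tier" name="app-tier" value="gold"/>
      </meta_attributes>
    </primitive>

    <rsc_location id="loc-rscA-by-tier" rsc="rscA">
      <rule id="loc-rscA-by-tier-rule" score="INFINITY">
        <!-- value-source="meta" means value="app-tier" names the resource's
             meta-attribute, whose value ("gold") is compared against the
             node attribute "tier" -->
        <expression id="loc-rscA-by-tier-expr" attribute="tier"
                    operation="eq" value="app-tier" value-source="meta"/>
      </rule>
    </rsc_location>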

  More experienced users/developers (this is intentionally sent to both
  lists) may already suspect potential namespace collisions between
  a narrow but possibly growing set of identifiers claimed by pacemaker
  for its own (and here original) purpose, and those that are added
  by users, either to appear in the mentioned constraint rules
  or for some other, possibly external, automation-related purpose.

  So, I've figured out that with the upcoming 2.0 release, we have a nice
  opportunity to start doing something about that, and the least-effort,
  fully backward- and tooling-compatible way to start getting us to
  a conflict-free situation is, in my opinion, to start actively
  pushing for a lexical cut, asking for a special prefix/naming
  convention for the mentioned custom additions.
  
  This initiative is meant to consist of two steps:
  
  a. modify the documentation to expressly detail said lexical
     requirement
     - you can read the draft of my change as a pull request for pacemaker:
       https://github.com/ClusterLabs/pacemaker/pull/1523/files
       (warning: the respective discussion was somewhat heated,
       and is not a subject of examination nor of special interest
       here); basically I suggest "x-*" naming, with the full
       recommended convention being "x-appname_identifier"
       (see the sketch after this two-step list)
  
  b. add a warning to the logs/standard error output (daemons/CLI)
     when an identifier is neither recognized as one of pacemaker's
     claimed identifiers nor starts with the dedicated prefix(es),
     possibly referring to the documentation stanza from point a.,
     in a similar way to how the user gets notified that no fencing
     devices were configured
     - this would need to be coded
     - note that this way, you would actually get warned about
       your own typos in meta-attribute identifiers even
       if you are not using any high-level tooling
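
  To make point a. concrete, a custom meta-attribute following the
  suggested convention could look like this in the CIB -- the application
  name "backupd" and its identifier are made up purely for illustration:

    <meta_attributes id="rscB-meta">
      <!-- pacemaker-claimed identifier, unchanged -->
      <nvpair id="rscB-meta-target-role" name="target-role" value="Started"/>
      <!-- custom addition following the "x-appname_identifier" convention -->
      <nvpair id="rscB-meta-x-backupd_window" name="x-backupd_window"
              value="02:00-04:00"/>
    </meta_attributes>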

  This may remain the final status quo, or the eventual separation
  of the identifiers may make it really easy to perform other
  schema-upgrade-related steps _safely_ with future major schema
  version bumps.  Nobody is immediately forced into anything, although
  the above points should make it clear it's prudent to get ready
  (e.g. also regarding any custom tooling around this) with respect
  to future major pacemaker/schema version bumps and the respective
  auto-upgrades of the configuration (say it were declared valid
  to upgrade to pacemaker 3.0 only from pacemaker 2.0 or newer --
  that's the justification for acting _now_ and slowly preparing
  sane grounds).

* * *

So now the promised questions; just send a reply where you [x] tick
your selections for the questions below, possibly with some more
commentary on the topic, and preferably on-list (a single list of your
choice is enough):

1. In your cluster configurations, do you carry meta-attributes
   other than those recognized by pacemaker?

   [ ] no

   [ ] yes (if so, can you specify whether for said constraint
       rules, as a way to permanently attach some kind of
       administrative piec

Re: [ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 09:09 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  schrieb am 27.06.2018 um
> > > > 16:18 in Nachricht
> 
> <1530109097.6452.1.ca...@redhat.com>:
> > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > > > Ken Gaillot  schrieb am 26.06.2018 um
> > > > > > 18:22 in Nachricht
> > > 
> > > <1530030128.5202.5.ca...@redhat.com>:
> > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > Hi!
> > > > > > 
> > > > > > We just observed some strange effect we cannot explain in
> > > > > > SLES
> > > > > > 11
> > > > > > SP4 (pacemaker 1.1.12-f47ea56):
> > > > > > We run about a dozen of Xen PVMs on a three-node cluster
> > > > > > (plus
> > > > > > some
> > > > > > infrastructure and monitoring stuff). It worked all well so
> > > > > > far,
> > > > > > and there was no significant change recently.
> > > > > > However when a colleague stopped one VM for maintenance via
> > > > > > cluster
> > > > > > command, the cluster did not notice when the PVM actually
> > > > > > was
> > > > > > running again (it had been started not using the cluster (a
> > > > > > bad
> > > > > > idea, I know)).
> > > > > 
> > > > > To be on a safe side in such cases you'd probably want to
> > > > > enable 
> > > > > additional monitor for a "Stopped" role. Default one covers
> > > > > only 
> > > > > "Started" role. The same thing as for multistate resources,
> > > > > where
> > > > > you 
> > > > > need several monitor ops, for "Started/Slave" and "Master"
> > > > > roles.
> > > > > But, this will increase a load.
> > > > > And, I believe cluster should reprobe a resource on all nodes
> > > > > once
> > > > > you 
> > > > > change target-role back to "Started".
> > > > 
> > > > Which raises the question, how did you stop the VM initially?
> > > 
> > > I thought "(...) stopped one VM for maintenance via cluster
> > > command"
> > > is obvious. It was something like "crm resource stop ...".
> > > 
> > > > 
> > > > If you stopped it by setting target-role to Stopped, likely the
> > > > cluster
> > > > still thinks it's stopped, and you need to set it to Started
> > > > again.
> > > > If
> > > > instead you set maintenance mode or unmanaged the resource,
> > > > then
> > > > stopped the VM manually, then most likely it's still in that
> > > > mode
> > > > and
> > > > needs to be taken out of it.
> > > 
> > > The point was when the command to start the resource was given,
> > > the
> > > cluster had completely ignored the fact that it was running
> > > already
> > > and started to start the VM on a second node (which may be
> > > disastrous). But that's leading away from the main question...
> > 
> > Ah, this is expected behavior when you start a resource manually,
> > and
> > there are no monitors with target-role=Stopped. If the node where
> > you
> > manually started the VM isn't the same node the cluster happens to
> > choose, then you can get multiple active instances.
> > 
> > By default, the cluster assumes that where a probe found a resource
> > to
> > be not running, that resource will stay not running unless started
> > by
> > the cluster. (It will re-probe if the node goes away and comes
> > back.)
> 
> But didn't this behavior change? I thought it was different maybe a
> year ago or so.

Not that I know of. We have fixed some issues around probes, especially
around probing Pacemaker Remote connections and the resources running
on those nodes, and around ordering of various actions with probes.

> > If you wish to guard against resources being started outside
> > cluster
> > control, configure a recurring monitor with target-role=Stopped,
> > and
> > the cluster will run that on all nodes where it thinks the resource
> > is
> > not supposed to be running. Of course since it has to poll at
> > intervals, it can take up to that much time to detect a manually
> > started instance.
> 
> Did monitor roles exist always, or were those added some time ago?

They've always been around. Stopped is not commonly used, but separate
monitors for Master and Slave roles are commonplace.
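
As a rough sketch (the resource, IDs and intervals here are made up, and
required instance attributes are omitted), such monitors are declared per
operation in the CIB; note that each recurring monitor on a resource needs
its own distinct interval, and multistate resources would use role="Master"
and role="Slave" the same way:

  <primitive id="vm1" class="ocf" provider="heartbeat" type="Xen">
    <operations>
      <!-- regular monitor while the resource is supposed to be running -->
      <op id="vm1-monitor-10m" name="monitor" interval="10min" timeout="60s"/>
      <!-- guard monitor on nodes where the resource is supposed to be
           stopped, to catch instances started outside cluster control -->
      <op id="vm1-monitor-stopped" name="monitor" interval="30min"
          timeout="60s" role="Stopped"/>
    </operations>
  </primitive>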

> > 
> > > > > > Examining the logs, it seems that the recheck timer popped
> > > > > > periodically, but no monitor action was run for the VM (the
> > > > > > action
> > > > > > is configured to run every 10 minutes).
> > 
> > Recurring monitors are only recorded in the log if their return
> > value
> > changed. If there are 10 successful monitors in a row and then a
> > failure, only the first success and the failure are logged.
> 
> OK, didn't know that.
> 
> 
> Thanks a lot for the explanations!
> 
> Regards,
> Ulrich
> > 
> > > > > > 
> > > > > > Actually the only monitor operations found were:
> > > > > > May 23 08:04:13
> > > > > > Jun 13 08:13:03
> > > > > > Jun 25 09:29:04
> > > > > > Then a manual "reprobe" was done, and several monitor
> > > > > > operations
> > > > > > were run.
> > > > > > Then again I see no more monitor actions in syslog.
> > > > > 

Re: [ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 09:13 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  schrieb am 27.06.2018 um
> > > > 16:32 in Nachricht
> 
> <1530109926.6452.3.ca...@redhat.com>:
> > On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
> > > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > > > > Ken Gaillot  schrieb am 26.06.2018
> > > > > > > um
> > > > > > > 18:22 in Nachricht
> > > > 
> > > > <1530030128.5202.5.ca...@redhat.com>:
> > > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > > Hi!
> > > > > > > 
> > > > > > > We just observed some strange effect we cannot explain in
> > > > > > > SLES
> > > > > > > 11
> > > > > > > SP4 (pacemaker 1.1.12-f47ea56):
> > > > > > > We run about a dozen of Xen PVMs on a three-node cluster
> > > > > > > (plus
> > > > > > > some
> > > > > > > infrastructure and monitoring stuff). It worked all well
> > > > > > > so
> > > > > > > far,
> > > > > > > and there was no significant change recently.
> > > > > > > However when a colleague stopped one VM for maintenance
> > > > > > > via
> > > > > > > cluster
> > > > > > > command, the cluster did not notice when the PVM actually
> > > > > > > was
> > > > > > > running again (it had been started not using the cluster
> > > > > > > (a
> > > > > > > bad
> > > > > > > idea, I know)).
> > > > > > 
> > > > > > To be on a safe side in such cases you'd probably want to
> > > > > > enable 
> > > > > > additional monitor for a "Stopped" role. Default one covers
> > > > > > only 
> > > > > > "Started" role. The same thing as for multistate resources,
> > > > > > where
> > > > > > you 
> > > > > > need several monitor ops, for "Started/Slave" and "Master"
> > > > > > roles.
> > > > > > But, this will increase a load.
> > > > > > And, I believe cluster should reprobe a resource on all
> > > > > > nodes
> > > > > > once
> > > > > > you 
> > > > > > change target-role back to "Started".
> > > > > 
> > > > > Which raises the question, how did you stop the VM initially?
> > > > 
> > > > I thought "(...) stopped one VM for maintenance via cluster
> > > > command"
> > > > is obvious. It was something like "crm resource stop ...".
> > > > 
> > > > > 
> > > > > If you stopped it by setting target-role to Stopped, likely
> > > > > the
> > > > > cluster
> > > > > still thinks it's stopped, and you need to set it to Started
> > > > > again.
> > > > > If
> > > > > instead you set maintenance mode or unmanaged the resource,
> > > > > then
> > > > > stopped the VM manually, then most likely it's still in that
> > > > > mode
> > > > > and
> > > > > needs to be taken out of it.
> > > > 
> > > > The point was when the command to start the resource was given,
> > > > the
> > > > cluster had completely ignored the fact that it was running
> > > > already
> > > > and started to start the VM on a second node (which may be
> > > > disastrous). But that's leading away from the main question...
> > > 
> > > Ah, this is expected behavior when you start a resource manually,
> > > and
> > > there are no monitors with target-role=Stopped. If the node where
> > > you
> > > manually started the VM isn't the same node the cluster happens
> > > to
> > > choose, then you can get multiple active instances.
> > > 
> > > By default, the cluster assumes that where a probe found a
> > > resource
> > > to
> > > be not running, that resource will stay not running unless
> > > started by
> > > the cluster. (It will re-probe if the node goes away and comes
> > > back.)
> > > 
> > > If you wish to guard against resources being started outside
> > > cluster
> > > control, configure a recurring monitor with target-role=Stopped,
> > > and
> > > the cluster will run that on all nodes where it thinks the
> > > resource
> > > is
> > > not supposed to be running. Of course since it has to poll at
> > > intervals, it can take up to that much time to detect a manually
> > > started instance.
> > 
> > Alternatively, if you don't want the overhead of a recurring
> > monitor
> > but want to be able to address known manual starts yourself, you
> > can
> > force a full reprobe of the resource with "crm_resource -r
> > <resource id> --refresh".
> > 
> > If you do it before starting the resource via crm, the cluster will
> > stop the manually started instance, and then you can start it via
> > the
> > crm; if you do it after starting the resource via crm, there will
> > still
> > likely be two active instances, and the cluster will stop both and
> > start one again.
> > 
> > A way around that would be to unmanage the resource, start the
> > resource
> > via crm (which won't actually start anything due to being
> > unmanaged,
> > but will tell the cluster it's supposed to be started), force a
> > reprobe, then manage the resource again -- that should prevent
> > multiple
> > active. However if the cluster prefers a different node, it may
> > still
> > stop the resource and start it in its preferred location.
> > (Stickines

Re: [ClusterLabs] Pacemaker not restarting Resource on same node

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 19:58 +0300, Andrei Borzenkov wrote:
> 28.06.2018 18:35, Dileep V Nair wrote:
> > 
> > 
> > Hi,
> > 
> > I have a cluster with DB2 running in HADR mode. I have used the
> > db2
> > resource agent. My problem is whenever DB2 fails on primary it is
> > migrating
> > to the secondary node. Ideally it should restart thrice (Migration
> > Threshold set to 3) but not happening. This is causing extra
> > downtime for
> > customer. Is there any other settings / parameters which needs to
> > be set.
> > Did anyone face similar issue ? I am on pacemaker version 1.1.15-
> > 21.1.
> > 
> 
> It is impossible to answer without good knowledge of application and
> resource agent. From quick look at resource agent, it removes master
> score from current node if database failure is detected which means
> current node will not be eligible for fail-over.
> 
> Note that pacemaker does not really have concept of "restarting
> resource
> on the same node". Every time it performs full node selection using
> current scores. It usually happens to be "same node" simply due to
> non-zero resource stickiness by default. You could attempt to adjust
> stickiness so that final score will be larger than master score on
> standby. But that also needs agent cooperation - are you sure agent
> will
> even attempt to restart failed master locally?

Also, some types of errors cannot be recovered by a restart on the same
node.

For example, by default, start failures will not be retried on the same
node (see the cluster property start-failure-is-fatal), to avoid a
repeatedly failing start preventing the cluster from doing anything
else. Certain OCF resource agent exit codes are considered "hard"
errors that prevent retrying on the same node: missing dependencies,
file permission errors, etc.
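
As a rough illustration of where those knobs live in the CIB (the IDs are
made up and the db2 instance/operation details are omitted),
start-failure-is-fatal is a cluster property while migration-threshold is
a per-resource meta-attribute:

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <!-- let failed starts be retried on the same node, counted against
           migration-threshold, instead of being immediately fatal -->
      <nvpair id="opt-sfif" name="start-failure-is-fatal" value="false"/>
    </cluster_property_set>
  </crm_config>

  <primitive id="db2_hadr" class="ocf" provider="heartbeat" type="db2">
    <meta_attributes id="db2_hadr-meta">
      <!-- tolerate up to 3 failures on a node before moving the resource -->
      <nvpair id="db2_hadr-meta-mt" name="migration-threshold" value="3"/>
    </meta_attributes>
  </primitive>
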
-- 
Ken Gaillot 


Re: [ClusterLabs] Install fresh pacemaker + corosync fails

2018-06-28 Thread Ken Gaillot
On Thu, 2018-06-28 at 17:17 +0200, Salvatore D'angelo wrote:
> Hi All,
> 
> I am here again. I am still fighting against upgrade problems but now
> I am trying to change the approach.
> I want now to try to install fresh a new version Corosync and
> Postgres to have it working.
> For the moment I am not interested to a specific configuration, just
> three nodes where I can run a dummy resource as in this tutorial.
> 
> I prefer to download specific version of the packages but I am ok to
> whatever new version for now.
> I followed the following procedure:
> https://wiki.clusterlabs.org/wiki/SourceInstall
> 
> but this procedure fails the compilation. If I want to compile from
> source it’s not clear what are the dependencies. 
> I started from a scratch Ubuntu 14.04 (I only configured ssh to
> connect to the machines).
> 
> For libqb I had to install with apt-get install the following
> dependencies:
> autoconf
> libtool 
> 
> Corosync compilation failed to the step 
> ./autogen.sh && ./configure --prefix=$PREFIX 
> with the following error:
> 
> checking for knet... no
> configure: error: Package requirements (libknet) were not met:
> 
> No package 'libknet' found
> 
> Consider adjusting the PKG_CONFIG_PATH environment variable if you
> installed software in a non-standard prefix.
> 
> I tried to install with apt-get the libraries libknet1 and libknet-
> dev but they were not found. Tried to download the source code of
> this library here:
> https://github.com/kronosnet/kronosnet
> 
> but the ./autogen.sh && ./configure --prefix=$PREFIX step failed too
> with this error:
> 
> configure: error: Package requirements (liblz4) were not met:
> No package 'liblz4’ found
> 
> I installed liblz4 and liblz4-dev but problem still occurs.
> 
> I am going around in circle here. I am asking if someone tested the
> install procedure on Ubuntu 14.04 and can give me the exact steps to
> install fresh pacemaker 1.1.18 (or later) with corosync 2.4.4 (or
> later).
> Thanks in advance for help.
> 

Pacemaker's dependencies are listed at:

https://github.com/ClusterLabs/pacemaker/blob/master/INSTALL.md

Some of the package names may be different on Ubuntu. Unless you are
looking for specific features, I'd go with whatever stock packages are
available for libqb and corosync.

knet will be supported by corosync 3 and is bleeding-edge at the moment
(though probably solid). If you do want to compile libqb and/or
corosync, the guide on the wiki grabs the latest master branch of the
various projects using git; I'd recommend downloading source tarballs
of the latest official releases instead, as they will be more stable.
-- 
Ken Gaillot 


Re: [ClusterLabs] Pacemaker not restarting Resource on same node

2018-06-28 Thread Andrei Borzenkov
28.06.2018 18:35, Dileep V Nair wrote:
> 
> 
> Hi,
> 
>   I have a cluster with DB2 running in HADR mode. I have used the db2
> resource agent. My problem is whenever DB2 fails on primary it is migrating
> to the secondary node. Ideally it should restart thrice (Migration
> Threshold set to 3) but not happening. This is causing extra downtime for
> customer. Is there any other settings / parameters which needs to be set.
> Did anyone face similar issue ? I am on pacemaker version 1.1.15-21.1.
> 

It is impossible to answer without good knowledge of the application and
the resource agent. From a quick look at the resource agent, it removes the
master score from the current node if a database failure is detected, which
means the current node will not be eligible for fail-over.

Note that pacemaker does not really have a concept of "restarting a resource
on the same node". Every time, it performs a full node selection using the
current scores. The result usually happens to be the "same node" simply due
to the non-zero resource stickiness by default. You could attempt to adjust
the stickiness so that the final score is larger than the master score on
the standby. But that also needs the agent's cooperation - are you sure the
agent will even attempt to restart a failed master locally?
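
For what it's worth, a minimal sketch of the stickiness knob (the value 100
is arbitrary, and whether it ends up outweighing the agent's master score
depends entirely on the agent): a cluster-wide default can be set via
rsc_defaults,

  <rsc_defaults>
    <meta_attributes id="rsc-defaults-options">
      <!-- default stickiness applied to every resource unless overridden -->
      <nvpair id="rsc-defaults-stickiness" name="resource-stickiness"
              value="100"/>
    </meta_attributes>
  </rsc_defaults>

or overridden per resource with a resource-stickiness meta-attribute.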


[ClusterLabs] Pacemaker not restarting Resource on same node

2018-06-28 Thread Dileep V Nair


Hi,

I have a cluster with DB2 running in HADR mode, using the db2
resource agent. My problem is that whenever DB2 fails on the primary, it
migrates to the secondary node. Ideally it should restart three times
(migration-threshold set to 3), but that is not happening. This is causing
extra downtime for the customer. Are there any other settings/parameters
which need to be set? Did anyone face a similar issue? I am on pacemaker
version 1.1.15-21.1.

Dileep V Nair

dilen...@in.ibm.com

IBM Services


[ClusterLabs] Install fresh pacemaker + corosync fails

2018-06-28 Thread Salvatore D'angelo
Hi All,

I am here again. I am still fighting against upgrade problems, but now I am
trying to change the approach.
I now want to try a fresh install of a new version of Corosync and Postgres
and have it working.
For the moment I am not interested in a specific configuration, just three
nodes where I can run a dummy resource as in this tutorial.

I would prefer to download specific versions of the packages, but I am OK
with whatever new version for now.
I followed the following procedure:
https://wiki.clusterlabs.org/wiki/SourceInstall 


but this procedure fails during compilation. If I want to compile from
source, it's not clear what the dependencies are.
I started from a fresh Ubuntu 14.04 (I only configured ssh to connect to the
machines).

For libqb I had to install the following dependencies with apt-get install:
autoconf
libtool 

Corosync compilation failed at the step
./autogen.sh && ./configure --prefix=$PREFIX 
with the following error:

checking for knet... no
configure: error: Package requirements (libknet) were not met:

No package 'libknet' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

I tried to install the libknet1 and libknet-dev libraries with apt-get, but
they were not found. Then I tried to download the source code of this library here:
https://github.com/kronosnet/kronosnet 

but the ./autogen.sh && ./configure --prefix=$PREFIX step failed too with this 
error:

configure: error: Package requirements (liblz4) were not met:
No package 'liblz4' found

I installed liblz4 and liblz4-dev but the problem still occurs.

I am going around in circles here. I am asking whether someone has tested the
install procedure on Ubuntu 14.04 and can give me the exact steps for a fresh
install of pacemaker 1.1.18 (or later) with corosync 2.4.4 (or later).
Thanks in advance for the help.








[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ulrich Windl
>>> Ken Gaillot  schrieb am 27.06.2018 um 16:32 in 
>>> Nachricht
<1530109926.6452.3.ca...@redhat.com>:
> On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
>> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > > Ken Gaillot  schrieb am 26.06.2018 um
>> > > > > 18:22 in Nachricht
>> > 
>> > <1530030128.5202.5.ca...@redhat.com>:
>> > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > > Hi!
>> > > > > 
>> > > > > We just observed some strange effect we cannot explain in
>> > > > > SLES
>> > > > > 11
>> > > > > SP4 (pacemaker 1.1.12-f47ea56):
>> > > > > We run about a dozen of Xen PVMs on a three-node cluster
>> > > > > (plus
>> > > > > some
>> > > > > infrastructure and monitoring stuff). It worked all well so
>> > > > > far,
>> > > > > and there was no significant change recently.
>> > > > > However when a colleague stopped one VM for maintenance via
>> > > > > cluster
>> > > > > command, the cluster did not notice when the PVM actually was
>> > > > > running again (it had been started not using the cluster (a
>> > > > > bad
>> > > > > idea, I know)).
>> > > > 
>> > > > To be on a safe side in such cases you'd probably want to
>> > > > enable 
>> > > > additional monitor for a "Stopped" role. Default one covers
>> > > > only 
>> > > > "Started" role. The same thing as for multistate resources,
>> > > > where
>> > > > you 
>> > > > need several monitor ops, for "Started/Slave" and "Master"
>> > > > roles.
>> > > > But, this will increase a load.
>> > > > And, I believe cluster should reprobe a resource on all nodes
>> > > > once
>> > > > you 
>> > > > change target-role back to "Started".
>> > > 
>> > > Which raises the question, how did you stop the VM initially?
>> > 
>> > I thought "(...) stopped one VM for maintenance via cluster
>> > command"
>> > is obvious. It was something like "crm resource stop ...".
>> > 
>> > > 
>> > > If you stopped it by setting target-role to Stopped, likely the
>> > > cluster
>> > > still thinks it's stopped, and you need to set it to Started
>> > > again.
>> > > If
>> > > instead you set maintenance mode or unmanaged the resource, then
>> > > stopped the VM manually, then most likely it's still in that mode
>> > > and
>> > > needs to be taken out of it.
>> > 
>> > The point was when the command to start the resource was given, the
>> > cluster had completely ignored the fact that it was running already
>> > and started to start the VM on a second node (which may be
>> > disastrous). But that's leading away from the main question...
>> 
>> Ah, this is expected behavior when you start a resource manually, and
>> there are no monitors with target-role=Stopped. If the node where you
>> manually started the VM isn't the same node the cluster happens to
>> choose, then you can get multiple active instances.
>> 
>> By default, the cluster assumes that where a probe found a resource
>> to
>> be not running, that resource will stay not running unless started by
>> the cluster. (It will re-probe if the node goes away and comes back.)
>> 
>> If you wish to guard against resources being started outside cluster
>> control, configure a recurring monitor with target-role=Stopped, and
>> the cluster will run that on all nodes where it thinks the resource
>> is
>> not supposed to be running. Of course since it has to poll at
>> intervals, it can take up to that much time to detect a manually
>> started instance.
> 
> Alternatively, if you don't want the overhead of a recurring monitor
> but want to be able to address known manual starts yourself, you can
> force a full reprobe of the resource with "crm_resource -r <resource id> --refresh".
> 
> If you do it before starting the resource via crm, the cluster will
> stop the manually started instance, and then you can start it via the
> crm; if you do it after starting the resource via crm, there will still
> likely be two active instances, and the cluster will stop both and
> start one again.
> 
> A way around that would be to unmanage the resource, start the resource
> via crm (which won't actually start anything due to being unmanaged,
> but will tell the cluster it's supposed to be started), force a
> reprobe, then manage the resource again -- that should prevent multiple
> active. However if the cluster prefers a different node, it may still
> stop the resource and start it in its preferred location. (Stickiness
> could get around that.)

Hi!

Thanks again for that. There's one question that comes to my mind: what is the
purpose of the cluster recheck interval? I thought it was exactly that: finding
resources that are not in the state they should be.

Regards,
Ulrich


> 
>> 
>> > > > > Examining the logs, it seems that the recheck timer popped
>> > > > > periodically, but no monitor action was run for the VM (the
>> > > > > action
>> > > > > is configured to run every 10 minutes).
>> 
>> Recurring monitors are only recorded in the log if their return value
>> change

[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

2018-06-28 Thread Ulrich Windl
>>> Ken Gaillot  schrieb am 27.06.2018 um 16:18 in 
>>> Nachricht
<1530109097.6452.1.ca...@redhat.com>:
> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > Ken Gaillot  schrieb am 26.06.2018 um
>> > > > 18:22 in Nachricht
>> 
>> <1530030128.5202.5.ca...@redhat.com>:
>> > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > Hi!
>> > > > 
>> > > > We just observed some strange effect we cannot explain in SLES
>> > > > 11
>> > > > SP4 (pacemaker 1.1.12-f47ea56):
>> > > > We run about a dozen of Xen PVMs on a three-node cluster (plus
>> > > > some
>> > > > infrastructure and monitoring stuff). It worked all well so
>> > > > far,
>> > > > and there was no significant change recently.
>> > > > However when a colleague stopped one VM for maintenance via
>> > > > cluster
>> > > > command, the cluster did not notice when the PVM actually was
>> > > > running again (it had been started not using the cluster (a bad
>> > > > idea, I know)).
>> > > 
>> > > To be on a safe side in such cases you'd probably want to enable 
>> > > additional monitor for a "Stopped" role. Default one covers only 
>> > > "Started" role. The same thing as for multistate resources, where
>> > > you 
>> > > need several monitor ops, for "Started/Slave" and "Master" roles.
>> > > But, this will increase a load.
>> > > And, I believe cluster should reprobe a resource on all nodes
>> > > once
>> > > you 
>> > > change target-role back to "Started".
>> > 
>> > Which raises the question, how did you stop the VM initially?
>> 
>> I thought "(...) stopped one VM for maintenance via cluster command"
>> is obvious. It was something like "crm resource stop ...".
>> 
>> > 
>> > If you stopped it by setting target-role to Stopped, likely the
>> > cluster
>> > still thinks it's stopped, and you need to set it to Started again.
>> > If
>> > instead you set maintenance mode or unmanaged the resource, then
>> > stopped the VM manually, then most likely it's still in that mode
>> > and
>> > needs to be taken out of it.
>> 
>> The point was when the command to start the resource was given, the
>> cluster had completely ignored the fact that it was running already
>> and started to start the VM on a second node (which may be
>> disastrous). But that's leading away from the main question...
> 
> Ah, this is expected behavior when you start a resource manually, and
> there are no monitors with target-role=Stopped. If the node where you
> manually started the VM isn't the same node the cluster happens to
> choose, then you can get multiple active instances.
> 
> By default, the cluster assumes that where a probe found a resource to
> be not running, that resource will stay not running unless started by
> the cluster. (It will re-probe if the node goes away and comes back.)

But didn't this behavior change? I thought it was different maybe a year ago
or so.

> 
> If you wish to guard against resources being started outside cluster
> control, configure a recurring monitor with target-role=Stopped, and
> the cluster will run that on all nodes where it thinks the resource is
> not supposed to be running. Of course since it has to poll at
> intervals, it can take up to that much time to detect a manually
> started instance.

Did monitor roles exist always, or were those added some time ago?

> 
>> > > > Examining the logs, it seems that the recheck timer popped
>> > > > periodically, but no monitor action was run for the VM (the
>> > > > action
>> > > > is configured to run every 10 minutes).
> 
> Recurring monitors are only recorded in the log if their return value
> changed. If there are 10 successful monitors in a row and then a
> failure, only the first success and the failure are logged.

OK, didn't know that.


Thanks a lot for the explanations!

Regards,
Ulrich
> 
>> > > > 
>> > > > Actually the only monitor operations found were:
>> > > > May 23 08:04:13
>> > > > Jun 13 08:13:03
>> > > > Jun 25 09:29:04
>> > > > Then a manual "reprobe" was done, and several monitor
>> > > > operations
>> > > > were run.
>> > > > Then again I see no more monitor actions in syslog.
>> > > > 
>> > > > What could be the reasons for this? Too many operations
>> > > > defined?
>> > > > 
>> > > > The other message I don't understand is like ":
>> > > > Rolling back scores from "
>> > > > 
>> > > > Could it be a new bug introduced in pacemaker, or could it be
>> > > > some
>> > > > configuration problem (The status is completely clean however)?
>> > > > 
>> > > > According to the package changelog, there was no change since
>> > > > Nov
>> > > > 2016...
>> > > > 
>> > > > Regards,
>> > > > Ulrich
> -- 
> Ken Gaillot 