[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-25 Thread Samarth Jain
Hi,


We have a bunch of resources running in a master/slave configuration,
with one master and one slave instance running at any given time.

What we observe is that, for any two given resources, if resource
Stateful_Test_1 is in the middle of a promote that takes a significant
amount of time to complete (close to 150 seconds in our scenario, e.g.
starting a web server), and during this time resource Stateful_Test_2's
master instance fails, then the failure of the Stateful_Test_2 master is
never honored by the pengine, and the recurring monitor keeps failing
without any action being taken by the DC.
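
To give an idea, such a slow promote boils down to something like this
in the resource agent (a simplified sketch only; the function name,
sleep duration and Stateful_Test_1.log file name are illustrative, not
taken verbatim from the attached agent):

stateful_test_1_promote() {
    echo "$(date) Inside promote for Stateful_Test_1!" >> /root/Stateful_Test_1.log
    sleep 150    # simulate a slow promote, e.g. a web server starting up
    echo "$(date) Promote for Stateful_Test_1 completed!" >> /root/Stateful_Test_1.log
    return 0     # OCF_SUCCESS
}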

We see the logs below for the failure of Stateful_Test_2 on the DC,
which was VM-3 at that time:

Feb 25 11:28:13 [6013] VM-3   crmd:   notice: abort_transition_graph:
Transition aborted by operation Stateful_Test_2_monitor_17000 'create'
on VM-1: Old event | magic=0:9;329:8:8:4a2b407e-ad15-43d0-8248-e70f9f22436b
cib=0.191.5 source=process_graph_event:498 complete=false

As per our current testing, the Stateful_Test_2 resource has failed 590
times and it still continues to fail, without the failure ever being
processed by Pacemaker. We have to intervene manually to recover it by
doing a resource restart.
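
The manual recovery is simply a restart of the failed resource, for
example:

# crm resource restart Stateful_Test_2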

Could you please help me understand:
1. Why doesn't Pacemaker process the failure of the Stateful_Test_2
resource immediately after the first failure?
2. Why does the monitor failure of Stateful_Test_2 continue even after
the promote of Stateful_Test_1 has completed? Shouldn't Pacemaker handle
Stateful_Test_2's failure and take the necessary action? It feels as if
that particular failure 'event' has been 'dropped' and the pengine is
not even aware of Stateful_Test_2's failure.

It's pretty straightforward to reproduce this issue.
I have attached the two dummy resource agents we used to simulate our
scenario, along with the commands used to configure the resources and
ban them on the other VMs in the cluster; an illustrative ban command is
shown below.
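
For example, banning Stateful_Test_1 from one of the other VMs looks
roughly like this (the constraint name here is illustrative; the exact
commands are in the attachment):

# crm configure location Stateful_Test_1_ban_VM-2 Stateful_Test_1 -inf: VM-2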

Note: We have intentionally kept the monitor intervals low, contrary to
the usual suggestions, since we want failure detection to be faster as
these resources are critical to our component.

Once the resources are configured, you need to perform the following
two steps to reproduce the problem:
1. crm resource restart Stateful_Test_1
2. In another session, on whichever node Stateful_Test_2 is running as
master, delete the marker file that is checked as part of
Stateful_Test_2's master monitor.
In our case that was VM-1, so I deleted the marker from there:
[root@VM-1 ~]
# rm -f /root/stateful_Test_2_Marker
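
In essence, the master monitor of the dummy agent does something like
the following check (a simplified sketch; the marker path and log line
match the output above, while the function name is illustrative):

stateful_test_2_monitor_master() {
    if [ ! -f /root/stateful_Test_2_Marker ]; then
        echo "$(date) Master monitor failed for Stateful_Test_2. Returning 9" >> /root/Stateful_Test_2.log
        return 9    # OCF_FAILED_MASTER: the master instance has failed
    fi
    return 8        # OCF_RUNNING_MASTER: healthy master
}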

Now if you check the logs in /root, you will see that Stateful_Test_2
prints failure logs back to back. Below is a sample from our current
Stateful_Test_2.log file:
# cat /root/Stateful_Test_2.log
Mon Feb 25 11:00:29 IST 2019 Inside promote for Stateful_Test_2!
Mon Feb 25 11:00:34 IST 2019 Promote for Stateful_Test_2 completed!
Mon Feb 25 11:28:13 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:28:30 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:28:47 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:04 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:21 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:38 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:29:55 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:12 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:29 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 11:30:46 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
.
.
.
Mon Feb 25 14:15:08 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:25 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:42 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9
Mon Feb 25 14:15:59 IST 2019 Master monitor failed for Stateful_Test_2. Returning 9

# pacemakerd --version
Pacemaker 1.1.18
Written by Andrew Beekhof

# corosync -v
Corosync Cluster Engine, version '2.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.

Below is my cluster configuration:

node 1: VM-0
node 2: VM-1
node 3: VM-2
node 4: VM-3
node 5: VM-4
node 6: VM-5
primitive Stateful_Test_1 ocf:pacemaker:Stateful_Test_1 \
op start timeout=200s interval=0 \
op promote timeout=300s interval=0 \
op monitor interval=15s role=Master timeout=30s \
op monitor interval=20s role=Slave timeout=30s \
op stop on-fail=restart interval=0 \
meta resource-stickiness=100 migration-threshold=1 failure-timeout=15s
primitive Stateful_Test_2 ocf:pacemaker:Stateful_Test_2 \
op start timeout=200s interval=0 \
op promote timeout=300s interval=0 \
op monitor interval=17s role=Master timeout=30s \
op monitor interval=25s role=Slave

[ClusterLabs] Monitor being called repeatedly for Master/Slave resource despite monitor failure

2018-02-19 Thread Samarth Jain
Hi,


I have configured a wildfly resource in master/slave mode on a 6-VM
cluster with stonith disabled and no-quorum-policy set to ignore.

We are observing that, on either a master or a slave resource failure,
Pacemaker keeps calling stateful_monitor for wildfly repeatedly, despite
us returning the appropriate failure return codes from the monitor for
both master (rc=OCF_MASTER_FAILED) and slave (rc=OCF_NOT_RUNNING).

This continues until the failure-timeout is reached, after which the
resource gets demoted and stopped in the case of a master monitor
failure, or stopped in the case of a slave monitor failure.
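
While this is happening, the accumulating fail count for the resource
can be checked with a one-shot crm_mon that displays fail counts, for
example:

# crm_mon -1 --failcounts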

# pacemakerd --version
Pacemaker 1.1.16
Written by Andrew Beekhof

# corosync -v
Corosync Cluster Engine, version '2.4.2'
Copyright (c) 2006-2009 Red Hat, Inc.

Below is my configuration:

node 1: VM-0
node 2: VM-1
node 3: VM-2
node 4: VM-3
node 5: VM-4
node 6: VM-5
primitive stateful_wildfly ocf:pacemaker:wildfly \
op start timeout=200s interval=0 \
op promote timeout=300s interval=0 \
op monitor interval=90s role=Master timeout=90s \
op monitor interval=80s role=Slave timeout=100s \
meta resource-stickiness=100 migration-threshold=3 failure-timeout=240s
ms wildfly_MS stateful_wildfly \
location stateful_wildfly_rule_2 wildfly_MS \
rule -inf: #uname eq VM-2
location stateful_wildfly_rule_3 wildfly_MS \
rule -inf: #uname eq VM-3
location stateful_wildfly_rule_4 wildfly_MS \
rule -inf: #uname eq VM-4
location stateful_wildfly_rule_5 wildfly_MS \
rule -inf: #uname eq VM-5
property cib-bootstrap-options: \
stonith-enabled=false \
no-quorum-policy=ignore \
cluster-recheck-interval=30s \
start-failure-is-fatal=false \
stop-all-resources=false \
have-watchdog=false \
dc-version=1.1.16-94ff4df51a \
cluster-infrastructure=corosync \
cluster-name=hacluster-0

Could you please help us understand this behavior and how to fix it?


Thanks!
Samarth J