I have a much clearer idea of the problem you're seeing now, thank you.
Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1?
On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
Hi,
Below you can see my setup and my test, this shows that my cloned resource
with on-fail=block does not recover automatically.
My Setup:
# rpm -aq | grep -i pacemaker
pacemaker-libs-1.1.9-1512.el6.i686
pacemaker-cluster-libs-1.1.9-1512.el6.i686
pacemaker-cli-1.1.9-1512.el6.i686
pacemaker-1.1.9-1512.el6.i686
# crm configure show
node CSE-1
node CSE-2
primitive d_tomcat ocf:ntc:tomcat \
op monitor interval=15s timeout=510s on-fail=block \
op start interval=0 timeout=510s \
params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \
meta migration-threshold=1
primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
group svc-cse ip_19 ip_11
clone cl_tomcat d_tomcat
colocation colo_tomcat inf: svc-cse cl_tomcat
order order_tomcat inf: cl_tomcat svc-cse
property $id=cib-bootstrap-options \
dc-version=1.1.9-1512.el6-2a917dd \
cluster-infrastructure=cman \
pe-warn-series-max=9 \
no-quorum-policy=ignore \
stonith-enabled=false \
pe-input-series-max=9 \
pe-error-series-max=9 \
last-lrm-refresh=1367582088
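For context on the configuration above: with on-fail=block, a failed monitor leaves that clone instance in place but unmanaged, and Pacemaker will not act on it again until the failure is cleared by hand. A sketch of the usual manual recovery, using the same crmsh tooling as elsewhere in this thread (resource name taken from the configuration above):

```shell
# Clear the recorded failure so the policy engine manages d_tomcat again
crm resource cleanup d_tomcat

# Confirm the clone instance is managed and started again
crm resource status
```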
Currently only 1 node is available, CSE-1.
This is how I am currently testing my setup:
= Starting point: Everything up and running
# crm resource status
Resource Group: svc-cse
    ip_19  (ocf::heartbeat:IPaddr2):  Started
    ip_11  (ocf::heartbeat:IPaddr2):  Started
Clone Set: cl_tomcat [d_tomcat]
    Started: [ CSE-1 ]
    Stopped: [ d_tomcat:1 ]
= Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log)
# crm resource status
Resource Group: svc-cse
    ip_19  (ocf::heartbeat:IPaddr2):  Stopped
    ip_11  (ocf::heartbeat:IPaddr2):  Stopped
Clone Set: cl_tomcat [d_tomcat]
    d_tomcat:0  (ocf::ntc:tomcat):  Started (unmanaged) FAILED
    Stopped: [ d_tomcat:1 ]
= Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log)
# crm resource status
Resource Group: svc-cse
    ip_19  (ocf::heartbeat:IPaddr2):  Stopped
    ip_11  (ocf::heartbeat:IPaddr2):  Stopped
Clone Set: cl_tomcat [d_tomcat]
    d_tomcat:0  (ocf::ntc:tomcat):  Started (unmanaged) FAILED
    Stopped: [ d_tomcat:1 ]
As you can see in the logs, the OCF script no longer returns a failure. Pacemaker
notices this, but it is not reflected in crm_mon and the dependent resources are
not started.
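When the agent starts returning success again but crm_mon still shows the instance as FAILED (unmanaged), the blocked failure and fail counts can be inspected with a one-shot status (sketch, standard crm_mon flags):

```shell
# One-shot status including inactive resources (-r) and fail counts (-f)
crm_mon -1rf
```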
Gr.
Johan
On 2013-05-03 03:04, Andrew Beekhof wrote:
On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
On 2013-05-01 05:48, Andrew Beekhof wrote:
On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
Hi All,
I'm trying to setup a specific configuration in our cluster, however I'm
struggling with my configuration.
This is what I'm trying to achieve:
On both nodes of the cluster a daemon must be running (tomcat).
Some failover addresses are configured and must be running on the node
with a correctly running tomcat.
I have achieved this with a cloned tomcat resource and a colocation
between the cloned tomcat and the failover addresses.
When I cause a failure in the tomcat on the node running the failover
addresses, the failover addresses will failover to the other node as
expected.
crm_mon shows that this tomcat has a failure.
When I configure the tomcat resource with failure-timeout=0, the failure
alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
All sounds right so far.
If my broken tomcat is automatically fixed, I expect pacemaker to notice
and that node to become able to run my failover addresses again;
however, I don't see this happening.
This is very hard to discuss without seeing logs.
So you created a tomcat error, waited for pacemaker to notice, fixed the
error, and observed that pacemaker did not re-notice?
How long did you wait? More than the 15s repeat interval I assume? Did at
least the resource agent notice?
When I configure the tomcat resource with failure-timeout=30, the failure
alarm in crm_mon is cleared after 30 seconds, even though tomcat still
has a failure.
Can you define "still having a failure"?
You mean it still shows up in crm_mon?
Have you read this link?
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
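The behaviour that link describes can be sketched as a toy model (a hypothetical helper, not Pacemaker code): a failure becomes eligible to expire at fail time plus failure-timeout, but it is only actually cleared the next time the policy engine runs, which, absent other cluster events, is governed by cluster-recheck-interval (15 minutes by default):

```python
import math

def failure_clear_time(fail_time, failure_timeout, recheck_interval):
    """Model when an expired failure is actually cleared.

    Hypothetical sketch: Pacemaker only expires old failures when the
    policy engine runs, and with no other cluster events that happens
    every cluster-recheck-interval seconds.
    """
    if failure_timeout <= 0:
        return None  # failure-timeout=0 means failures never expire
    eligible = fail_time + failure_timeout
    # first recheck tick at or after the eligibility point
    return math.ceil(eligible / recheck_interval) * recheck_interval
```

So with failure-timeout=30 and the default recheck interval, the alarm can persist far longer than 30 seconds; in practice any cluster event also triggers a policy-engine run, so it may clear sooner.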
Still having a