Re: [Pacemaker] failure handling on a cloned resource

2013-05-06 Thread Andrew Beekhof
I have a much clearer idea of the problem you're seeing now, thank you.

Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
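
(For anyone reading along in the archive: a pe-input file can usually be replayed offline with crm_simulate to see what the policy engine decided at that moment. A minimal sketch, assuming the file has been copied somewhere readable:

# bunzip2 -k pe-input-1.bz2
# crm_simulate -S -x pe-input-1

Here -x points crm_simulate at the saved CIB snapshot and -S simulates the transition it would produce.)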

On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote:

 Hi,
 
 Below you can see my setup and my test; it shows that my cloned resource
 with on-fail=block does not recover automatically.
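 
 (A hedged aside: with on-fail=block the cluster deliberately stops managing the failed instance, so it will not recover on its own; the usual way to hand control back is a manual cleanup. A rough sketch, using the resource names from the configuration below:
 
 # crm resource cleanup cl_tomcat
 # crm_resource --cleanup --resource d_tomcat   # equivalent lower-level form
 
 After the cleanup the failed operation history is cleared and the policy engine re-probes the resource.)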
 
 My Setup:
 
 # rpm -aq | grep -i pacemaker
 pacemaker-libs-1.1.9-1512.el6.i686
 pacemaker-cluster-libs-1.1.9-1512.el6.i686
 pacemaker-cli-1.1.9-1512.el6.i686
 pacemaker-1.1.9-1512.el6.i686
 
 # crm configure show
 node CSE-1
 node CSE-2
 primitive d_tomcat ocf:ntc:tomcat \
op monitor interval=15s timeout=510s on-fail=block \
op start interval=0 timeout=510s \
params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \
meta migration-threshold=1
 primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
 primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
 group svc-cse ip_19 ip_11
 clone cl_tomcat d_tomcat
 colocation colo_tomcat inf: svc-cse cl_tomcat
 order order_tomcat inf: cl_tomcat svc-cse
 property $id=cib-bootstrap-options \
dc-version=1.1.9-1512.el6-2a917dd \
cluster-infrastructure=cman \
pe-warn-series-max=9 \
no-quorum-policy=ignore \
stonith-enabled=false \
pe-input-series-max=9 \
pe-error-series-max=9 \
last-lrm-refresh=1367582088
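 
 (A quick, hedged sanity check for a configuration like the one above is to validate the live CIB with crm_verify; a minimal sketch:
 
 # crm_verify -L -V
 
 -L checks the live cluster configuration and -V raises verbosity so warnings are printed as well as errors.)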
 
 Currently only 1 node is available, CSE-1.
 
 
 This is how I am currently testing my setup:
 
 = Starting point: Everything up and running
 
 # crm resource status
 Resource Group: svc-cse
     ip_19      (ocf::heartbeat:IPaddr2):       Started
     ip_11      (ocf::heartbeat:IPaddr2):       Started
 Clone Set: cl_tomcat [d_tomcat]
     Started: [ CSE-1 ]
     Stopped: [ d_tomcat:1 ]
 
 = Causing failure: Change system so tomcat is running but has a failure (in 
 attachment step_2.log)
 
 # crm resource status
 Resource Group: svc-cse
     ip_19      (ocf::heartbeat:IPaddr2):       Stopped
     ip_11      (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: cl_tomcat [d_tomcat]
     d_tomcat:0 (ocf::ntc:tomcat):      Started (unmanaged) FAILED
     Stopped: [ d_tomcat:1 ]
 
 = Fixing failure: Revert system so tomcat is running without failure (in 
 attachment step_3.log)
 
 # crm resource status
 Resource Group: svc-cse
     ip_19      (ocf::heartbeat:IPaddr2):       Stopped
     ip_11      (ocf::heartbeat:IPaddr2):       Stopped
 Clone Set: cl_tomcat [d_tomcat]
     d_tomcat:0 (ocf::ntc:tomcat):      Started (unmanaged) FAILED
     Stopped: [ d_tomcat:1 ]
 
 As you can see in the logs, the OCF script no longer returns any failure.
 Pacemaker notices this; however, it is not reflected in crm_mon and the
 dependent resources are not started.
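 
 (A hedged aside: what keeps crm_mon showing the old state is the recorded fail count and the failed operation in the resource's history, which can be inspected directly; a rough sketch:
 
 # crm_mon -1 -r -f
 
 -1 prints a one-shot status, -r includes inactive resources and -f shows per-resource fail counts. Until that history is cleared, or a failure-timeout expires, the dependent group stays down.)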
 
 Gr.
 Johan
 
 On 2013-05-03 03:04, Andrew Beekhof wrote:
 On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 On 2013-05-01 05:48, Andrew Beekhof wrote:
 On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 Hi All,
 
 I'm trying to set up a specific configuration in our cluster; however, I'm
 struggling with my configuration.
 
 This is what I'm trying to achieve:
 On both nodes of the cluster a daemon must be running (tomcat).
 Some failover addresses are configured and must be running on the node 
 with a correctly running tomcat.
 
 I have achieved this with a cloned tomcat resource and a colocation
 between the cloned tomcat and the failover addresses.
 When I cause a failure in the tomcat on the node running the failover
 addresses, the failover addresses will fail over to the other node as
 expected.
 crm_mon shows that this tomcat has a failure.
 When I configure the tomcat resource with failure-timeout=0, the failure 
 alarm in crm_mon isn't cleared whenever the tomcat failure is fixed.
 All sounds right so far.
 If my broken tomcat is automatically fixed, I expect pacemaker to notice this
 and that node to become able to run my failover addresses again;
 however, I don't see this happening.
 This is very hard to discuss without seeing logs.
 
 So you created a tomcat error, waited for pacemaker to notice, fixed the
 error and observed that pacemaker did not re-notice?
 How long did you wait? More than the 15s repeat interval, I assume?  Did at
 least the resource agent notice?
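 
 (One way to answer that last question, independent of the cluster, is to run the agent's monitor action by hand with the same parameters; a rough sketch, assuming the agent tree lives under /usr/lib/ocf and reusing the parameter names from the configuration above:
 
 # export OCF_ROOT=/usr/lib/ocf
 # OCF_RESKEY_instance_name=NMS OCF_RESKEY_monitor_use_ssl=no \
     OCF_RESKEY_monitor_urls=/cse/health OCF_RESKEY_monitor_timeout=120 \
     /usr/lib/ocf/resource.d/ntc/tomcat monitor; echo "rc=$?"
 
 An exit code of 0 means running, 7 means not running, anything else is an error.)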
 
 When I configure the tomcat resource with failure-timeout=30, the failure
 alarm in crm_mon is cleared after 30 seconds, however the tomcat is still
 having a failure.
 Can you define "still having a failure"?
 You mean it still shows up in crm_mon?
 Have you read this link?

 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
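 
 (The short version of that section: an expired failure-timeout is only acted upon the next time the policy engine runs, which by default happens every cluster-recheck-interval, i.e. 15 minutes, unless some other cluster event triggers a run first. A hedged sketch of shortening that interval:
 
 # crm configure property cluster-recheck-interval=60s
 
 With a shorter recheck interval, expired failures are picked up sooner.)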
 Still having a 

Re: [Pacemaker] Corosync 2.3 dies randomly

2013-05-06 Thread Andrew Beekhof

On 06/05/2013, at 3:27 AM, Robert Parsons rpars...@tappublishing.com wrote:

 
 I'm trying to build out a web farm cluster using Corosync/Pacemaker. I 
 started with the stock versions in Ubuntu 12.04 but did not have a lot of 
 success. I removed the corosync (1.x) and pacemaker packages and built 
 Corosync 2.3 and Pacemaker 1.1.9 from source. It generally seems to run 
 better but I am having big issues with Corosync. I have 14 completely 
 identical nodes. They differ only in ip address and host name. Periodically, 
 corosync will fail to start up on boot. It's not consistent and happens 
 randomly. What's even worse, once the cluster is up Corosync will 
 occasionally die for no apparent reason. There are no errors logged. 
 Nothing. The process simply disappears, taking the node offline.
 
 My cluster has zero stability thanks to this Corosync issue. Anyone got any 
 ideas?

Corosync has a blackbox - did you interrogate that too?
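
For reference, a rough sketch of pulling it (command names as shipped with corosync 2.x/libqb; the on-disk path may differ between builds):

# corosync-blackbox                     # trigger and decode a flight-data dump from a running corosync
# qb-blackbox /var/lib/corosync/fdata   # decode an on-disk dump, if one was written when the process died

The second form is the interesting one once the process has already disappeared.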


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org