Re: [ClusterLabs] Resources not starting sometimes after node reboot

2015-10-30 Thread Ken Gaillot
On 10/29/2015 12:42 PM, Pritam Kharat wrote:
> Hi All,
> 
> I have a single node with 5 resources running on it. When I reboot the
> node, I sometimes see the resources in the stopped state even though the
> node comes online.
> 
> Looking into the logs, one difference between the success and failure
> cases is that when
> *Election Trigger (I_DC_TIMEOUT) just popped (2ms)* occurred, the LRM did
> not start the resources; instead it jumped straight to the monitor action
> and from then on never started the resources at all.
> 
> In the success case this election timeout did not appear, and the first
> action taken by the LRM was to start each resource and then monitor it,
> so all the resources started properly.
> 
> I have attached both the success and failure logs. Could someone please
> explain the reason for this issue and how to solve it?
> 
> 
> My CRM configuration is:
> 
> root@sc-node-2:~# crm configure show
> node $id="2" sc-node-2
> primitive oc-fw-agent upstart:oc-fw-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-lb-agent upstart:oc-lb-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-service-manager upstart:oc-service-manager \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive oc-vpn-agent upstart:oc-vpn-agent \
> meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
> op monitor interval="15s" timeout="60s"
> primitive sc_vip ocf:heartbeat:IPaddr2 \
> params ip="200.10.10.188" cidr_netmask="24" nic="eth1" \
> op monitor interval="15s"
> group sc-resources sc_vip oc-service-manager oc-fw-agent oc-lb-agent oc-vpn-agent
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="corosync" \
> stonith-enabled="false" \
> cluster-recheck-interval="3min" \
> default-action-timeout="180s"

The attached logs don't go far enough to be sure what happened; all they
show at that point is that in both cases, the cluster correctly probed
all the resources to be sure they weren't already running.
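
For reference, the probes appear in the logs as monitor operations with
interval 0 (the *_monitor_0 actions). A quick way to pull them out of a
log file, assuming pacemaker logs to the default Ubuntu syslog -- adjust
the path to wherever your logs actually land:

  grep 'monitor_0' /var/log/syslog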

The behavior shouldn't be different depending on the election trigger,
but it's hard to say for sure from this info.
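
If you want to dig into the election timing anyway: the timer behind that
"Election Trigger (I_DC_TIMEOUT) just popped" message is, as far as I know,
driven by the dc-deadtime cluster property (how long the crmd waits for an
existing DC before forcing an election). As a sketch only -- 60s here is an
arbitrary example, not a value your logs suggest:

  crm configure property dc-deadtime=60s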

With a single-node cluster, you should also set no-quorum-policy=ignore.
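
With the crm shell that's a one-liner:

  crm configure property no-quorum-policy=ignore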


[ClusterLabs] Resources not starting sometimes after node reboot

2015-10-29 Thread Pritam Kharat
Hi All,

I have a single node with 5 resources running on it. When I reboot the
node, I sometimes see the resources in the stopped state even though the
node comes online.

Looking into the logs, one difference between the success and failure
cases is that when
*Election Trigger (I_DC_TIMEOUT) just popped (2ms)* occurred, the LRM did
not start the resources; instead it jumped straight to the monitor action
and from then on never started the resources at all.

In the success case this election timeout did not appear, and the first
action taken by the LRM was to start each resource and then monitor it,
so all the resources started properly.

I have attached both the success and failure logs. Could someone please
explain the reason for this issue and how to solve it?


My CRM configuration is:

root@sc-node-2:~# crm configure show
node $id="2" sc-node-2
primitive oc-fw-agent upstart:oc-fw-agent \
meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
op monitor interval="15s" timeout="60s"
primitive oc-lb-agent upstart:oc-lb-agent \
meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
op monitor interval="15s" timeout="60s"
primitive oc-service-manager upstart:oc-service-manager \
meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
op monitor interval="15s" timeout="60s"
primitive oc-vpn-agent upstart:oc-vpn-agent \
meta allow-migrate="true" migration-threshold="5" failure-timeout="120s" \
op monitor interval="15s" timeout="60s"
primitive sc_vip ocf:heartbeat:IPaddr2 \
params ip="200.10.10.188" cidr_netmask="24" nic="eth1" \
op monitor interval="15s"
group sc-resources sc_vip oc-service-manager oc-fw-agent oc-lb-agent oc-vpn-agent
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
cluster-recheck-interval="3min" \
default-action-timeout="180s"


-- 
Thanks and Regards,
Pritam Kharat.
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: do_state_transition: 
State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED 
cause=C_FSA_INTERNAL origin=check_join_state ]
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: crmd_join_phase_log: 
join-1: sc-node-2=integrated
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: do_dc_join_finalize: 
join-1: Syncing our CIB to the rest of the cluster
Oct 29 13:02:15 [1016] sc-node-2cib:   notice: corosync_node_name:  
Unable to get node name for nodeid 2
Oct 29 13:02:15 [1016] sc-node-2cib:   notice: get_node_name:   
Defaulting to uname -n for the local corosync node name
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_sync operation for section 'all': OK (rc=0, origin=local/crmd/14, 
version=0.11.0)
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: crm_update_peer_join:
finalize_join_for: Node sc-node-2[2] - join-1 phase 2 -> 3
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_modify operation for section nodes: OK (rc=0, 
origin=local/crmd/15, version=0.11.0)
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: erase_status_tag:
Deleting xpath: //node_state[@uname='sc-node-2']/transient_attributes
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: update_attrd:
Connecting to attrd... 5 retries remaining
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_delete operation for section 
//node_state[@uname='sc-node-2']/transient_attributes: OK (rc=0, 
origin=local/crmd/16, version=0.11.0)
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: crm_update_peer_join:
do_dc_join_ack: Node sc-node-2[2] - join-1 phase 3 -> 4
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: do_dc_join_ack:  join-1: 
Updating node state to member for sc-node-2
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: erase_status_tag:
Deleting xpath: //node_state[@uname='sc-node-2']/lrm
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_delete operation for section 
//node_state[@uname='sc-node-2']/lrm: OK (rc=0, origin=local/crmd/17, 
version=0.11.0)
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_modify operation for section status: OK (rc=0, 
origin=local/crmd/18, version=0.11.1)
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: do_state_transition: 
State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED 
cause=C_FSA_INTERNAL origin=check_join_state ]
Oct 29 13:02:15 [1021] sc-node-2   crmd: info: abort_transition_graph:  
do_te_invoke:151 - Triggered transition abort (complete=1) : Peer Cancelled
Oct 29 13:02:15 [1019] sc-node-2  attrd:   notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Oct 29 13:02:15 [1016] sc-node-2cib: info: cib_process_request: 
Completed cib_modify operation for section nodes: