Hi,

I have a strange situation and would like to ask whether it is a bug, a misconfiguration, or intended behavior.
A disconnected node does not detect that it is lost and takes no action to stop its resources, even though the resource agents report errors when monitored. The only visible effect is that the number of processes (hung resource agents) keeps growing. It looks as if Pacemaker ignores timeouts when trying to update the CIB.

The situation is caused by corosync not detecting the lost quorum because the firewall blocks lo. As far as I can tell, this prevents corosync from detecting problems with the cluster, and once lo access is restored everything should be fine. But shouldn't Pacemaker detect that the CIB service is lost and do something about it? Maybe there is a configuration parameter to control this?

Technical details:

1)
1.1) machine: Amazon Linux:
     Linux ... 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
1.2) Pacemaker: Pacemaker 1.1.9-1512.el6
1.3) corosync: Corosync Cluster Engine, version '2.3.2'

2) Net: basic: ethx, lo

3) iptables:

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT

4) crm config:

<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
    <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
    <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
    <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="true"/>
    <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/>
    <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-1512.el6-2a917dd"/>
    <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
  </cluster_property_set>
</crm_config>

5) Example resource config:

<primitive class="ocf" id="dbx_ready_nodes" provider="dbxcl" type=" ready.ocf.sh">
  <instance_attributes id="dbx_ready_nodes-instance_attributes">
    <nvpair id="dbx_ready_nodes-instance_attributes-dbxclrole" name="dbxclrole" value="''"/>
  </instance_attributes>
  <operations>
    <op id="dbx_ready_nodes-start-timeout-1min-on-fail-stop" interval="0s" name="start" on-fail="stop" timeout="1min"/>
    <op id="dbx_ready_nodes-stop-timeout-8min" interval="0s" name="stop" timeout="8min"/>
    <op id="dbx_ready_nodes-monitor-interval-83s" interval="83s" name="monitor" on-fail="stop" timeout="60s"/>
    <op id="dbx_ready_nodes-validate-all-interval-29s" interval="29s" name="validate-all" on-fail="stop" timeout="60s"/>
  </operations>
</primitive>

6) Logs:

Below, the monitor action of the resource "dbx_ready_nodes" returns an error, but nothing happens: the resource is not requested to stop, even though it should be (see on-fail="stop" in the resource config above).

May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: operation_finished: dbx_ready_nodes_monitor_83000:8669 - exited with rc=1
May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: log_finished: finished - rsc:dbx_ready_nodes action:monitor call_id:142 pid:8669 exit-code:1 exec-time:0ms queue-time:0ms
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

Thanks in advance

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
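
P.S. For reference, "lo access restored" would presumably mean a rule set along the following lines. This is only a sketch and not something verified on this cluster: the two loopback rules are an assumption about what is needed, everything else is copied from the rules in 3), and <my_machine> remains a placeholder. The loopback rules have to come before the DROP rules, since iptables evaluates a chain in order.

```
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Assumed addition: accept everything arriving on the loopback interface
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
# Assumed addition: accept everything leaving through the loopback interface
-A OUTPUT -o lo -j ACCEPT
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT
```

With these rules the node is still cut off from its peers (only SSH from <my_machine> passes), so this should only change what happens locally over lo.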
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org