Hi,

I have run into a strange situation and would like to ask whether it is a
bug, a misconfiguration, or intended behavior.

A disconnected node does not detect that it has been cut off from the
cluster and takes no action to stop its resources, even though the resource
agents report errors when monitored; the only visible effect is that the
number of processes (from hung resource agent invocations) keeps growing.

It seems that Pacemaker ignores timeouts when trying to update the CIB.

The situation is caused by corosync not detecting the loss of quorum,
because the firewall blocks the loopback interface (lo). As far as I can
tell, this prevents corosync from detecting any problems with the cluster,
and once lo access is restored everything recovers, but shouldn't Pacemaker
detect that the CIB service is unreachable and do something about it? Is
there perhaps a configuration parameter to control this?
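
For reference, this is how I check the cluster state from the node's point
of view (both tools ship with corosync 2.x; just the commands I would look
at, not a claim about this particular failure):

# Quorum and membership as seen by votequorum (corosync 2.x):
corosync-quorumtool -s

# Status of the TOTEM ring(s), i.e. whether corosync considers them healthy:
corosync-cfgtool -s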

Technical details:

1)
1.1) machine: Amazon Linux: Linux ... 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
1.2) Pacemaker: 1.1.9-1512.el6
1.3) corosync: Corosync Cluster Engine, version '2.3.2'


2) Network: basic setup, just ethX and lo

3) iptables:
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT
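
For completeness: the standard fix would be to allow loopback traffic ahead
of the DROP rules, which should let corosync talk to itself again (a sketch
of the usual rules, not something I have applied in this setup):

# Allow all traffic on the loopback interface (must precede the -j DROP rules):
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT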

4) crm config:
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
    <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
    <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
    <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="true"/>
    <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/>
    <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-1512.el6-2a917dd"/>
    <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
  </cluster_property_set>
</crm_config>
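
As a side note, a single property such as no-quorum-policy can be inspected
or changed at runtime with crm_attribute (a sketch using the stock Pacemaker
CLI; the property names are the ones shown above):

# Query the current cluster-wide no-quorum-policy:
crm_attribute --type crm_config --name no-quorum-policy --query

# Change it at runtime (e.g. back to "stop"):
crm_attribute --type crm_config --name no-quorum-policy --update stop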


5) Example resource config:
    <primitive class="ocf" id="dbx_ready_nodes" provider="dbxcl" type="ready.ocf.sh">
      <instance_attributes id="dbx_ready_nodes-instance_attributes">
        <nvpair id="dbx_ready_nodes-instance_attributes-dbxclrole" name="dbxclrole" value="&apos;&apos;"/>
      </instance_attributes>
      <operations>
        <op id="dbx_ready_nodes-start-timeout-1min-on-fail-stop" interval="0s" name="start" on-fail="stop" timeout="1min"/>
        <op id="dbx_ready_nodes-stop-timeout-8min" interval="0s" name="stop" timeout="8min"/>
        <op id="dbx_ready_nodes-monitor-interval-83s" interval="83s" name="monitor" on-fail="stop" timeout="60s"/>
        <op id="dbx_ready_nodes-validate-all-interval-29s" interval="29s" name="validate-all" on-fail="stop" timeout="60s"/>
      </operations>
    </primitive>
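
For context, ready.ocf.sh follows the usual OCF exit-code convention, so the
rc=1 in the logs below is OCF_ERR_GENERIC, which combined with on-fail="stop"
should make Pacemaker stop the resource. A minimal sketch of the monitor
shape (hypothetical; not the actual agent):

#!/bin/sh
# Hypothetical OCF-style agent skeleton, not the real ready.ocf.sh.
# OCF exit codes: 0 = OCF_SUCCESS, 1 = OCF_ERR_GENERIC, 7 = OCF_NOT_RUNNING.
case "$1" in
monitor)
    # A real agent checks its service here; an unexpected failure returns
    # OCF_ERR_GENERIC (1), which is exactly what lrmd logs below as rc=1.
    health_check_cmd || exit 1    # hypothetical check command
    exit 0
    ;;
*)
    exit 3                        # OCF_ERR_UNIMPLEMENTED
    ;;
esac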


6) Logs:
Below, the monitor action of resource "dbx_ready_nodes" returns an error,
but nothing happens: the resource is never asked to stop (even though it
should be, given the on-fail="stop" setting shown above).

May 02 20:04:13 [16191] ip-10-116-169-85       lrmd:    debug: operation_finished:      dbx_ready_nodes_monitor_83000:8669 - exited with rc=1
May 02 20:04:13 [16191] ip-10-116-169-85       lrmd:    debug: log_finished:    finished - rsc:dbx_ready_nodes action:monitor call_id:142 pid:8669 exit-code:1 exec-time:0ms queue-time:0ms
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
May 02 20:04:13 [16154] ip-10-116-169-85 corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
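
For reference, this is where I would expect the failure to become visible
and be acted upon (stock Pacemaker tooling, flags as in the 1.1.x crm_mon):

# One-shot cluster status including fail counts and recent operations:
crm_mon -1 -f -o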


Thanks in advance

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation