----- Original Message -----
> Greetings,
>
> We are using pacemaker and cman in a two-node cluster with
> no-quorum-policy: ignore and stonith-enabled: false on a CentOS 6 system
> (pacemaker-related RPM versions are listed below). We are seeing some
> bizarre (to us) behavior when a node is fully lost (e.g. reboot -nf).
> Here's the scenario we have:
>
> 1) Fail a resource named "some-resource" started with the
>    ocf:heartbeat:anything script (or others) on node01 (in our case, it's a
>    master/slave resource we're pulling observations from, but it can happen
>    on normal ones).
> 2) Wait for the resource to recover.
> 3) Fail node02 (reboot -nf, or power loss).
> 4) When node02 recovers, we see in /var/log/messages:
>    - Quorum is recovered
>    - Sending flush op to all hosts for master-some-resource,
>      last-failure-some-resource, probe_complete(true),
>      fail-count-some-resource(1)
>    - pengine Processing failed op monitor for some-resource on node01:
>      unknown error (1)
>
> * After adding a simple "`date` called with $@ >> /tmp/log.rsc", we do not
>   see the resource agent being called at this time, on either node.
> * Sometimes, we see other operations happen that are also not being sent to
>   the RA, including stop/start.
> * The resource is actually happily running on node01 throughout this whole
>   process, so there's no reason we should be seeing this failure here.
> * This issue is only seen on resources that had not yet been cleaned up.
>   Resources that were 'clean' when both nodes were last online do not have
>   this issue.
>
> We noticed this originally because we are using the ClusterMon RA to report
> on different types of errors, and this is giving us false positives. Any
> thoughts on configuration issues we could be having, or if this sounds like
> a bug in pacemaker somewhere?
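For anyone trying to reproduce the tracing step quoted above, here is a minimal sketch of such a logging line placed at the top of a resource agent script. The function name trace_call and the RA_TRACE_LOG variable are hypothetical, introduced only for illustration; the log path /tmp/log.rsc and the message format are taken from the report. This is not the poster's actual agent code.

```shell
#!/bin/sh
# Hypothetical tracing helper for a resource agent script.
# RA_TRACE_LOG is an assumed override; the default path is the one in the report.
RA_TRACE_LOG="${RA_TRACE_LOG:-/tmp/log.rsc}"

trace_call() {
    # Append a timestamp and the full argument list the agent was invoked with.
    echo "$(date) called with $*" >> "$RA_TRACE_LOG"
}

# Run this before the agent dispatches on its action argument
# (start, stop, monitor, ...), so every invocation leaves one line in the log.
trace_call "$@"
```

If no new line appears in the log at the moment pengine reports the failed monitor op, the agent was not invoked for that operation.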
This is likely a bug in whatever resource-agent you are using. There's no way for us to know for sure without logs.

-- Vossel

>
> Thanks!
>
> ----
> Versions:
> ccs-0.16.2-69.el6_5.1.x86_64
> clusterlib-3.0.12.1-59.el6_5.2.x86_64
> cman-3.0.12.1-59.el6_5.2.x86_64
> corosync-1.4.1-17.el6_5.1.x86_64
> corosynclib-1.4.1-17.el6_5.1.x86_64
> fence-virt-0.2.3-15.el6.x86_64
> libqb-0.16.0-2.el6.x86_64
> modcluster-0.16.2-28.el6.x86_64
> openais-1.1.1-7.el6.x86_64
> openaislib-1.1.1-7.el6.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pcs-0.9.90-2.el6.centos.3.noarch
> resource-agents-3.9.2-40.el6_5.7.x86_64
> ricci-0.16.2-69.el6_5.1.x86_64
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>