Hi Ken,

Thanks for the clarification. Now I have another real problem that needs your advice.

The cluster consists of 5 nodes, and one of the nodes had a 1-second network failure, which resulted in one of the VirtualDomain resources starting on two nodes at the same time. The cluster property no-quorum-policy is set to stop.
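
For reference, this is how I confirm the property on the DC (a rough sketch; pcs syntax varies a bit between versions, and crm_attribute is the lower-level Pacemaker tool that queries the same crm_config section):

[root@zs95kj ~]# pcs property show no-quorum-policy
[root@zs95kj ~]# crm_attribute --type crm_config --name no-quorum-policy --query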

At 16:13:34, this happened:
16:13:34 zs95kj attrd[133000]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now lost (was member)
16:13:34 zs95kj corosync[132974]:  [CPG   ] left_list[0] group:pacemakerd\x00, ip:r(0) ip(10.20.93.13) , pid:28721
16:13:34 zs95kj crmd[133002]: warning: No match for shutdown action on 5
16:13:34 zs95kj attrd[133000]: notice: Removing all zs93KLpcs1 attributes for attrd_peer_change_cb
16:13:34 zs95kj corosync[132974]:  [CPG   ] left_list_entries:1
16:13:34 zs95kj crmd[133002]: notice: Stonith/shutdown of zs93KLpcs1 not matched
...
16:13:35 zs95kj attrd[133000]: notice: crm_update_peer_proc: Node zs93KLpcs1[5] - state is now member (was (null))

From the DC:
[root@zs95kj ~]# crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3288.bz2 |grep 110187
 zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1   <---------- This is the baseline where everything works normally

[root@zs95kj ~]# crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3289.bz2 |grep 110187
 zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped   <----------- Here node zs93KLpcs1 lost its network for 1 second, resulting in this state.

[root@zs95kj ~]# crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3290.bz2 |grep 110187
 zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped

[root@zs95kj ~]# crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3291.bz2 |grep 110187
 zs95kjg110187_res      (ocf::heartbeat:VirtualDomain): Stopped
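
If it helps, I can also dump the full set of actions the pengine scheduled for one of these transitions; as I understand crm_simulate(8), --simulate replays the transition and shows the resulting state, and --save-dotfile writes the transition graph out for inspection:

[root@zs95kj ~]# crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3289.bz2 --save-dotfile /tmp/pe-input-3289.dot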


From the DC's pengine log:
16:05:01 zs95kj pengine[133001]: notice: Calculated Transition 238: /var/lib/pacemaker/pengine/pe-input-3288.bz2
...
16:13:41 zs95kj pengine[133001]: notice: Start zs95kjg110187_res#011(zs90kppcs1)
...
16:13:41 zs95kj pengine[133001]: notice: Calculated Transition 239: /var/lib/pacemaker/pengine/pe-input-3289.bz2

From the DC's crmd log:
Sep 9 16:05:25 zs95kj crmd[133002]: notice: Transition 238 (Complete=48, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3288.bz2): Complete
...
Sep 9 16:13:42 zs95kj crmd[133002]: notice: Initiating action 752: start zs95kjg110187_res_start_0 on zs90kppcs1
...
Sep 9 16:13:56 zs95kj crmd[133002]: notice: Transition 241 (Complete=81, Pending=0, Fired=0, Skipped=172, Incomplete=341, Source=/var/lib/pacemaker/pengine/pe-input-3291.bz2): Stopped

Here I do not see any log entries for pe-input-3289.bz2 or pe-input-3290.bz2. Why is this?
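
For what it's worth, this is roughly how I am searching for them (assuming everything is logged to /var/log/messages on this system):

[root@zs95kj ~]# grep -E "pe-input-3289|pe-input-3290" /var/log/messages
[root@zs95kj ~]# grep -E "Transition (239|240)" /var/log/messages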

From the log on zs93KLpcs1, where guest 110187 was running, I do not see any message about stopping this resource after the node lost its connection to the cluster.
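
In case my search is simply missing it, this is what I looked for on zs93KLpcs1 (again assuming /var/log/messages):

[root@zs95kj ~]# grep zs95kjg110187_res /var/log/messages | grep -iE "stop|error"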

Any ideas on where to look for a possible cause?

On 11/3/2016 1:02 AM, Ken Gaillot wrote:
On 11/02/2016 11:17 AM, Niu Sibo wrote:
Hi all,

I have a general question regarding the fencing logic in Pacemaker.

I have set up a three-node cluster with Pacemaker 1.1.13 and the cluster
property no-quorum-policy set to ignore. When two nodes lose the NIC
corosync is running on at the same time, it looks like the two nodes get
fenced one by one, even though I have three fence devices defined, one for
each node.

What should I be expecting in this case?
It's probably coincidence that the fencing happens serially; there is
nothing enforcing that for separate fence devices. There are many steps
in a fencing request, so they can easily take different times to complete.

I noticed that if the node rejoins the cluster before the cluster starts the
fence actions, some resources will get activated on 2 nodes at the
same time. This is really not good if the resource happens to be a
virtual guest.  Thanks for any suggestions.
Since you're ignoring quorum, there's nothing stopping the disconnected
node from starting all resources on its own. It can even fence the other
nodes, unless the downed NIC is used for fencing. From that node's point
of view, it's the other two nodes that are lost.

Quorum is the only solution I know of to prevent that. Fencing will
correct the situation, but it won't prevent it.

See the votequorum(5) man page for various options that can affect how
quorum is calculated. Also, the very latest version of corosync supports
qdevice (a lightweight daemon that runs on a host outside the cluster
strictly for the purposes of quorum).
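
As a rough sketch (option names are from votequorum(5) and corosync-qdevice(8); the values and the qnetd host are only illustrative), the quorum section of corosync.conf could look something like:

quorum {
    provider: corosync_votequorum

    # only become quorate the first time once all nodes have been seen
    wait_for_all: 1

    # recalculate expected votes as nodes leave, so the surviving
    # partition can stay quorate longer
    last_man_standing: 1

    # optional external arbitrator (requires recent corosync plus corosync-qdevice)
    device {
        model: net
        votes: 1
        net {
            host: qnetd.example.com    # hypothetical qnetd host outside the cluster
            algorithm: ffsplit
        }
    }
}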

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


