Re: [ClusterLabs] Pacemaker quorum behavior
On 28/09/16 16:30 -0400, Scott Greenlese wrote:
> Also, I have tried simulating a failed cluster node (to trigger a
> STONITH action) by killing the corosync daemon on one node, but all
> that does is respawn the daemon ... causing a temporary / transient
> failure condition, and no fence takes place. Is there a way to
> kill corosync in such a way that it stays down? Is there a best
> practice for STONITH testing?

This makes me seriously wonder what could cause this involuntary
daemon-scoped high availability... Are you sure you are using the
upstream-provided initscript/unit file? (Just hope there's no
fence_corosync_restart.)

-- 
Jan (Poki)

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
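A few common ways to make such a failure persistent for fence testing. This is an editorial sketch, not from the thread; it assumes a systemd-based node where something (e.g. a `Restart=` directive in a non-upstream unit file) is respawning the daemon, and uses the stock unit names.

```shell
# Sketch: three ways to simulate a corosync failure that "sticks" long
# enough for STONITH to fire. Adapt to your init system.

# 1. Prevent any respawn first, then kill the daemon hard
#    (unmask again after the test):
#      systemctl mask corosync && killall -9 corosync

# 2. Crash the whole node rather than one daemon -- closest to a real
#    failure; the node stays down until fenced and rebooted:
#      echo c > /proc/sysrq-trigger

# 3. Freeze corosync instead of killing it: peers see the totem token
#    lost, but a stopped (not dead) process is never respawned.
simulate_corosync_failure() {
    kill -STOP "$(pidof corosync)"
    # resume later with: kill -CONT "$(pidof corosync)"
}
```

Option 2 is the usual recommendation for STONITH testing because it exercises the whole fencing path rather than a single daemon restart.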
Re: [ClusterLabs] Pacemaker quorum behavior
Dne 29.9.2016 v 00:14 Ken Gaillot napsal(a):
> On 09/28/2016 03:57 PM, Scott Greenlese wrote:
>> A quick addendum...
>>
>> After sending this post, I decided to stop pacemaker on the single,
>> Online node in the cluster, and this effectively killed the corosync
>> daemon:
>>
>> [root@zs93kl VD]# date;pcs cluster stop
>> Wed Sep 28 16:39:22 EDT 2016
>> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
>
> Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.
>
>> [root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
>> Wed Sep 28 16:46:19 EDT 2016
>
> Totally irrelevant, but a little trick I picked up somewhere: when
> grepping for a process, square-bracketing a character lets you avoid
> the "grep -v", e.g. "ps -ef | grep cor[o]". It's nice when I remember
> to use it ;)
>
>> [root@zs93kl VD]#
>>
>> Next, I went to a node in "Pending" state, and sure enough... the pcs
>> cluster stop killed the daemon there, too:
>>
>> [root@zs95kj VD]# date;pcs cluster stop
>> Wed Sep 28 16:48:15 EDT 2016
>> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
>>
>> [root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
>> Wed Sep 28 16:48:38 EDT 2016
>> [root@zs95kj VD]#
>>
>> So, this answers my own question... cluster stop should kill corosync.
>> So why is the `pcs cluster stop --all` failing to kill corosync?
>
> It should. At least you've narrowed it down :)

This is a bug in pcs. Thanks for spotting it and providing a detailed
description. I filed the bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=1380372

Regards,
Tomas

>> Thanks...
>>
>> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
From: Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed
Date: 09/28/2016 04:30 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

Hi folks..

I have some follow-up questions about corosync daemon status after
cluster shutdown. Basically, what should happen to corosync on a
cluster node when pacemaker is shut down on that node? On my 5-node
cluster, when I do a global shutdown, the pacemaker processes exit,
but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution. I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online
nodes. With one cluster node (KVM host) powered off, I did the global
cluster stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note: The "No route to host" messages are expected because that node /
LPAR is powered down. (I don't show it here, but the corosync daemon
is still running on the 4 active nodes. I do show it later.)

I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
have quorum when it comes up and activates pacemaker, which is enabled
to autostart at boot time on all 5 cluster nodes. At this point, only
1 out of 5 nodes should be Online to the cluster, and therefore ... no
quorum.

I log in to zs93KLpcs1, and pcs status shows those 4 nodes as
'pending' Online, and "partition with quorum":

Corosync determines quorum, pacemaker just uses it.
If corosync is running, the node contributes to quorum.

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016
Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.

Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep -v grep"; done
Wed Sep 28 15:22:21 EDT 201
Re: [ClusterLabs] Pacemaker quorum behavior
On 09/28/2016 03:57 PM, Scott Greenlese wrote:
> A quick addendum...
>
> After sending this post, I decided to stop pacemaker on the single,
> Online node in the cluster,
> and this effectively killed the corosync daemon:
>
> [root@zs93kl VD]# date;pcs cluster stop
> Wed Sep 28 16:39:22 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.

> [root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
> Wed Sep 28 16:46:19 EDT 2016

Totally irrelevant, but a little trick I picked up somewhere: when
grepping for a process, square-bracketing a character lets you avoid
the "grep -v", e.g. "ps -ef | grep cor[o]". It's nice when I remember
to use it ;)

> [root@zs93kl VD]#
>
> Next, I went to a node in "Pending" state, and sure enough... the pcs
> cluster stop killed the daemon there, too:
>
> [root@zs95kj VD]# date;pcs cluster stop
> Wed Sep 28 16:48:15 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
>
> [root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
> Wed Sep 28 16:48:38 EDT 2016
> [root@zs95kj VD]#
>
> So, this answers my own question... cluster stop should kill corosync.
> So why is the `pcs cluster stop --all` failing to kill corosync?

It should. At least you've narrowed it down :)

> Thanks...
>
> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
>
> From: Scott Greenlese/Poughkeepsie/IBM
> To: kgail...@redhat.com, Cluster Labs - All topics related to
> open-source clustering welcomed
> Date: 09/28/2016 04:30 PM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>
> Hi folks..
>
> I have some follow-up questions about corosync daemon status after
> cluster shutdown.
>
> Basically, what should happen to corosync on a cluster node when
> pacemaker is shutdown on that node?
> On my 5 node cluster, when I do a global shutdown, the pacemaker
> processes exit, but corosync processes remain active.
>
> Here's an example of where this led me into some trouble...
>
> My cluster is still configured to use the "symmetric" resource
> distribution. I don't have any location constraints in place, so
> pacemaker tries to evenly distribute resources across all Online nodes.
>
> With one cluster node (KVM host) powered off, I did the global cluster
> stop:
>
> [root@zs90KP VD]# date;pcs cluster stop --all
> Wed Sep 28 15:07:40 EDT 2016
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
> zs90kppcs1: Stopping Cluster (pacemaker)...
> zs95KLpcs1: Stopping Cluster (pacemaker)...
> zs95kjpcs1: Stopping Cluster (pacemaker)...
> zs93kjpcs1: Stopping Cluster (pacemaker)...
> Error: unable to stop all nodes
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
>
> Note: The "No route to host" messages are expected because that node /
> LPAR is powered down.
>
> (I don't show it here, but the corosync daemon is still running on the 4
> active nodes. I do show it later).
>
> I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
> have quorum when it comes up and activates pacemaker, which is enabled
> to autostart at boot time on all 5 cluster nodes. At this point, only
> 1 out of 5 nodes should be Online to the cluster, and therefore ... no
> quorum.
>
> I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
> Online, and "partition with quorum":

Corosync determines quorum, pacemaker just uses it. If corosync is
running, the node contributes to quorum.
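That distinction can be checked directly: corosync's own view of quorum is available independently of pacemaker via the stock corosync-quorumtool. A small sketch (the example output is hypothetical, for a quorate 5-node cluster):

```shell
# Ask corosync -- not pacemaker -- whether this partition has quorum.
quorum_report() {
    # -s prints a one-shot summary: expected votes, total votes,
    # the quorum threshold, and whether the partition is quorate.
    corosync-quorumtool -s
}

# Hypothetical output on one node of a quorate 5-node cluster:
#   Expected votes:   5
#   Total votes:      5
#   Quorum:           3
#   Flags:            Quorate
```

Running this on each node while pacemaker is stopped would show whether the still-running corosync daemons are what is keeping the partition quorate.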
> [root@zs93kl ~]# date;pcs status |less
> Wed Sep 28 15:25:13 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 28 15:25:13 2016
> Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1
> Stack: corosync
> Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
>
> Node zs90kppcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Node zs95kjpcs1: pending
> Online: [ zs93KLpcs1 ]
>
> Full list of resources:
>
> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started
Re: [ClusterLabs] Pacemaker quorum behavior
A quick addendum...

After sending this post, I decided to stop pacemaker on the single,
Online node in the cluster, and this effectively killed the corosync
daemon:

[root@zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016
[root@zs93kl VD]#

Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root@zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root@zs95kj VD]#

So, this answers my own question... cluster stop should kill corosync.
So why is the `pcs cluster stop --all` failing to kill corosync?

Thanks...

Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com

From: Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed
Date: 09/28/2016 04:30 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

Hi folks..

I have some follow-up questions about corosync daemon status after
cluster shutdown. Basically, what should happen to corosync on a
cluster node when pacemaker is shut down on that node? On my 5-node
cluster, when I do a global shutdown, the pacemaker processes exit,
but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution. I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.
With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note: The "No route to host" messages are expected because that node /
LPAR is powered down. (I don't show it here, but the corosync daemon
is still running on the 4 active nodes. I do show it later.)

I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
have quorum when it comes up and activates pacemaker, which is enabled
to autostart at boot time on all 5 cluster nodes. At this point, only
1 out of 5 nodes should be Online to the cluster, and therefore ... no
quorum.

I log in to zs93KLpcs1, and pcs status shows those 4 nodes as
'pending' Online, and "partition with quorum":

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016
Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.
Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep -v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374 1 0 Sep26 ? 00:10:17 corosync
zs95KL
root 22933 1 0 11:51 ? 00:00:54 corosync
zs95kj
root 19382 1 0 Sep26 ? 00:10:15 corosync
zs93kj
root 129102 1 0 Sep26 ? 00:12:10 corosync
zs93kl
root 21894 1 0 15:19 ? 00:00:00 corosync

But, pacemaker is only running on the one, online node:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd | grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root 23005 1 0 15:19 ? 00:00:00 /usr/sbin/pacemakerd -f
[root@zs95kj VD]#

This situation wreaks havoc on my VirtualDomain resources, as the
majority of them are in FAILED or Stopped state, and to my surprise...
many of them show as Started:

[root@zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110122_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110123_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110125_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110126_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110128_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110129_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110130_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110131_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110132_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110133_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110134_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110135_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110137_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110138_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110140_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110141_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110143_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110144_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110145_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110146_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110149_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110150_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110154_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110155_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110156_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110159_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110160_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110164_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg110165_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
zs95kjg110166_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1

Pacemaker is attempting to activate all VirtualDomain resources on the
one cluster node.

So back to my original question... what should happen when I do a
cluster stop? If it should be deactivating, what would prevent this?

Also, I have tried simulating a failed cluster node (to trigger a
STONITH action) by killing the corosync daemon on one node, but all
that does is respawn the daemon ... causing a temporary / transient
failure condition, and no fence takes place. Is there a way to kill
corosync in such a way that it stays down? Is there a best practice
for STONITH testing?

As usual, thanks in advance for your advice.

Scott Greenlese ... IBM KVM on System Z - Solutions Test,
Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com

From: Ken Gaillot
To: users@clusterlabs.org
Date: 09/09/2016 06:23 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>> From: Klaus Wenninger
>> To: users@clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomain pacemaker-remote nodes
Re: [ClusterLabs] Pacemaker quorum behavior
On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>> From: Klaus Wenninger
>> To: users@clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomain pacemaker-remote nodes
>> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
>> > to be 'symmetric' and I have no
>> > location constraints on the 200 VirtualDomain resources (other than to
>> > prevent the opaque guests
>> > from running on the pacemaker remote node resources). My quorum is set
>> > as:
>> >
>> > quorum {
>> > provider: corosync_votequorum
>> > }
>> >
>> > As an experiment, I powered down one LPAR in the cluster, leaving 4
>> > powered up with the pcsd service up on the 4 survivors
>> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
>> > survivors. I then started pacemaker/corosync on a single cluster
>> >
>>
>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
>> did you check the status of the individual services?
>>
>> Scott's reply:
>>
>> No, I only assumed that pacemaker was down because I got this back on
>> my pcs status
>> command from each cluster node:
>>
>> [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 ; do ssh $host pcs status; done
>> Wed Sep 7 15:49:27 EDT 2016
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node

In my experience, this is sufficient to say that pacemaker and
corosync aren't running.

>>
>> What else should I check? The pcsd.service service was still up,
>> since I didn't stop that
>> anywhere. Should I have done, ps -ef |grep -e pacemaker -e corosync
>> to check the state before
>> assuming it was really down?
>>
>> > Guess the answer from Poki should guide you well here ...
>>
>> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
>> > resources activating on the single node.
>> > This was not what I was expecting. I assumed that no resources would
>> > activate / start on any cluster nodes
>> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.

Your expectation is correct; I'm not sure what happened in this case.
There are some obscure corosync options (e.g. last_man_standing,
allow_downscale) that could theoretically lead to this, but I don't
get the impression you're using anything unusual.
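For reference, a sketch of where those options would live in corosync.conf. Both default to off, and nothing in the thread suggests they are set here; this is purely an illustration of the quorum section already quoted above:

```
quorum {
    provider: corosync_votequorum

    # Illustration only -- both options default to off. If enabled,
    # they relax how the quorum threshold is recalculated as the
    # cluster shrinks:
    # last_man_standing: 1
    # last_man_standing_window: 10000   # ms
    # allow_downscale: 1
}
```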
>> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
>> > this is what I see :
>> >
>> > [root@zs95kj VD]# date;pcs status |less
>> > Wed Sep 7 15:51:17 EDT 2016
>> > Cluster name: test_cluster_2
>> > Last updated: Wed Sep 7 15:51:18 2016
>> > Last change: Wed Sep 7 15:30:12 2016 by hacluster via crmd on zs93kjpcs1
>> > Stack: corosync
>> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>> > partition with quorum
>> > 106 nodes and 304 resources configured
>> >
>> > Node zs93KLpcs1: pending
>> > Node zs93kjpcs1: pending
>> > Node zs95KLpcs1: pending
>> > Online: [ zs95kjpcs1 ]
>> > OFFLINE: [ zs90kppcs1 ]
>> >
>> > .
>> > .
>> > .
>> > PCSD Status:
>> > zs93
Re: [ClusterLabs] Pacemaker quorum behavior
On 09/09/16 14:13 -0400, Scott Greenlese wrote:
> You had mentioned this command:
>
> pstree -p | grep -A5 $(pidof -x pcs)
>
> I'm not quite sure what the $(pidof -x pcs) represents??

This is a "command substitution" shell construct (new, blessed form of
`backtick` notation) that in this particular case was meant to yield
the PID of the running pcs command. The whole compound command was
then meant to possibly discover what pcs is running under the hood,
because that's what might get stuck.

> On an "Online" cluster node, I see:
>
> [root@zs93kj ~]# ps -ef |grep pcs |grep -v grep
> root 18876 1 0 Sep07 ? 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
> root 18905 18876 0 Sep07 ? 00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
> root 18906 18905 0 Sep07 ? 00:04:22 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
> [root@zs93kj ~]#
>
> If I use the 18876 PID on a healthy node, I get..
>
> [root@zs93kj ~]# pstree -p |grep -A5 18876
> |-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
> | |-{ruby}(20212)
> | `-{ruby}(224258)
> |-pkcsslotd(18851)
> |-polkitd(19091)-+-{polkitd}(19100)
> | |-{polkitd}(19101)
>
> Is this what you meant for me to do?

Only if I got my guess about the "pcs cluster stop" command being
stuck right, which is not the case, as you explained.

> If so, I'll be sure to do that next time I suspect processes are not
> exiting on cluster kill or stop.

In this other case, you really want to consult "systemctl status X"
for X in (corosync, pacemaker). And to be really sure, for instance,
"pgrep Y" for Y in (pacemakerd, crmd, corosync). (I hope I didn't
confuse you too much due to the mentioned wild guess originally.)

-- 
Jan (Poki)
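Jan's two checks can be combined into one loop over the cluster nodes. A sketch, not a tested script — the node names are taken from earlier posts in this thread, and the daemon names are the stock ones for this pacemaker version:

```shell
# Verify on every node that the cluster stack is really down: ask
# systemd about the units, then double-check for surviving processes.
verify_cluster_down() {
    local host
    for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1; do
        echo "== $host =="
        ssh "$host" '
            systemctl is-active corosync pacemaker
            for daemon in corosync pacemakerd crmd; do
                # pgrep -x matches the exact process name and exits
                # non-zero when nothing matches, i.e. really gone
                pgrep -x "$daemon" >/dev/null \
                    && echo "$daemon still running" \
                    || echo "$daemon not running"
            done
        '
    done
}
```

Unlike "pcs status", this does not depend on pcsd answering, so it cannot report a node as down while corosync quietly keeps voting.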
Re: [ClusterLabs] Pacemaker quorum behavior
ualDomain): Started zs95kjpcs1
zs95kjg109064_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
zs95kjg109065_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
zs95kjg109066_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
zs95kjg109067_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
zs95kjg109068_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
.
.
.
PCSD Status:
zs93kjpcs1: Online
zs95kjpcs1: Online
zs95KLpcs1: Online
zs90kppcs1: Offline
zs93KLpcs1: Online

Check resources again:

Wed Sep 7 16:09:52 EDT 2016
### VirtualDomain Resource Statistics: ###
"_res" Virtual Domain resources:
Started on zs95kj: 199
Started on zs93kj: 0
Started on zs95KL: 0
Started on zs93KL: 0
Started on zs90KP: 0
Total Started: 199
Total NOT Started: 1

I have since isolated all the corrupted virtual domain images and
disabled their VirtualDomain resources. We already rebooted all five
cluster nodes, after installing a new KVM driver on them. Now, the
quorum calculation and behavior seems to be working perfectly as
expected. I started pacemaker on the nodes, one at a time... and,
after 3 of the 5 nodes had pacemaker "Online" ... resources activated
and were evenly distributed across them.

In summary, a lesson learned here is to check the status of the pcs
process to be certain pacemaker and corosync are indeed "offline" and
that all threads of that process have terminated.

You had mentioned this command:

pstree -p | grep -A5 $(pidof -x pcs)

I'm not quite sure what the $(pidof -x pcs) represents??

On an "Online" cluster node, I see:

[root@zs93kj ~]# ps -ef |grep pcs |grep -v grep
root 18876 1 0 Sep07 ? 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root 18905 18876 0 Sep07 ? 00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root 18906 18905 0 Sep07 ? 00:04:22 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
[root@zs93kj ~]#

If I use the 18876 PID on a healthy node, I get..
[root@zs93kj ~]# pstree -p |grep -A5 18876
|-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
| |-{ruby}(20212)
| `-{ruby}(224258)
|-pkcsslotd(18851)
|-polkitd(19091)-+-{polkitd}(19100)
| |-{polkitd}(19101)

Is this what you meant for me to do? If so, I'll be sure to do that
next time I suspect processes are not exiting on cluster kill or stop.

Thanks

Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966

From: Jan Pokorný
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: Si Bo Niu , Scott Loveland/Poughkeepsie/IBM@IBMUS, Michael Tebolt/Poughkeepsie/IBM@IBMUS
Date: 09/08/2016 02:43 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
>
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts, I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seemed to have stuck, you can perhaps run some diagnostics like:

pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs waits on, next time.

> so I ended up doing individual (pcs cluster kill ) on
> each of the four cluster nodes. I then had to stop the virtual
> domains manually via 'virsh destroy ' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly, if corosync processes were indeed dead.
-- 
Jan (Poki)
Re: [ClusterLabs] Pacemaker quorum behavior
On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>
> Hi Klaus, thanks for your prompt and thoughtful feedback...
>
> Please see my answers nested below (sections entitled, "Scott's
> Reply"). Thanks!
>
> - Scott
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
> From: Klaus Wenninger
> To: users@clusterlabs.org
> Date: 09/08/2016 10:59 AM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>
> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
> >
> > Hi all...
> >
> > I have a few very basic questions for the group.
> >
> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> > VirtualDomain pacemaker-remote nodes
> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
> > to be 'symmetric' and I have no
> > location constraints on the 200 VirtualDomain resources (other than to
> > prevent the opaque guests
> > from running on the pacemaker remote node resources). My quorum is set
> > as:
> >
> > quorum {
> > provider: corosync_votequorum
> > }
> >
> > As an experiment, I powered down one LPAR in the cluster, leaving 4
> > powered up with the pcsd service up on the 4 survivors
> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
> > survivors. I then started pacemaker/corosync on a single cluster
> >
>
> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
> did you check the status of the individual services?
>
> Scott's reply:
>
> No, I only assumed that pacemaker was down because I got this back on
> my pcs status
> command from each cluster node:
>
> [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
> zs93kjpcs1 ; do ssh $host pcs status; done
> Wed Sep 7 15:49:27 EDT 2016
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
>
> What else should I check? The pcsd.service service was still up,
> since I didn't stop that
> anywhere. Should I have done, ps -ef |grep -e pacemaker -e corosync
> to check the state before
> assuming it was really down?
>
> Guess the answer from Poki should guide you well here ...
>
> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
> > resources activating on the single node.
> > This was not what I was expecting. I assumed that no resources would
> > activate / start on any cluster nodes
> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
> >
> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
> > this is what I see :
> >
> > [root@zs95kj VD]# date;pcs status |less
> > Wed Sep 7 15:51:17 EDT 2016
> > Cluster name: test_cluster_2
> > Last updated: Wed Sep 7 15:51:18 2016
> > Last change: Wed Sep 7 15:30:12 2016 by hacluster via crmd on zs93kjpcs1
> > Stack: corosync
> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> > partition with quorum
> > 106 nodes and 304 resources configured
> >
> > Node zs93KLpcs1: pending
> > Node zs93kjpcs1: pending
> > Node zs95KLpcs1: pending
> > Online: [ zs95kjpcs1 ]
> > OFFLINE: [ zs90kppcs1 ]
> >
> > .
> > .
> > .
> > PCSD Status:
> > zs93kjpcs1: Online
> > zs95kjpcs1: Online
> > zs95KLpcs1: Online
> > zs90kppcs1: Offline
> > zs93KLpcs1: Online
> >
> > So, what exactly constitutes an "Online" vs. "Offline" cluster node
> > w.r.t.
quorum calculation? Seems like in my case, it's "pending" on 3 > > nodes, > > so where does that fall? Any why "pending"? What does that mean? > > > > Also, what exactly is the cluster's expected reaction to quorum loss? > > Cluster resources will be stopped or something else? > > > Depends on how you configure it using cluster property no-quorum-policy > (default: stop). > > Scott's reply: > > This is how the policy is configured: > > [root@zs95kj VD]# date;pcs config |grep
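[Editor's note] Klaus's question ("did you check the status of the individual services?") can be answered directly from systemd rather than inferred from pcs status. A minimal sketch, under the assumption that the nodes run systemd and allow key-based ssh; the stack_down helper is purely illustrative, and the states are passed in as arguments so the check itself runs without a cluster:

```shell
# Treat the stack as really down only if no reported unit state is
# "active". States are parameters so the logic is testable offline.
stack_down() {
  for state in "$@"; do
    [ "$state" = "active" ] && return 1
  done
  return 0
}

# On a live cluster, something like (hostnames from this thread):
#   for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1; do
#     ssh "$host" systemctl is-active corosync pacemaker
#   done
# then feed the two reported states to stack_down, e.g.:
if stack_down inactive inactive; then
  echo "stack is down"    # prints here, since neither state is "active"
fi
```

This avoids the trap Scott hit: pcs status saying "cluster is not currently running" only proves pcs could not talk to the cluster, not that the daemons are gone.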
Re: [ClusterLabs] Pacemaker quorum behavior
On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
>
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts, I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?
If it seemed to have gotten stuck, you can perhaps run some diagnostics
like:

  pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs waits on, next time.

> so I ended up doing individual 'pcs cluster kill' commands on
> each of the four cluster nodes. I then had to stop the virtual
> domains manually via 'virsh destroy' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly, if the corosync processes were indeed dead.

--
Jan (Poki)
Re: [ClusterLabs] Pacemaker quorum behavior
Hi Klaus, thanks for your prompt and thoughtful feedback...

Please see my answers nested below (sections entitled, "Scott's
Reply"). Thanks!

- Scott

Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966

From: Klaus Wenninger
To: users@clusterlabs.org
Date: 09/08/2016 10:59 AM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>
> Hi all...
>
> I have a few very basic questions for the group.
>
> I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> VirtualDomain pacemaker-remote nodes
> plus 100 "opaque" VirtualDomain resources. The cluster is configured
> to be 'symmetric' and I have no
> location constraints on the 200 VirtualDomain resources (other than to
> prevent the opaque guests
> from running on the pacemaker remote node resources). My quorum is set
> as:
>
> quorum {
>     provider: corosync_votequorum
> }
>
> As an experiment, I powered down one LPAR in the cluster, leaving 4
> powered up with the pcsd service up on the 4 survivors
> but corosync/pacemaker down (pcs cluster stop --all) on the 4
> survivors. I then started pacemaker/corosync on a single cluster

"pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
did you check the status of the individual services?

Scott's reply:

No, I only assumed that pacemaker was down because I got this back on
my pcs status command from each cluster node:

[root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 ; do ssh $host pcs status; done
Wed Sep 7 15:49:27 EDT 2016
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node

What else should I check? The pcsd.service service was still up, since
I didn't stop that anywhere. Should I have done
ps -ef | grep -e pacemaker -e corosync
to check the state before assuming it was really down?

> node (pcs cluster start), and this resulted in the 200 VirtualDomain
> resources activating on the single node.
> This was not what I was expecting. I assumed that no resources would
> activate / start on any cluster nodes
> until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
>
> After starting pacemaker/corosync on the single host (zs95kjpcs1),
> this is what I see :
>
> [root@zs95kj VD]# date;pcs status |less
> Wed Sep 7 15:51:17 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 7 15:51:18 2016  Last change: Wed Sep 7 15:30:12
> 2016 by hacluster via crmd on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
>
> Node zs93KLpcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Online: [ zs95kjpcs1 ]
> OFFLINE: [ zs90kppcs1 ]
> .
> .
> .
> PCSD Status:
>   zs93kjpcs1: Online
>   zs95kjpcs1: Online
>   zs95KLpcs1: Online
>   zs90kppcs1: Offline
>   zs93KLpcs1: Online
>
> So, what exactly constitutes an "Online" vs. "Offline" cluster node
> w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
> nodes, so where does that fall? And why "pending"? What does that mean?
>
> Also, what exactly is the cluster's expected reaction to quorum loss?
> Cluster resources will be stopped or something else?

Depends on how you configure it using cluster property no-quorum-policy
(default: stop).

Scott's reply:

This is how the policy is configured:

[root@zs95kj VD]# date;pcs config |grep quorum
Thu Sep 8 13:18:33 EDT 2016
no-quorum-policy: stop

What should I expect with the 'stop' setting?

> Where can I find this documentation?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

Scott's reply:

OK, I'll keep looking thru this doc, but I don't easily find the
no-quorum-policy explained. Thanks..

> Thanks!
>
> Scott Greenlese - IBM Solution Test Team.
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966
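[Editor's note] Since the no-quorum-policy values are hard to find in that document, here is a short summary of the documented options, written as a lookup. The quorum_policy_action helper is purely illustrative; the wording paraphrases the "Cluster Options" section of Pacemaker Explained:

```shell
# What a partition does with its resources when it loses quorum,
# keyed by the no-quorum-policy cluster property.
quorum_policy_action() {
  case "$1" in
    stop)    echo "stop all resources in the affected partition" ;;
    ignore)  echo "continue all resource management" ;;
    freeze)  echo "continue managing existing resources, but do not recover or start new ones" ;;
    suicide) echo "fence all nodes in the affected partition" ;;
    *)       echo "unknown policy: $1" ;;
  esac
}

quorum_policy_action stop   # the default, and what Scott has configured
```

So with Scott's setting, a node that finds itself in a quorum-less partition should stop every resource it is running, which is why resources starting on a single node of five looked wrong.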
Re: [ClusterLabs] Pacemaker quorum behavior
On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>
> Hi all...
>
> I have a few very basic questions for the group.
>
> I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> VirtualDomain pacemaker-remote nodes
> plus 100 "opaque" VirtualDomain resources. The cluster is configured
> to be 'symmetric' and I have no
> location constraints on the 200 VirtualDomain resources (other than to
> prevent the opaque guests
> from running on the pacemaker remote node resources). My quorum is set
> as:
>
> quorum {
>     provider: corosync_votequorum
> }
>
> As an experiment, I powered down one LPAR in the cluster, leaving 4
> powered up with the pcsd service up on the 4 survivors
> but corosync/pacemaker down (pcs cluster stop --all) on the 4
> survivors. I then started pacemaker/corosync on a single cluster

"pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
did you check the status of the individual services?

> node (pcs cluster start), and this resulted in the 200 VirtualDomain
> resources activating on the single node.
> This was not what I was expecting. I assumed that no resources would
> activate / start on any cluster nodes
> until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
>
> After starting pacemaker/corosync on the single host (zs95kjpcs1),
> this is what I see :
>
> [root@zs95kj VD]# date;pcs status |less
> Wed Sep 7 15:51:17 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 7 15:51:18 2016  Last change: Wed Sep 7 15:30:12
> 2016 by hacluster via crmd on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
>
> Node zs93KLpcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Online: [ zs95kjpcs1 ]
> OFFLINE: [ zs90kppcs1 ]
> .
> .
> .
> PCSD Status:
>   zs93kjpcs1: Online
>   zs95kjpcs1: Online
>   zs95KLpcs1: Online
>   zs90kppcs1: Offline
>   zs93KLpcs1: Online
>
> So, what exactly constitutes an "Online" vs. "Offline" cluster node
> w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
> nodes, so where does that fall? And why "pending"? What does that mean?
>
> Also, what exactly is the cluster's expected reaction to quorum loss?
> Cluster resources will be stopped or something else?

Depends on how you configure it using cluster property no-quorum-policy
(default: stop).

> Where can I find this documentation?

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

> Thanks!
>
> Scott Greenlese - IBM Solution Test Team.
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966
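[Editor's note] Scott's "3 out of 5" expectation matches how corosync_votequorum computes quorum under default options (one vote per node, and none of the two_node / last_man_standing / wait_for_all tweaks): a strict majority of the expected votes. A sketch of the arithmetic, with quorum_votes as a hypothetical helper:

```shell
# Strict majority of n single-vote nodes: floor(n/2) + 1.
quorum_votes() {
  echo $(( $1 / 2 + 1 ))
}

quorum_votes 5   # prints 3: three of the five LPARs are needed for quorum
# On a running node, 'corosync-quorumtool -s' reports the live
# expected-votes and quorum values for comparison.
```

Note this says nothing about how the single started node in Scott's test came to believe it had quorum; it only fixes what the threshold should have been.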
Re: [ClusterLabs] Pacemaker quorum behavior
Correction...

When I stopped pacemaker/corosync on the four (powered on / active)
cluster node hosts, I was having an issue with the gentle method of
stopping the cluster (pcs cluster stop --all), so I ended up doing
individual 'pcs cluster kill' commands on each of the four cluster
nodes. I then had to stop the virtual domains manually via
'virsh destroy' on each host. Perhaps there was some residual node
status affecting my quorum?

Thanks...

Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966

From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
To: users@clusterlabs.org
Cc: Si Bo Niu, Scott Loveland/Poughkeepsie/IBM@IBMUS,
    Michael Tebolt/Poughkeepsie/IBM@IBMUS
Date: 09/08/2016 10:01 AM
Subject: [ClusterLabs] Pacemaker quorum behavior

Hi all...

I have a few very basic questions for the group.

I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
VirtualDomain pacemaker-remote nodes plus 100 "opaque" VirtualDomain
resources. The cluster is configured to be 'symmetric' and I have no
location constraints on the 200 VirtualDomain resources (other than to
prevent the opaque guests from running on the pacemaker remote node
resources). My quorum is set as:

quorum {
    provider: corosync_votequorum
}

As an experiment, I powered down one LPAR in the cluster, leaving 4
powered up with the pcsd service up on the 4 survivors but
corosync/pacemaker down (pcs cluster stop --all) on the 4 survivors.
I then started pacemaker/corosync on a single cluster node (pcs
cluster start), and this resulted in the 200 VirtualDomain resources
activating on the single node. This was not what I was expecting. I
assumed that no resources would activate / start on any cluster nodes
until 3 out of the 5 total cluster nodes had pacemaker/corosync
running.

After starting pacemaker/corosync on the single host (zs95kjpcs1),
this is what I see:

[root@zs95kj VD]# date;pcs status |less
Wed Sep 7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 7 15:51:18 2016  Last change: Wed Sep 7 15:30:12
2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
partition with quorum
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]
.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online

So, what exactly constitutes an "Online" vs. "Offline" cluster node
w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
nodes, so where does that fall? And why "pending"? What does that mean?

Also, what exactly is the cluster's expected reaction to quorum loss?
Cluster resources will be stopped or something else?

Where can I find this documentation?

Thanks!

Scott Greenlese - IBM Solution Test Team.

Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)  M/S: POK 42HA/P966
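[Editor's note] As a footnote to Scott's correction earlier in the thread (guests had to be stopped by hand with 'virsh destroy' after 'pcs cluster kill'), the per-host cleanup can be sketched as below. The destroy_running_domains helper and the sample domain names are hypothetical, and the listing is passed in as an argument so the loop can be exercised without libvirt:

```shell
# Emit a 'virsh destroy' command for every domain name in the listing.
# On a real host the listing would come from: virsh list --name
destroy_running_domains() {
  for dom in $1; do
    echo "virsh destroy $dom"    # drop the echo to actually run it
  done
}

destroy_running_domains "guest101 guest102"
```

Printing the commands first (and only then piping them to sh, or removing the echo) is a cheap safety net when a typo would otherwise hard-stop a hundred guests.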