Poki,

Once again, I must apologize for presenting you and the users group with some misinformation. After triple-checking my note log, it seems I described the two actions to you backwards: it was the kill, not the gentle shutdown, that I had issues with, and I had done them in the reverse order.
In reality, I attempted the individual `pcs cluster kill` commands first, and it was the kills that were ineffective... and by "ineffective", I mean that the resources were not stopping (I impatiently only waited about 4 minutes before making that determination). I then ran `pcs cluster stop --all`, which seemed to work... and by "work", I mean that subsequent attempts to issue `pcs status` returned the message: "Error: cluster is not currently running on this node".

You are probably wondering why I would choose the last-resort, controversial "kill" method over the gentler stop method as my first attempt to stop the cluster. I do not usually do this first, but so many times I have tried using 'stop' while I have resources in 'failed' state, and it has taken up to 20 minutes for the stop to quiesce / complete. So, in this case I thought I'd expedite things, try the kill first, and see what happened. I'm wondering now whether having orphaned virtual domains left running is the expected behavior after a kill?

Full disclosure, with time stamps (and a bit of a digression)...

What led up to shutting down the cluster in the first place was that I had numerous VirtualDomain resources and their domains running on multiple hosts concurrently, a disastrous situation which resulted in corruption of many of my virtual image volumes. For example:

 zs95kjg109082_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs93kjpcs1 ]
 zs95kjg109083_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109085_res (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109086_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109087_res (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109089_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs93kjpcs1 ]
 zs95kjg109090_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109091_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109092_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109094_res (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109095_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109096_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109097_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109099_res (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109100_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109101_res (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109102_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109104_res (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109105_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
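As an aside: when pcs reports a VirtualDomain as Started on two nodes like that, I now also double-check where the guest is actually running at the hypervisor level, outside of pacemaker's view. A quick loop along these lines would do it (a rough, untested sketch; it assumes passwordless ssh between the nodes and uses zs95kjg109082 as the example guest):

for host in zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ; do
    echo "== $host =="
    # ask libvirt directly whether this guest is running on this node
    ssh $host "virsh domstate zs95kjg109082"
done

If more than one node reports "running", the dual-active condition is real and the backing image is very likely being written from two places, which matches the corruption I saw.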
I also had numerous FAILED resources which, I strongly suspect, were due to corruption of the virtual system image volumes (I later had to recover them via fsck):

 zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
 zs95kjg110100_res (ocf::heartbeat:VirtualDomain): FAILED zs95kjpcs1
 zs95kjg110101_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs95KLpcs1
 zs95kjg110098_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
 WebSite (ocf::heartbeat:apache): FAILED zs95kjg110090
 fence_S90HMC1 (stonith:fence_ibmz): Started zs95kjpcs1
 zs95kjg110105_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs95KLpcs1
 zs95kjg110107_res (ocf::heartbeat:VirtualDomain): FAILED zs95kjpcs1
 zs95kjg110108_res (ocf::heartbeat:VirtualDomain): FAILED[ zs95kjpcs1 zs93kjpcs1 ]

The pacemaker logs were jam-packed with this message for many of my VirtualDomain resources:

Sep 07 15:10:50 [32366] zs93kl crm_resource: ( native.c:97 ) debug: native_add_running: zs95kjg110195_res is active on 2 nodes including zs93kjpcs1: attempting recovery

I actually reported an earlier occurrence of this in a previous thread, subject: "[ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure". This is a more recent occurrence of that issue, which we suspect was caused by rebooting a cluster node without first stopping pacemaker on that host. We typically put the node in 'cluster standby', wait for resources to move away from the node, and then issue 'reboot'. The reboot action on our LPARs is configured to perform a halt (deactivate) and then an activate; it is not a graceful system shutdown. (End of digression.)

Anyway, in an attempt to stabilize and recover this mess, I did the cluster kills, followed by the cluster stop --all, as follows:

[root@zs95kj VD]# date;pcs cluster kill
Wed Sep 7 15:28:26 EDT 2016
[root@zs95kj VD]#

[root@zs93kl VD]# date;pcs cluster kill
Wed Sep 7 15:28:44 EDT 2016

[root@zs95kj VD]# date;pcs cluster kill
Wed Sep 7 15:29:06 EDT 2016
[root@zs95kj VD]#

[root@zs95KL VD]# date;pcs cluster kill
Wed Sep 7 15:29:30 EDT 2016
[root@zs95KL VD]#

[root@zs93kj ~]# date;pcs cluster kill
Wed Sep 7 15:30:06 EDT 2016
[root@zs93kj ~]#

[root@zs95kj VD]# pcs status |less
Cluster name: test_cluster_2
Last updated: Wed Sep 7 15:31:24 2016
Last change: Wed Sep 7 15:14:07 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
106 nodes and 304 resources configured

Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

As I said, I waited about 4 minutes and it didn't look like the resources were stopping, and it also didn't look like the nodes were going offline (note: zs90kppcs1 was shut down, so of course it's offline). So, impatiently, I then did the cluster stop, which surprisingly completed very quickly...

[root@zs95kj VD]# date;pcs cluster stop --all
Wed Sep 7 15:32:27 EDT 2016
zs90kppcs1: Unable to connect to zs90kppcs1 ([Errno 113] No route to host)
zs93kjpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs93KLpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs90kppcs1: Unable to connect to zs90kppcs1 ([Errno 113] No route to host)
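In hindsight, before concluding that the kills "didn't work", I should also have verified on each node that the corosync and pacemaker processes were actually gone, instead of just watching pcs status from one node. Something along these lines would have told me (again, a rough, untested sketch assuming passwordless ssh between the nodes):

for host in zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ; do
    echo "== $host =="
    # pidof prints the PIDs of any of these daemons still alive; otherwise the echo fires
    ssh $host 'pidof corosync pacemakerd crmd lrmd || echo "no cluster daemons running here"'
done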
This was when I discovered that the virtual domains themselves were still running on all the hosts, so I used this script on each host to "destroy" (forcibly shut down) them...

[root@zs95kj VD]# cat destroyall.sh
for guest in `virsh list |grep running |awk '{print $2}'`; do virsh destroy $guest; done

[root@zs95kj VD]# ./destroyall.sh
Domain zs95kjg110190 destroyed

Domain zs95kjg110211 destroyed
.
.
(omitted dozens of "Domain xxx destroyed" messages)
.

...and then checked them:

[root@zs95kj VD]# for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 ; do ssh $host virsh list; done
 Id Name State
----------------------------------------------------

 Id Name State
----------------------------------------------------

 Id Name State
----------------------------------------------------

 Id Name State
----------------------------------------------------

Next, I decided to run this quorum test, which resulted in the unexpected behavior (as originally reported in this thread):

TEST: With pacemaker initially down on all nodes, start the cluster on one node at a time, and see what happens when we reach quorum at 3.

[root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 ; do ssh $host pcs status; done
Wed Sep 7 15:49:27 EDT 2016
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node

[root@zs95kj VD]# date;pcs cluster start
Wed Sep 7 15:50:00 EDT 2016
Starting Cluster...
[root@zs95kj VD]#

[root@zs95kj VD]# while true; do date;./ckrm.sh; sleep 10; done
Wed Sep 7 15:50:09 EDT 2016

### VirtualDomain Resource Statistics: ###

"_res" Virtual Domain resources:
Started on zs95kj: 0
Started on zs93kj: 0
Started on zs95KL: 0
Started on zs93KL: 0
Started on zs90KP: 0
Total Started: 0
Total NOT Started: 200

To my surprise, the resources started coming up on zs95kj. Apparently, I have quorum?

[root@zs95kj VD]# date;pcs status |less
Wed Sep 7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 7 15:51:18 2016
Last change: Wed Sep 7 15:30:12 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum <<< WHY DO I HAVE QUORUM?
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

Full list of resources:

 zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109064_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109065_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109066_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109067_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109068_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 .
 .
 .

PCSD Status:
 zs93kjpcs1: Online
 zs95kjpcs1: Online
 zs95KLpcs1: Online
 zs90kppcs1: Offline
 zs93KLpcs1: Online

Check resources again:

Wed Sep 7 16:09:52 EDT 2016

### VirtualDomain Resource Statistics: ###

"_res" Virtual Domain resources:
Started on zs95kj: 199
Started on zs93kj: 0
Started on zs95KL: 0
Started on zs93KL: 0
Started on zs90KP: 0
Total Started: 199
Total NOT Started: 1

I have since isolated all the corrupted virtual domain images and disabled their VirtualDomain resources. We have already rebooted all five cluster nodes, after installing a new KVM driver on them. Now, the quorum calculation and behavior seem to be working exactly as expected.
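Going forward, when I bring nodes into the cluster one at a time I also plan to capture the corosync votequorum view alongside my resource counts, so it is obvious exactly when quorum is reached. Roughly like this, modeled on my ckrm.sh loop above (a sketch; the exact output fields may vary by corosync level):

while true; do
    date
    # -s prints the current votequorum status: expected votes, total votes, quorum, quorate yes/no
    corosync-quorumtool -s
    sleep 10
done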
I started pacemaker on the nodes, one at a time, and after 3 of the 5 nodes had pacemaker "Online", the resources activated and were evenly distributed across them.

In summary, a lesson learned here is to check the status of the pcs processes, to be certain pacemaker and corosync are indeed "offline" and that all threads of those processes have terminated.

You had mentioned this command:

pstree -p | grep -A5 $(pidof -x pcs)

I'm not quite sure what the $(pidof -x pcs) represents?? On an "Online" cluster node, I see:

[root@zs93kj ~]# ps -ef |grep pcs |grep -v grep
root 18876 1 0 Sep07 ? 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root 18905 18876 0 Sep07 ? 00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root 18906 18905 0 Sep07 ? 00:04:22 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
[root@zs93kj ~]#

If I use the 18876 PID on a healthy node, I get:

[root@zs93kj ~]# pstree -p |grep -A5 18876
 |-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
 | |-{ruby}(20212)
 | `-{ruby}(224258)
 |-pkcsslotd(18851)
 |-polkitd(19091)-+-{polkitd}(19100)
 | |-{polkitd}(19101)

Is this what you meant for me to do? If so, I'll be sure to do that next time I suspect processes are not exiting on a cluster kill or stop.

Thanks,

Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com
PHONE: 8/293-7301 (845-433-7301)
M/S: POK 42HA/P966


From: Jan Pokorný <jpoko...@redhat.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Si Bo Niu <nius...@cn.ibm.com>, Scott Loveland/Poughkeepsie/IBM@IBMUS, Michael Tebolt/Poughkeepsie/IBM@IBMUS
Date: 09/08/2016 02:43 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
>
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts, I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seemed to have stuck, you can perhaps run some diagnostics like:

pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs waits on, next time.

> so I ended up doing individual (pcs cluster kill <cluster_node>) on
> each of the four cluster nodes. I then had to stop the virtual
> domains manually via 'virsh destroy <guestname>' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly if corosync processes were indeed dead.

--
Jan (Poki)
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org