On Thu, 2019-04-18 at 15:51 -0600, JCA wrote:
> I have my CentOS two-node cluster, which some of you may already be
> sick and tired of reading about:
> 
> # pcs status
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
> Last updated: Thu Apr 18 13:52:38 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
> 
> 2 nodes configured
> 5 resources configured
> 
> Online: [ one two ]
> 
> Full list of resources:
> 
>  MyCluster      (ocf::myapp:myapp-script):     Started two
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ two ]
>      Slaves: [ one ]
>  DrbdFS         (ocf::heartbeat:Filesystem):   Started two
>  disk_fencing   (stonith:fence_scsi):          Started one
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> I can stop either node, and the other will take over as expected.
> Here is the thing, though:
> 
> myapp-script starts, stops, and monitors the actual application that I
> am interested in. I'll call this application A. At the OS level, A is
> of course listed when I do ps awux.
> 
> In the situation above, where A is running on two, I can kill A from
> the CentOS command line on two. Shortly after doing so, Pacemaker
> invokes myapp-script on two, performing the following actions and
> getting the following return values:
> 
> monitor: OCF_NOT_RUNNING
> stop: OCF_SUCCESS
> start: OCF_SUCCESS
> monitor: OCF_SUCCESS
> 
> After this, with ps auwx on two I can see that A is indeed up and
> running. However, the output from pcs status (on either one or two)
> is now the following:
> 
> Cluster name: FirstCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
> Last updated: Thu Apr 18 15:21:25 2019
> Last change: Thu Apr 18 13:50:57 2019 by root via cibadmin on one
> 
> 2 nodes configured
> 5 resources configured
> 
> Online: [ one two ]
> 
> Full list of resources:
> 
>  MyCluster      (ocf::myapp:myapp-script):     Started two
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ two ]
>      Slaves: [ one ]
>  DrbdFS         (ocf::heartbeat:Filesystem):   Started two
>  disk_fencing   (stonith:fence_scsi):          Started one
> 
> Failed Actions:
> * MyCluster_monitor_30000 on two 'not running' (7): call=35,
>   status=complete, exitreason='',
>     last-rc-change='Thu Apr 18 15:21:12 2019', queued=0ms, exec=0ms
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> And the cluster seems to stay stuck there until I stop and start
> node two explicitly.
> 
> Is this the expected behavior? What I was expecting is for
> Pacemaker to restart A, on either node - which it indeed does, on two
> itself. But pcs status seems to think that an error happened when
> trying to restart A - despite the fact that it got A restarted all
> right. And I know that A is running correctly, to boot.
> 
> What am I misunderstanding here?
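
For context, the monitor/stop/start sequence above is the standard way an OCF
agent reports state back to Pacemaker. Below is a minimal sketch of the kind of
action dispatch an agent like myapp-script presumably implements; the daemon
path and pidfile are made-up placeholders, not details taken from the actual
script:

    #!/bin/sh
    # Sketch of an OCF-style agent: map actions to OCF exit codes.
    # Source the standard OCF shell functions (provides OCF_SUCCESS=0,
    # OCF_NOT_RUNNING=7, and friends).
    : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    PIDFILE="/var/run/myapp.pid"        # hypothetical pidfile location

    case "$1" in
    start)
        /usr/local/bin/myapp --daemon   # hypothetical start command
        exit $OCF_SUCCESS
        ;;
    stop)
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" 2>/dev/null
        exit $OCF_SUCCESS
        ;;
    monitor)
        # This is the branch that returns OCF_NOT_RUNNING after A has been
        # killed by hand, which is what triggers the stop/start recovery.
        if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            exit $OCF_SUCCESS
        else
            exit $OCF_NOT_RUNNING
        fi
        ;;
    *)
        exit $OCF_ERR_UNIMPLEMENTED
        ;;
    esac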
You got everything right, except the display is not saying the restart failed
-- it's saying there was a monitor failure that led to the restart.

The "Failed Actions" section is a history rather than the current status (which
is the "Full list of resources" section). The idea is that failures might occur
when you're not looking :) and you can see that they happened the next time you
check the status, even if the cluster was able to recover successfully.

To clear the history, run "crm_resource -C -r MyCluster" (or "pcs resource
cleanup MyCluster" if you're using pcs).
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
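
For reference, the cleanup commands mentioned in the reply, as they would be
run on either node for the resource named in this thread (MyCluster); either
form works, depending on whether you prefer crm_resource or pcs:

    # Clear the stored failure history for the MyCluster resource
    crm_resource -C -r MyCluster

    # ...or the pcs equivalent
    pcs resource cleanup MyCluster

    # The "Failed Actions" entry should no longer appear afterwards
    pcs status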