Thanks. See my comments interspersed below.

On Mon, Apr 15, 2019 at 4:30 PM Ken Gaillot <kgail...@redhat.com> wrote:
> On Mon, 2019-04-15 at 14:15 -0600, JCA wrote:
> > I have a simple two-node cluster, node one and node two, with a
> > single resource, ClusterMyApp. The nodes are CentOS 7 VMs. The
> > resource is created by executing the following line on node one:
> >
> > # pcs resource create ClusterMyApp ocf:myapp:myapp-script op
> > monitor interval=30s
>
> FYI, it doesn't matter which node you change the configuration on --
> it's automatically sync'd across all nodes.

OK.

> > This invokes myapp-script, which I installed under
> > /usr/lib/ocf/resource.d/myapp/myapp-script, on both one and two -
> > i.e., it is exactly the same script on both nodes.
> >
> > On executing the command above on node one, I get the following
> > log entries on node one itself:
> >
> > Apr 15 13:40:12 one crmd[13670]: notice: Result of probe operation
> > for ClusterMyApp on one: 7 (not running)
> > Apr 15 13:40:12 one crmd[13670]: notice: Result of start operation
> > for ClusterMyApp on one: 0 (ok)
>
> Whenever starting the cluster on a node, or adding a resource,
> Pacemaker probes the state of all resources on the node (or on all
> nodes, in the case of adding a resource) by calling the agent's
> "monitor" action once. You'll see this "Result of probe operation"
> for all resources on all nodes.
>
> This allows Pacemaker to detect if and where a service is already
> running.

OK.

> > This is in line with what I expect from myapp-script when invoked
> > with the 'start' option (which is what the command above is
> > doing). myapp-script first checks whether my app is running, and
> > if it is not, launches it. The rest of the log entries have to do
> > with my app, and indicate that it started without any problems.
> >
> > On node two, when the command above is executed on one, the
> > following log entries are generated:
> >
> > Apr 15 13:40:12 two crmd[4293]: notice: State transition S_IDLE ->
> > S_POLICY_ENGINE
> > Apr 15 13:40:12 two pengine[4292]: notice: * Start
> > ClusterMyApp (one)
> > Apr 15 13:40:12 two pengine[4292]: notice: Calculated transition
> > 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-66.bz2
>
> At any given time, one node in the cluster is elected the
> "Designated Controller" (DC). That node calculates what (if
> anything) needs to be done, and tells the right nodes to do it.
> Above, it has determined that ClusterMyApp needs to be started on
> node one.

OK.

> > Apr 15 13:40:12 two crmd[4293]: notice: Initiating monitor
> > operation ClusterMyApp_monitor_0 locally on two
> > Apr 15 13:40:12 two crmd[4293]: notice: Initiating monitor
> > operation ClusterMyApp_monitor_0 on one
>
> The cluster first probes the current state of the service on both
> nodes, before any actions have been taken. The expected result is
> "not running".

OK.
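(To make that probe contract concrete for anyone following along: the
monitor action is expected to distinguish at least "running" from
"cleanly not running". Below is a minimal, hypothetical pidfile-based
monitor - a sketch of the pattern, not my actual myapp_monitor. The
pidfile path is illustrative, and the OCF_* return-code variables come
from ocf-shellfuncs, which a real agent sources near the top.)

    # Hypothetical pidfile path - adjust for the real app.
    MYAPP_PIDFILE="/var/run/myapp.pid"

    myapp_monitor_sketch() {
        # No pidfile: the app was never started, or stopped cleanly.
        if [ ! -f "$MYAPP_PIDFILE" ]; then
            return $OCF_NOT_RUNNING
        fi

        local pid
        pid=$(cat "$MYAPP_PIDFILE")

        # Pidfile present and the process is alive: running.
        if kill -0 "$pid" 2>/dev/null; then
            return $OCF_SUCCESS
        fi

        # Stale pidfile: clean up and report "not running" so that
        # a probe or start can recover.
        rm -f "$MYAPP_PIDFILE"
        return $OCF_NOT_RUNNING
    }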
> > Apr 15 13:40:12 two crmd[4293]: notice: Result of probe operation
> > for ClusterMyApp on two: 7 (not running)
> > Apr 15 13:40:12 two crmd[4293]: notice: Initiating start operation
> > ClusterMyApp_start_0 on one
> > Apr 15 13:40:12 two crmd[4293]: notice: Initiating monitor
> > operation ClusterMyApp_monitor_30000 on one
> > Apr 15 13:40:12 two crmd[4293]: warning: Action 4
> > (ClusterMyApp_monitor_30000) on one failed (target: 0 vs. rc: 7):
> > Error
> > Apr 15 13:40:12 two crmd[4293]: notice: Transition aborted by
> > operation ClusterMyApp_monitor_30000 'create' on one: Event failed
>
> The cluster successfully probed the service on both nodes, and
> started it on node one. It then tried to start a 30-second recurring
> monitor for the service, but the monitor immediately failed (the
> expected result was "running", but the monitor said it was not
> running).

It failed where? On one, I know for a fact that my app is running, as
reported by ps. I also know it started correctly and is sitting there
waiting for stuff to do - it is driven by timers and external events.
On two, of course, it is not running.

> > After doing all of the above, pcs status returns the following,
> > when invoked on either node:
> >
> > Cluster name: MyCluster
> > Stack: corosync
> > Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> > with quorum
> > Last updated: Mon Apr 15 13:45:14 2019
> > Last change: Mon Apr 15 13:40:11 2019 by root via cibadmin on one
> >
> > 2 nodes configured
> > 1 resource configured
> >
> > Online: [ one two ]
> >
> > Full list of resources:
> >
> > ClusterMyApp (ocf::myapp:myapp-script): Started one
> >
> > Failed Actions:
> > * ClusterMyApp_monitor_30000 on one 'not running' (7): call=37,
> > status=complete, exitreason='',
> > last-rc-change='Mon Apr 15 13:40:12 2019', queued=0ms, exec=105ms
> >
> > Daemon Status:
> > corosync: active/disabled
> > pacemaker: active/disabled
> > pcsd: active/enabled
> >
> > The start function in this script is as follows:
> >
> > myapp_start() {
> >     myapp_conf_check
> >     local diagnostic=$?
> >
> >     if [ $diagnostic -ne $OCF_SUCCESS ]; then
> >         return $diagnostic
> >     fi
> >
> >     myapp_monitor
> >     local state=$?
> >
> >     case $state in
> >         $OCF_SUCCESS)
> >             return $OCF_SUCCESS
> >             ;;
> >
> >         $OCF_NOT_RUNNING)
> >             myapp_launch > /dev/null 2>&1
> >             if [ $? -eq 0 ]; then
> >                 return $OCF_SUCCESS
> >             fi
> >             return $OCF_ERR_GENERIC
> >             ;;
> >
> >         *)
> >             return $state
> >             ;;
> >     esac
> > }
> >
> > I know for a fact that, on one, myapp_launch gets invoked, and
> > that its exit value is 0. The function therefore returns
> > OCF_SUCCESS, as it should. However, if I understand things
> > correctly, the log entries on two seem to claim that the exit
> > value of the script on one is OCF_NOT_RUNNING.
>
> The start succeeded. It's the recurring monitor that failed.

I assume that the recurring monitor invokes the myapp_monitor function
that I created at the same time as myapp_start. Well, as far as I can
tell, the problem is that, as I mentioned in my later posts,
myapp_start gets invoked several times when the resource is created.
As a result - although I have yet to understand it in detail -
myapp_monitor occasionally fails somewhere in that sequence. Whether
the resource starts correctly or not depends on what happens the last
time myapp_monitor is invoked during resource creation - which is why
the whole thing works every so often. Why is myapp_start invoked so
many times on creating the resource? Who or what controls this? Is it
configurable?

> > What's going on here? It's obviously something to do with
> > myapp-script - but what?

> --
> Ken Gaillot <kgail...@redhat.com>
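P.S. For anyone who lands on this thread later: the recurring monitor
is an operation defined on the resource, so its interval can be
changed with pcs, and the agent's actions can also be run by hand to
see exactly what they return, without waiting for the cluster to
schedule them. The commands below assume the stock pcs and
crm_resource tooling; the 60s interval is just an illustrative value.

    # Change the recurring monitor's interval on the existing resource.
    pcs resource update ClusterMyApp op monitor interval=60s

    # Run the agent's monitor action directly, with verbose output.
    pcs resource debug-monitor ClusterMyApp --full

    # Lower-level equivalent via crm_resource.
    crm_resource --resource ClusterMyApp --force-check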
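P.P.S. One more thought on the start/monitor interplay, in case it is
the culprit: myapp_start returns OCF_SUCCESS as soon as myapp_launch
exits 0, so if the app needs a moment before myapp_monitor would
report it as running, the first recurring monitor can fire inside that
window and fail exactly as the logs above show. A common pattern is
for start to poll its own monitor until it reports "running" before
returning. This is a sketch under that assumption, not my actual
function; the 10 x 1s polling loop is an arbitrary illustrative bound
(the cluster enforces the real limit via the start operation's
timeout):

    myapp_start_polling() {
        myapp_monitor
        if [ $? -eq $OCF_SUCCESS ]; then
            return $OCF_SUCCESS    # already running: nothing to do
        fi

        myapp_launch > /dev/null 2>&1 || return $OCF_ERR_GENERIC

        # Do not report success until our own monitor agrees.
        local tries=0
        while [ $tries -lt 10 ]; do
            myapp_monitor
            if [ $? -eq $OCF_SUCCESS ]; then
                return $OCF_SUCCESS
            fi
            sleep 1
            tries=$((tries + 1))
        done

        return $OCF_ERR_GENERIC
    }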
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/