Hi,

our Pacemaker setup provides a MySQL resource using the OCF resource agent. Today my colleagues and I tested forcing the MySQL resource to fail, and I don't understand the following behaviour: when I remove the mysqld_safe binary (whose path is specified in the crm config) from one server and then move the MySQL resource to that server, the resource does not fail back and stays in the "unmanaged" state. We can see that the check_binary() function is called within the mysql OCF resource agent and exits with return code 5. The fail-count is raised to INFINITY, Pacemaker then tries to "stop" the resource, and that stop fails as well, which results in the "unmanaged" state.
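For context, the check that produces the return code 5 boils down to something like the following (a simplified sketch of the check_binary helper from the resource-agents shell functions; the real helper exits the whole agent with OCF_ERR_INSTALLED instead of returning, I've changed that here only so it can be demonstrated inline):

```shell
#!/bin/sh
# Simplified sketch of the check_binary helper used by the mysql OCF agent
# (modelled on resource-agents' ocf-shellfuncs). OCF_ERR_INSTALLED is the
# standard OCF "not installed" return code.
OCF_ERR_INSTALLED=5

check_binary() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "ERROR: Setup problem: couldn't find command: $1" >&2
        return $OCF_ERR_INSTALLED
    fi
    return 0
}

# With mysqld_safe moved away, the check fails with rc 5:
check_binary /usr/bin/mysqld_safe.bak.gone
rc=$?
echo "check_binary returned $rc"   # -> check_binary returned 5
```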
How to reproduce:

1. mysql resource is running on node1
2. on node2: mv /usr/bin/mysqld_safe{,.bak}
3. crm resource move group-MySQL node2
4. observe corosync.log and crm_mon

# cat /var/log/corosync/corosync.log
[...]
May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_30000 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222)
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223)
May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed
May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0
May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_30000 (call=120, rc=0, cib-update=100, confirmed=false) ok
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY)
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421)
May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421
May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1] cancelled
May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57
May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 )
May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278)
[...]

I cannot figure out why the fail-count is raised to INFINITY, and especially why Pacemaker tries to stop the resource after the start has failed. Wouldn't it be best for the resource to fail back to another node instead of ending up "unmanaged" on this node? Is it possible to force this behaviour in any way?
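To illustrate what I mean by forcing a fail-back: I would have expected something along the lines of the following crmsh commands to limit retries and push the resource to another node (illustrative and untested on my side; migration-threshold and failure-timeout are standard resource meta attributes, but I don't know whether they help in this "not installed" case):

```shell
# Illustrative only, untested: move the resource away after a single
# failure and let the fail-count expire after a minute.
crm configure rsc_defaults migration-threshold=1 failure-timeout=60s

# Or per resource:
crm resource meta res-MySQL set migration-threshold 1
```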
Here are some specs of the software used on our cluster nodes:

node1:~# lsb_release -d && dpkg -l pacemaker | awk '/ii/{print $2,$3}' && uname -ri
Description: Ubuntu 12.04.2 LTS
pacemaker 1.1.6-2ubuntu3
3.2.0-41-generic x86_64

Best regards
Vladimir

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org