I am observing strange behavior that I cannot explain, on Pacemaker 1.1.11-3ca8c3b.
There is a master/slave resource running. Maintenance-mode was set, pacemaker was restarted, and maintenance-mode was then unset. This particular RA returns Slave instead of Master for the first probe, but what happens afterwards is rather strange.

On startup pacemaker initiates the probes:

Oct 29 11:25:59 n1 crmd[2879]: notice: te_rsc_command: Initiating action 3: monitor rsc_SAPHana_HDB_HDB00:0_monitor_0 on n1 (local)
Oct 29 11:25:59 n1 crmd[2879]: notice: te_rsc_command: Initiating action 4: monitor rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0 on n1 (local)

That's fine. The agents return OK (correctly for the clone rsc_SAPHanaTopology_HDB, possibly incorrectly for the master/slave rsc_SAPHana_HDB):

Oct 29 11:26:03 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHanaTopology_HDB_HDB00_monitor_0 (call=12, rc=0, cib-update=49, confirmed=true) ok
Oct 29 11:26:03 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_monitor_0 (call=7, rc=0, cib-update=50, confirmed=true) ok

But pacemaker suddenly counts each of them twice. Why?

Oct 29 11:26:03 n1 crmd[2879]: warning: status_from_rc: Action 4 (rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc: 0): Error
Oct 29 11:26:03 n1 crmd[2879]: warning: status_from_rc: Action 4 (rsc_SAPHanaTopology_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc: 0): Error
Oct 29 11:26:03 n1 crmd[2879]: warning: status_from_rc: Action 3 (rsc_SAPHana_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc: 0): Error
Oct 29 11:26:03 n1 crmd[2879]: warning: status_from_rc: Action 3 (rsc_SAPHana_HDB_HDB00:0_monitor_0) on n1 failed (target: 7 vs. rc: 0): Error

The maintenance mode gets removed and pacemaker starts the recurring monitor:

Oct 29 11:32:06 n1 crmd[2879]: notice: te_rsc_command: Initiating action 29: monitor rsc_SAPHana_HDB_HDB00_monitor_61000 on n1 (local)

which now correctly indicates Master status:

Oct 29 11:32:09 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_monitor_61000 (call=55, rc=8, cib-update=110, confirmed=false) master

which is then interpreted as an error because it was unexpected?

Oct 29 11:32:09 n1 crmd[2879]: warning: status_from_rc: Action 29 (rsc_SAPHana_HDB_HDB00_monitor_61000) on n1 failed (target: 0 vs. rc: 8): Error
Oct 29 11:32:09 n1 crmd[2879]: warning: update_failcount: Updating failcount for rsc_SAPHana_HDB_HDB00 on n1 after failed monitor: rc=8 (update=value++, time=1414571529)
Oct 29 11:32:09 n1 attrd[2877]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-rsc_SAPHana_HDB_HDB00 (1)

Fine, so far it is logical. But where does the second error and the second fail-count increase suddenly come from? We have started just one monitor so far...

Oct 29 11:32:09 n1 attrd[2877]: notice: attrd_perform_update: Sent update 33: fail-count-rsc_SAPHana_HDB_HDB00=1
Oct 29 11:32:09 n1 attrd[2877]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-rsc_SAPHana_HDB_HDB00 (1414571529)
Oct 29 11:32:09 n1 attrd[2877]: notice: attrd_perform_update: Sent update 35: last-failure-rsc_SAPHana_HDB_HDB00=1414571529
Oct 29 11:32:09 n1 attrd[2877]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-rsc_SAPHana_HDB_HDB00 (2)

Now it starts recovery by stopping and then starting the master.
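For readers puzzled by the "target vs. rc" wording: as far as I understand it, every action pacemaker schedules carries an expected OCF exit code (the "target-rc"), and any other code is reported as a failure. A probe on a node believed clean expects OCF_NOT_RUNNING (7), so a monitor returning 0 there is "unexpected"; a recurring monitor not configured for the Master role expects OCF_SUCCESS (0), so OCF_RUNNING_MASTER (8) is counted as a failure. A minimal sketch of that comparison (this is an illustration, not the actual pacemaker source):

```shell
# OCF exit codes relevant here (these values are defined by the OCF spec):
OCF_SUCCESS=0          # running (as Slave, for a master/slave resource)
OCF_NOT_RUNNING=7      # stopped -- the target-rc of a probe on a clean node
OCF_RUNNING_MASTER=8   # running as Master

# Sketch of the judgment behind the status_from_rc log lines:
# $1 = target-rc (expected), $2 = actual rc returned by the agent.
status_from_rc() {
    if [ "$2" -eq "$1" ]; then
        echo "ok"
    else
        echo "failed (target: $1 vs. rc: $2): Error"
    fi
}

status_from_rc "$OCF_NOT_RUNNING" "$OCF_SUCCESS"     # probe found it running
# -> failed (target: 7 vs. rc: 0): Error
status_from_rc "$OCF_SUCCESS" "$OCF_RUNNING_MASTER"  # monitor found it Master
# -> failed (target: 0 vs. rc: 8): Error
```

That would explain why rc=8 from the 61-second monitor is treated as a failure here: the operation's expected code is 0, which is presumably why master/slave resources are normally given separate monitor operations for the Master and Slave roles.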
Oct 29 11:32:09 n1 pengine[2878]: notice: LogActions: Recover rsc_SAPHana_HDB_HDB00:0 (Master n1)
Oct 29 11:32:09 n1 crmd[2879]: notice: te_rsc_command: Initiating action 38: demote rsc_SAPHana_HDB_HDB00_demote_0 on n1 (local)
Oct 29 11:32:10 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_demote_0 (call=60, rc=0, cib-update=116, confirmed=true) ok
Oct 29 11:32:10 n1 crmd[2879]: notice: te_rsc_command: Initiating action 6: stop rsc_SAPHana_HDB_HDB00_stop_0 on n1 (local)
Oct 29 11:32:21 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_stop_0 (call=61, rc=0, cib-update=118, confirmed=true) ok
Oct 29 11:32:21 n1 pengine[2878]: notice: LogActions: Start rsc_SAPHana_HDB_HDB00:1 (n1)
Oct 29 11:32:21 n1 crmd[2879]: notice: te_rsc_command: Initiating action 38: start rsc_SAPHana_HDB_HDB00:1_start_0 on n1 (local)
Oct 29 11:33:12 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_start_0 (call=62, rc=0, cib-update=121, confirmed=true) ok
Oct 29 11:33:12 n1 pengine[2878]: notice: LogActions: Promote rsc_SAPHana_HDB_HDB00:0 (Slave -> Master n1)
Oct 29 11:33:12 n1 crmd[2879]: notice: te_rsc_command: Initiating action 38: promote rsc_SAPHana_HDB_HDB00_promote_0 on n1 (local)
Oct 29 11:33:14 n1 crmd[2879]: notice: process_lrm_event: LRM operation rsc_SAPHana_HDB_HDB00_promote_0 (call=63, rc=0, cib-update=124, confirmed=true) ok

So we successfully restarted and promoted the local instance. And suddenly...

Oct 29 11:33:14 n1 pengine[2878]: warning: unpack_rsc_op_failure: Processing failed op monitor for rsc_SAPHana_HDB_HDB00:0 on n1: master (8)
Oct 29 11:33:14 n1 pengine[2878]: notice: LogActions: Demote rsc_SAPHana_HDB_HDB00:0 (Master -> Slave n1)
Oct 29 11:33:14 n1 pengine[2878]: notice: LogActions: Promote rsc_SAPHana_HDB_HDB00:1 (Slave -> Master n2)

Why!?!
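One observation on the last surprise: the unpack_rsc_op_failure warning suggests the policy engine keeps re-processing the recorded monitor failure from the CIB status section until that operation history is cleaned up, which may be why it still wants to demote the freshly promoted master. As a workaround while investigating, the stale failure record can be cleared; a hedged sketch follows (the resource and node names are taken from the logs above, and the exact flag spelling of the CLI tools varies between Pacemaker versions, so treat this as illustrative only):

```shell
# Hedged sketch: clear the recorded monitor failure (and its fail-count)
# so the policy engine stops acting on it. Flags shown are the ones I
# believe crm_resource accepts; check your version's man page.
cleanup_stale_failure() {
    rsc=${1:?resource}; node=${2:?node}
    if command -v crm_resource >/dev/null 2>&1; then
        # --cleanup forgets failed operations and resets the fail-count
        # for the given resource on the given node.
        crm_resource --cleanup --resource "$rsc" --node "$node"
    else
        # Not on a cluster node: just show what would be done.
        echo "pacemaker CLI not found; would clean up $rsc on $node"
    fi
}

cleanup_stale_failure rsc_SAPHana_HDB_HDB00 n1
```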
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org