Bug#705546: pacemaker fails to take action on clones and master/slave resources on-fail

gustavo panizzo <gfa> Tue, 16 Apr 2013 07:42:23 -0700

Package: pacemaker
Version: 1.1.7-1
Severity: normal

using pacemaker from wheezy i found that on-fail settings are not honored on
clones and master/slave resources. the problem has already been reported
upstream and they have released a fix; i'm asking for the attached fix to be
included in debian.


the attached patch is the upstream patch with minor (cosmetic) differences so
that it applies cleanly to the debian sources.


thanks!

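before patch (the monitor op with interval="7" has on-fail="stop", but after
the failure the resource is stopped and then started again):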

# crm resource show msPostgresql
resource msPostgresql is running on: infra02
resource msPostgresql is running on: infra01 Master

# crm configure show msPostgresql
ms msPostgresql pgsql \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" is-managed="true"

# crm configure show pgsql
primitive pgsql ocf:local:pgsql \
params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" psql="/usr/bin/psql" pgdata="/var/lib/postgresql/9.1/main" start_opt="-p 5432" rep_mode="sync" node_list="infra01 infra02" restore_command="cp /var/lib/postgresql/9.1/archive/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.111.12" stop_escalate="0" config="/etc/postgresql/9.1/main/postgresql.conf" tmpdir="/var/lib/postgresql/tmp" pgctldata="/usr/lib/postgresql/9.1/bin/pg_controldata" repuser="repl" \
op start interval="0" timeout="120" on-fail="restart" \
op monitor interval="7" timeout="120" on-fail="stop" \
op monitor interval="2" role="Master" timeout="60" on-fail="restart" \
op promote interval="0" timeout="120" on-fail="restart" \
op demote interval="0" timeout="120" on-fail="stop" \
op stop interval="0" timeout="120" on-fail="block" \
op notify interval="0" timeout="90"

# kill `cat /var/run/postgresql/9.1-main.pid `

pgsql log
Apr 15 16:12:17 infra02 postgres[39723]: [2-1] 2013-04-15 16:12:17 ART LOG:  
received smart shutdown request
Apr 15 16:12:17 infra02 postgres[39769]: [1-1] 2013-04-15 16:12:17 ART LOG:  
shutting down
Apr 15 16:12:17 infra02 postgres[39769]: [2-1] 2013-04-15 16:12:17 ART LOG:  
database system is shut down

cluster log
Apr 15 16:12:17 infra02 pgsql[41389]: INFO: PostgreSQL is down
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_monitor_7000 (call=84, rc=7, cib-update=89, confirmed=false) not running
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_ais_dispatch: Update 
relayed from infra01
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-pgsql:0 (13)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent 
update 270: fail-count-pgsql:0=13
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_ais_dispatch: Update 
relayed from infra01
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-pgsql:0 (1366053137)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent 
update 272: last-failure-pgsql:0=1366053137
Apr 15 16:12:17 infra02 lrmd: [1438]: info: rsc:pgsql:0 notify[85] (pid 41435)
Apr 15 16:12:17 infra02 lrmd: [1438]: info: operation notify[85] on pgsql:0 for 
client 1441: pid 41435 exited with return code 0
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_notify_0 (call=85, rc=0, cib-update=0, confirmed=true) ok
Apr 15 16:12:17 infra02 lrmd: [1438]: info: cancel_op: operation monitor[84] on 
pgsql:0 for client 1441, its parameters: 
pgctl=[/usr/lib/postgresql/9.1/bin/pg_ctl] CRM_meta_clone=[0] 
config=[/etc/postgresql/9.1/main/postgresql.conf] CRM_meta_clone_max=[2] 
CRM_meta_globally_unique=[false] CRM_meta_notify_master_uname=[infra01 ] 
CRM_meta_notify_promote_uname=[ ] tmpdir=[/var/lib/postgresql/tmp] 
CRM_meta_notify_active_uname=[ ] start_opt=[-p 5432] 
CRM_meta_notify_stop_resource=[ ] CRM_meta_name=[monitor] 
CRM_meta_interval=[7000] CRM_meta_clone_node_max=[1] crm_fe cancelled
Apr 15 16:12:17 infra02 lrmd: [1438]: info: rsc:pgsql:0 stop[86] (pid 41471)
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_monitor_7000 (call=84, status=1, cib-update=0, confirmed=true) Cancelled
Apr 15 16:12:17 infra02 pgsql[41471]: INFO: PostgreSQL is already stopped.
Apr 15 16:12:17 infra02 pgsql[41471]: INFO: Changing pgsql-status on infra02 : 
HS:alone->STOP.
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: pgsql-status (STOP)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent 
update 274: pgsql-status=STOP
Apr 15 16:12:17 infra02 lrmd: [1438]: info: operation stop[86] on pgsql:0 for 
client 1441: pid 41471 exited with return code 0
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_stop_0 (call=86, rc=0, cib-update=90, confirmed=true) ok
Apr 15 16:12:18 infra02 lrmd: [1438]: info: rsc:pgsql:0 start[87] (pid 41525)
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: Set all nodes into async mode.
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: My Timeline ID and Checkpoint : 
7:00000000160000D0
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: infra01 master baseline : 
7:0000000017000070
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: server starting
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: PostgreSQL start command sent.
Apr 15 16:12:18 infra02 lrmd: [1438]: info: RA output: (pgsql:0:start:stderr) 
psql: could not connect to server: No such file or directory#012#011Is the 
server running locally and accepting#012#011connections on Unix domain socket 
"/var/run/postgresql/.s.PGSQL.5432"?
Apr 15 16:12:18 infra02 pgsql[41525]: WARNING: PostgreSQL template1 isn't 
running
Apr 15 16:12:18 infra02 pgsql[41525]: WARNING: Connection error (connection to 
the server went bad and the session was not interactive) occurred while 
executing the psql command.
Apr 15 16:12:19 infra02 pgsql[41525]: INFO: PostgreSQL is started.
Apr 15 16:12:19 infra02 pgsql[41525]: INFO: Changing pgsql-status on infra02 : 
STOP->HS:alone.
Apr 15 16:12:19 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: pgsql-status (HS:alone)
Apr 15 16:12:19 infra02 attrd: [1439]: notice: attrd_perform_update: Sent 
update 276: pgsql-status=HS:alone
Apr 15 16:12:19 infra02 lrmd: [1438]: info: operation start[87] on pgsql:0 for 
client 1441: pid 41525 exited with return code 0
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_start_0 (call=87, rc=0, cib-update=91, confirmed=true) ok
Apr 15 16:12:19 infra02 lrmd: [1438]: info: rsc:pgsql:0 notify[88] (pid 41771)
Apr 15 16:12:19 infra02 lrmd: [1438]: info: operation notify[88] on pgsql:0 for 
client 1441: pid 41771 exited with return code 0
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_notify_0 (call=88, rc=0, cib-update=0, confirmed=true) ok
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation 
pgsql:0_monitor_7000 (call=89, rc=0, cib-update=92, confirmed=false) ok


after patch (the same failure now ends with the stop; no start follows in the
log):

# kill `cat /var/run/postgresql/9.1-main.pid `

cluster log
Apr 16 11:21:05 infra02 pgsql[100164]: INFO: PostgreSQL is down
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation 
pgsql:0_monitor_7000 (call=15, rc=7, cib-update=24, confirmed=false) not running
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_ais_dispatch: Update 
relayed from infra01
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-pgsql:0 (1)
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent 
update 47: fail-count-pgsql:0=1
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_ais_dispatch: Update 
relayed from infra01
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-pgsql:0 (1366122065)
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent 
update 50: last-failure-pgsql:0=1366122065
Apr 16 11:21:05 infra02 lrmd: [97195]: info: rsc:pgsql:0 notify[24] (pid 100206)
Apr 16 11:21:05 infra02 lrmd: [97195]: info: operation notify[24] on pgsql:0 
for client 97198: pid 100206 exited with return code 0
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation 
pgsql:0_notify_0 (call=24, rc=0, cib-update=0, confirmed=true) ok
Apr 16 11:21:05 infra02 lrmd: [97195]: info: cancel_op: operation monitor[15] 
on pgsql:0 for client 97198, its parameters: 
pgctl=[/usr/lib/postgresql/9.1/bin/pg_ctl] CRM_meta_clone=[0] 
config=[/etc/postgresql/9.1/main/postgresql.conf] CRM_meta_clone_max=[2] 
CRM_meta_globally_unique=[false] CRM_meta_notify_master_uname=[ ] 
CRM_meta_notify_promote_uname=[ ] tmpdir=[/var/lib/postgresql/tmp] 
CRM_meta_notify_active_uname=[ ] start_opt=[-p 5432] 
CRM_meta_notify_stop_resource=[ ] CRM_meta_name=[monitor] 
CRM_meta_interval=[7000] CRM_meta_clone_node_max=[1] crm_feature_ cancelled
Apr 16 11:21:05 infra02 lrmd: [97195]: info: rsc:pgsql:0 stop[25] (pid 100241)
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation 
pgsql:0_monitor_7000 (call=15, status=1, cib-update=0, confirmed=true) Cancelled
Apr 16 11:21:05 infra02 pgsql[100241]: INFO: PostgreSQL is already stopped.
Apr 16 11:21:05 infra02 pgsql[100241]: INFO: Changing pgsql-status on infra02 : 
HS:alone->STOP.
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: pgsql-status (STOP)
Apr 16 11:21:05 infra02 lrmd: [97195]: info: operation stop[25] on pgsql:0 for 
client 97198: pid 100241 exited with return code 0
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent 
update 52: pgsql-status=STOP
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation 
pgsql:0_stop_0 (call=25, rc=0, cib-update=25, confirmed=true) ok


-- System Information:
Debian Release: 7.0
  APT prefers testing
  APT policy: (900, 'testing'), (500, 'testing-updates'), (300, 'unstable'), 
(1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.2.0-4-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Description: fixes a bug on cloned and master/slave resources handling during
 failures.
 .
Author: gustavo panizzo <g...@zumbi.com.ar>

Origin: upstream, https://github.com/beekhof/pacemaker/commit/6a48a8b
Bug-Debian: 
Forwarded: not-needed
Last-Update: <2013-04-16>

--- pacemaker-1.1.7.orig/lib/pengine/utils.c
+++ pacemaker-1.1.7/lib/pengine/utils.c
@@ -544,7 +544,6 @@ unpack_operation(action_t * action, xmlN
 
     unpack_instance_attributes(data_set->input, xml_obj, XML_TAG_ATTR_SETS,
                                NULL, action->meta, NULL, FALSE, data_set->now);
-
     g_hash_table_remove(action->meta, "id");
 
     class = g_hash_table_lookup(action->rsc->meta, "class");
@@ -785,12 +784,19 @@ find_rsc_op_entry(resource_t * rsc, cons
             }
 
             match_key = generate_op_key(rsc->id, name, number);
-
             if (safe_str_eq(key, match_key)) {
                 op = operation;
             }
             crm_free(match_key);
 
+            if(rsc->clone_name) {
+                match_key = generate_op_key(rsc->clone_name, name, number);
+                if (safe_str_eq(key, match_key)) {
+                    op = operation;
+                }
+                crm_free(match_key);
+            }
+
             if (op != NULL) {
                 crm_free(local_key);
                 return op;
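
a note on the second hunk: the sketch below is a standalone illustration, not
pacemaker code, of why the lookup is tried under two names. it assumes
generate_op_key() yields keys of the form <rsc>_<op>_<interval>, which is the
shape seen in the logs above (pgsql:0_monitor_7000). make_op_key() and
op_key_matches() are hypothetical names, and the sketch makes no claim about
which of rsc->id or rsc->clone_name carries the ":0" suffix in 1.1.7.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for generate_op_key(): builds "<rsc>_<op>_<interval>",
 * the key shape that appears in the cluster log. */
void make_op_key(char *buf, size_t len, const char *rsc, const char *op,
                 int interval_ms)
{
    snprintf(buf, len, "%s_%s_%d", rsc, op, interval_ms);
}

/* Returns 1 if key names this operation under either the resource id or the
 * clone name, 0 otherwise. clone_name may be NULL, mirroring the
 * if(rsc->clone_name) guard in the patch. */
int op_key_matches(const char *key, const char *id, const char *clone_name,
                   const char *op, int interval_ms)
{
    char match_key[256];

    assert(key != NULL && id != NULL && op != NULL);

    /* First attempt: the only comparison made before the patch. */
    make_op_key(match_key, sizeof(match_key), id, op, interval_ms);
    if (strcmp(key, match_key) == 0) {
        return 1;
    }

    /* Second attempt: the comparison added by the patch. */
    if (clone_name != NULL) {
        make_op_key(match_key, sizeof(match_key), clone_name, op, interval_ms);
        if (strcmp(key, match_key) == 0) {
            return 1;
        }
    }
    return 0;
}
```

with only the first comparison, a key built from one name never matches an
operation looked up under the other, so the configured operation (and its
on-fail value) is not found; the second comparison covers that case.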
