Hi! I've set up a couple of test Pacemaker/Corosync/PCS clusters on virtual machines to manage a PostgreSQL service using the PostgreSQL Automatic Failover (PAF) resource agent, available here: https://clusterlabs.github.io/PAF/
The cluster comes up fine, with all 3 nodes online and PostgreSQL running successfully. However, I've hit a nasty stumbling block when restarting a node, or even just issuing `pcs cluster stop` followed by `pcs cluster start` on a node. The operating system on the nodes is Ubuntu 16.04.

First, here's my configuration (from `crm configure show`):

------
node 1: d-gp2-dbp62-1 \
        attributes master-postgresql-10-main=1001
node 2: d-gp2-dbp62-2 \
        attributes master-postgresql-10-main=1000
node 3: d-gp2-dbp62-3 \
        attributes master-postgresql-10-main=990
primitive postgresql-10-main pgsqlms \
        params bindir="/usr/lib/postgresql/10/bin" pgdata="/var/lib/postgresql/10/main" pghost="/var/run/postgresql/10/main" pgport=5432 recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
        op start interval=0s timeout=60s \
        op stop interval=0s timeout=60s \
        op promote interval=0s timeout=30s \
        op demote interval=0s timeout=120s \
        op monitor interval=15s role=Master timeout=10s \
        op monitor interval=16s role=Slave timeout=10s \
        op notify interval=0s timeout=60s
primitive postgresql-master-vip IPaddr2 \
        params ip=10.124.167.174 cidr_netmask=22 \
        op start interval=0s timeout=20s \
        op stop interval=0s timeout=20s \
        op monitor interval=10s
ms postgresql-ha postgresql-10-main \
        meta notify=true target-role=Master
order demote-postgresql-then-remove-vip Mandatory: postgresql-ha:demote postgresql-master-vip:stop symmetrical=false
colocation postgresql-master-with-vip inf: postgresql-master-vip:Started postgresql-ha:Master
order promote-postgresql-then-add-vip Mandatory: postgresql-ha:promote postgresql-master-vip:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=d-gp2-dbp62 \
        stonith-enabled=false
rsc_defaults rsc_defaults-options: \
        migration-threshold=5 \
        resource-stickiness=10
------

With the resource running on the first node, here's the output of `pcs status`; everything looks great:

------
Cluster name: d-gp2-dbp62
Last updated: Mon Apr  2 19:43:52 2018
Last change: Mon Apr  2 19:41:35 2018 by root via crm_resource on d-gp2-dbp62-1
Stack: corosync
Current DC: d-gp2-dbp62-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3 ]

Full list of resources:

 postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbp62-1
 Master/Slave Set: postgresql-ha [postgresql-10-main]
     Masters: [ d-gp2-dbp62-1 ]
     Slaves: [ d-gp2-dbp62-2 d-gp2-dbp62-3 ]
------

Now, if I restart the second node and run `pcs cluster start` once it's back up, it fails to start the resource and shows me this in the `pcs status` output:

------
* postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11, status=complete, exitreason='Instance "postgresql-10-main" failed to start (rc: 1)',
    last-rc-change='Mon Apr  2 19:50:40 2018', queued=0ms, exec=228ms
------

Looking in corosync.log, I see the following:

------
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_execute:  executing - rsc:postgresql-10-main action:start call_id:11
pgsqlms(postgresql-10-main)[2092]:      2018/04/02_19:50:40  ERROR: Instance "postgresql-10-main" failed to start (rc: 1)
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:   postgresql-10-main_start_0:2092:stderr [ pg_ctl: could not start server ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:   postgresql-10-main_start_0:2092:stderr [ Examine the log output. ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:   postgresql-10-main_start_0:2092:stderr [ ocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1) ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_finished: finished - rsc:postgresql-10-main action:start call_id:11 pid:2092 exit-code:1 exec-time:228ms queue-time:0ms
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:     info: action_synced_wait:   Managed pgsqlms_meta-data_0 process 2115 exited with rc=0
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:    Operation postgresql-10-main_start_0: unknown error (node=d-gp2-dbp62-2, call=11, rc=1, cib-update=15, confirmed=true)
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:    d-gp2-dbp62-2-postgresql-10-main_start_0:11 [ pg_ctl: could not start server\nExamine the log output.\nocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1)\n ]
------

It tells me to examine the PostgreSQL log output, so I look there, but nothing at all has been logged since the server shut down; there are no new entries indicating that a start was even attempted. I can repeat the `pcs cluster stop`/`pcs cluster start` cycle any number of times, and the problem remains.

Now for the weird part: the workaround I inadvertently discovered while trying to figure out what was going wrong. First, I run `pcs cluster stop` again. Then I start the service through systemd with `service postgresql@10-main start`, and it comes up just fine; I then shut it back down with `service postgresql@10-main stop`, which should put the node back in its original state. Now, when I issue `pcs cluster start`, everything comes up as expected with no errors!?!
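To make the workaround easy to reproduce, here it is spelled out as a sketch. The commands are only echoed here (via the `run` wrapper) so the script can be read and tried anywhere; on an actual cluster node you would drop the wrapper and run them directly, as root:

```shell
#!/bin/sh
# Sketch of the workaround sequence described above. Commands are echoed
# rather than executed so this is harmless outside a cluster node; redefine
# run() as "$@" (or remove the prefix) to actually perform the steps.
run() { echo "+ $*"; }

run pcs cluster stop                    # take the node out of the cluster again
run service postgresql@10-main start    # PostgreSQL starts fine via systemd
run service postgresql@10-main stop     # ...and stops, back to the original state
run pcs cluster start                   # now the resource starts without error
```

The order matters: the systemd start/stop cycle has to happen while the node is out of the cluster, before `pcs cluster start` hands the resource back to Pacemaker.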
I have also seen this "unknown error" come up at other inconvenient times, such as during a manual failover with `pcs cluster stop` on the primary or a `pcs resource move --master ...` command. However, once the workaround is applied to the affected node, everything works perfectly until that node is rebooted again.

Can anyone explain what is happening here, and how I can fix it properly?

Best wishes,
--
Casey

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org