Hi!

I've set up a couple of test Pacemaker/Corosync/PCS clusters on virtual machines 
to manage a PostgreSQL service using the PostgreSQL Automatic Failover (PAF) 
resource agent, available here: https://clusterlabs.github.io/PAF/

The cluster comes up fine with 3 nodes and PostgreSQL running successfully.  
However, I've run into a nasty stumbling block when attempting to restart a 
node, or even when just issuing `pcs cluster stop` followed by `pcs cluster 
start` on a node.

The operating system on the nodes is Ubuntu 16.04.
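
For context, the clusters were stood up with the usual pcs workflow, roughly 
as follows (reconstructed from memory, so treat it as approximate rather than 
an exact transcript):

------
# pacemaker, corosync, pcs and the PAF resource agent installed on every node,
# then from one node:
pcs cluster auth d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3
pcs cluster setup --name d-gp2-dbp62 d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3
pcs cluster start --all
pcs property set stonith-enabled=false   # test cluster only
------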

Firstly, here's my configuration (using `crm configure show`):

------
node 1: d-gp2-dbp62-1 \
        attributes master-postgresql-10-main=1001
node 2: d-gp2-dbp62-2 \
        attributes master-postgresql-10-main=1000
node 3: d-gp2-dbp62-3 \
        attributes master-postgresql-10-main=990
primitive postgresql-10-main pgsqlms \
        params bindir="/usr/lib/postgresql/10/bin" \
               pgdata="/var/lib/postgresql/10/main" \
               pghost="/var/run/postgresql/10/main" \
               pgport=5432 \
               recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" \
               start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
        op start interval=0s timeout=60s \
        op stop interval=0s timeout=60s \
        op promote interval=0s timeout=30s \
        op demote interval=0s timeout=120s \
        op monitor interval=15s role=Master timeout=10s \
        op monitor interval=16s role=Slave timeout=10s \
        op notify interval=0s timeout=60s
primitive postgresql-master-vip IPaddr2 \
        params ip=10.124.167.174 cidr_netmask=22 \
        op start interval=0s timeout=20s \
        op stop interval=0s timeout=20s \
        op monitor interval=10s
ms postgresql-ha postgresql-10-main \
        meta notify=true target-role=Master
order demote-postgresql-then-remove-vip Mandatory: postgresql-ha:demote postgresql-master-vip:stop symmetrical=false
colocation postgresql-master-with-vip inf: postgresql-master-vip:Started postgresql-ha:Master
order promote-postgresql-then-add-vip Mandatory: postgresql-ha:promote postgresql-master-vip:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=d-gp2-dbp62 \
        stonith-enabled=false
rsc_defaults rsc_defaults-options: \
        migration-threshold=5 \
        resource-stickiness=10
------
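
For anyone more comfortable reading pcs syntax, the resource and constraints 
above amount to roughly the following (an untested transcription, shown only 
for illustration):

------
pcs resource create postgresql-10-main ocf:heartbeat:pgsqlms \
    bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main \
    pghost=/var/run/postgresql/10/main pgport=5432 \
    recovery_template=/etc/postgresql/10/main/recovery.conf.pcmk \
    start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s role=Master timeout=10s \
    op monitor interval=16s role=Slave timeout=10s \
    op notify timeout=60s
pcs resource master postgresql-ha postgresql-10-main notify=true
pcs resource create postgresql-master-vip ocf:heartbeat:IPaddr2 \
    ip=10.124.167.174 cidr_netmask=22 op monitor interval=10s
pcs constraint colocation add postgresql-master-vip with master postgresql-ha INFINITY
pcs constraint order promote postgresql-ha then start postgresql-master-vip \
    symmetrical=false kind=Mandatory
pcs constraint order demote postgresql-ha then stop postgresql-master-vip \
    symmetrical=false kind=Mandatory
------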


With the resource running on the first node, here's the output of `pcs status`, 
and everything looks great:


------
Cluster name: d-gp2-dbp62
Last updated: Mon Apr  2 19:43:52 2018          Last change: Mon Apr  2 19:41:35 2018 by root via crm_resource on d-gp2-dbp62-1
Stack: corosync
Current DC: d-gp2-dbp62-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3 ]

Full list of resources:

 postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbp62-1
 Master/Slave Set: postgresql-ha [postgresql-10-main]
     Masters: [ d-gp2-dbp62-1 ]
     Slaves: [ d-gp2-dbp62-2 d-gp2-dbp62-3 ]
------


Now, if I restart the second node and execute `pcs cluster start` once it's 
back up, the cluster fails to start the resource and shows me this in the 
`pcs status` output:


------
* postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11, status=complete, exitreason='Instance "postgresql-10-main" failed to start (rc: 1)',
    last-rc-change='Mon Apr  2 19:50:40 2018', queued=0ms, exec=228ms
------


Looking in corosync.log, I see the following:


------
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_execute: executing - rsc:postgresql-10-main action:start call_id:11
pgsqlms(postgresql-10-main)[2092]:      2018/04/02_19:50:40  ERROR: Instance "postgresql-10-main" failed to start (rc: 1)
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ pg_ctl: could not start server ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ Examine the log output. ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ ocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1) ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_finished:        finished - rsc:postgresql-10-main action:start call_id:11 pid:2092 exit-code:1 exec-time:228ms queue-time:0ms
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:     info: action_synced_wait:  Managed pgsqlms_meta-data_0 process 2115 exited with rc=0
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:   Operation postgresql-10-main_start_0: unknown error (node=d-gp2-dbp62-2, call=11, rc=1, cib-update=15, confirmed=true)
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:   d-gp2-dbp62-2-postgresql-10-main_start_0:11 [ pg_ctl: could not start server\nExamine the log output.\nocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1)\n ]
------


It tells me to examine the PostgreSQL log output, so I look there, but I don't 
see anything logged at all since the server was shut down.  There are no new 
log entries indicating that a start was even attempted.  I can repeat the 
`pcs cluster stop`/`pcs cluster start` commands any number of times, and the 
problem remains.
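
For completeness, this is roughly what I'm checking after each failed start 
(the log path is the Ubuntu default for this cluster; the bindir and pgdata 
paths come from the configuration above):

------
# PostgreSQL's own log -- shows nothing new after the failed start:
tail -n 50 /var/log/postgresql/postgresql-10-main.log

# state of the data directory, as recorded by PostgreSQL itself:
sudo -u postgres /usr/lib/postgresql/10/bin/pg_controldata \
    /var/lib/postgresql/10/main | grep 'cluster state'
------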

Now for the weird part: the workaround I inadvertently discovered while trying 
to figure out what was going wrong.  First, I do a `pcs cluster stop` again.  
Then I start PostgreSQL directly through systemd with 
`service postgresql@10-main start`, and it starts up just fine; I then shut it 
back down with `service postgresql@10-main stop`, which should leave everything 
in its original state.  Now, when I issue `pcs cluster start`, the resource 
comes up just fine as expected, with no errors!?!
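
In other words, the full workaround sequence on the affected node is:

------
pcs cluster stop
service postgresql@10-main start   # starts cleanly outside the cluster
service postgresql@10-main stop    # shut it back down
pcs cluster start                  # now the resource starts without error
------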

I have also seen this "unknown error" come up at other undesirable times, such 
as during a manual failover using `pcs cluster stop` on the primary or a 
`pcs resource move --master ...` command.  However, once the workaround is 
applied to the node having issues, that node works perfectly fine until it is 
rebooted again.
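
For reference, a manual failover attempt that triggers it looks roughly like 
one of these (the target node name here is just an example):

------
# either stop the cluster on the current primary...
pcs cluster stop

# ...or move the master role explicitly:
pcs resource move --master postgresql-ha d-gp2-dbp62-2
------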

Can anyone explain what is happening here, and how I can fix it properly?

Best wishes,
-- 
Casey