On 07/11/2023 17:57, lejeczek via Users wrote:
Hi guys,
I have a 3-node pgSQL cluster with PAF. When all three
systems are shut down at virtually the same time, PAF
fails to start once the HA cluster is operational again.
From status:
...
Migration Summary:
  * Node: ubusrv2 (2):
    * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
  * Node: ubusrv3 (3):
    * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
  * Node: ubusrv1 (1):
    * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
Failed Resource Actions:
  * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=84ms
  * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=82ms
  * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=108ms
And all three pgSQL instances show virtually identical logs:
...
2023-11-07 16:54:45.532 UTC [24936] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv4 address "0.0.0.0", port 5433
2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv6 address "::", port 5433
2023-11-07 16:54:45.535 UTC [24936] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-07 16:54:45.547 UTC [24938] LOG: database system was interrupted while in recovery at log time 2023-11-07 15:30:56 UTC
2023-11-07 16:54:45.547 UTC [24938] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-11-07 16:54:45.819 UTC [24938] LOG: entering standby mode
2023-11-07 16:54:45.824 UTC [24938] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory
2023-11-07 16:54:45.825 UTC [24936] LOG: startup process (PID 24938) exited with exit code 1
2023-11-07 16:54:45.825 UTC [24936] LOG: aborting startup due to startup process failure
2023-11-07 16:54:45.826 UTC [24936] LOG: database system is shut down
Is the result of this "test" case, as shown above, expected?
It reproduces every time.
If not, what might I be missing?
Many thanks, L.
Actually, the resource fails to start on a single node
(as opposed to the entire-cluster shutdown I noted
originally) that was powered down in an orderly fashion
and powered back on.
At the time of the power cycle the node was the PAF resource
master, and it fails:
...
2023-11-09 20:35:04.439 UTC [17727] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv4 address "0.0.0.0", port 5433
2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv6 address "::", port 5433
2023-11-09 20:35:04.442 UTC [17727] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-09 20:35:04.452 UTC [17731] LOG: database system was interrupted while in recovery at log time 2023-11-09 20:25:21 UTC
2023-11-09 20:35:04.452 UTC [17731] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-11-09 20:35:04.809 UTC [17731] LOG: entering standby mode
2023-11-09 20:35:04.813 UTC [17731] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory
2023-11-09 20:35:04.814 UTC [17727] LOG: startup process (PID 17731) exited with exit code 1
2023-11-09 20:35:04.814 UTC [17727] LOG: aborting startup due to startup process failure
2023-11-09 20:35:04.815 UTC [17727] LOG: database system is shut down
When the node was shut down, the master role did get moved
over to a standby/slave node properly.
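Both failed starts stop at the same FATAL line about a directory under /var/run/postgresql. Since /var/run is normally a tmpfs on Ubuntu, it may be worth checking on a freshly booted node whether that directory exists before the cluster tries to start the instance. Below is a minimal sketch of such a check, not something taken from PAF itself. The function names `missing_dirs` and `check_dirs` are made up for illustration, and the log is assumed to have one entry per line, as PostgreSQL writes it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of PAF or PostgreSQL.

# Read a PostgreSQL log on stdin and print each directory named in a
# 'could not open directory "...": No such file or directory' FATAL entry.
missing_dirs() {
    sed -n 's/.*FATAL: *could not open directory "\([^"]*\)": No such file or directory.*/\1/p' | sort -u
}

# Read directory paths on stdin and report whether each one exists right now.
check_dirs() {
    while IFS= read -r dir; do
        if [ -d "$dir" ]; then
            echo "present: $dir"
        else
            echo "missing: $dir"
        fi
    done
}
```

It would be run as `missing_dirs < LOGFILE | check_dirs`, where LOGFILE stands for wherever the instance writes its log on the node in question.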
I'm on Ubuntu with:
ii corosync 3.1.6-1ubuntu1 amd64 cluster engine daemon and utilities
ii pacemaker 2.1.2-1ubuntu3.1 amd64 cluster resource manager
ii pacemaker-cli-utils 2.1.2-1ubuntu3.1 amd64 cluster resource manager command line utilities
ii pacemaker-common 2.1.2-1ubuntu3.1 all cluster resource manager common files
ii pacemaker-resource-agents 2.1.2-1ubuntu3.1 all cluster resource manager general resource agents
ii pcs 0.10.11-2ubuntu3 all Pacemaker Configuration System
And here is the resource:
-> $ pcs resource config PGSQL-PAF-5433-clone
Clone: PGSQL-PAF-5433-clone
  Meta Attrs: failure-timeout=20s master-max=1 notify=true promotable=true
  Resource: PGSQL-PAF-5433 (class=ocf provider=heartbeat type=pgsqlms)
    Attributes: bindir=/usr/lib/postgresql/14/bin datadir=/var/lib/postgresql/14/paf pgdata=/etc/postgresql/14/paf pgport=5433
    Operations: demote interval=0s timeout=120s (PGSQL-PAF-5433-demote-interval-0s)
                methods interval=0s timeout=5 (PGSQL-PAF-5433-methods-interval-0s)
                monitor interval=15s role=Master timeout=10s (PGSQL-PAF-5433-monitor-interval-15s)
                monitor interval=16s role=Slave timeout=10s (PGSQL-PAF-5433-monitor-interval-16s)
                notify interval=0s timeout=60s (PGSQL-PAF-5433-notify-interval-0s)
                promote interval=0s timeout=30s (PGSQL-PAF-5433-promote-interval-0s)
                reload interval=0s timeout=20 (PGSQL-PAF-5433-reload-interval-0s)
                start interval=0s timeout=60s (PGSQL-PAF-5433-start-interval-0s)
                stop interval=0s timeout=60s (PGSQL-PAF-5433-stop-interval-0s)
Is this my setup/config, or might there actually be an issue
with PAF and/or HA not handling a node-OS shutdown?
Any and all thoughts are much appreciated.
Thanks, L.