On Wed, 2023-09-13 at 16:45 +0000, Larry G. Mills via Users wrote:
> Hello Pacemaker community,
> 
> I have several two-node postgres 14 clusters that I am migrating from
> EL7 (Scientific Linux 7) to EL9 (AlmaLinux 9.2).
> 
> My configuration:
> 
> Cluster size: two nodes
> Postgres version: 14
> Corosync version: 3.1.7-1.el9
> Pacemaker version: 2.1.5-9.el9_2
> pcs version: 0.11.4-7.el9_2
> 
> The migration has mostly gone smoothly, but I did notice one
> non-trivial change in recovery behavior between EL7 and EL9. The
> recovery scenario is:
> 
> With the cluster running normally with one primary DB (i.e. Promoted)
> and one standby (i.e. Unpromoted), reboot one of the cluster nodes
> without first shutting down the cluster on that node. The reboot is
> a “clean” system shutdown done via either the “reboot” or “shutdown”
> OS commands.

On my RHEL 9 test cluster, both "reboot" and "systemctl reboot" wait
for the cluster to stop everything. I think in some environments
"reboot" is equivalent to "systemctl reboot --force" (kill all
processes immediately), so maybe see if "systemctl reboot" behaves
better for you.
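
One quick way to check what you're actually getting (this assumes the
usual systemd layout; paths may differ on your hosts):

    # On systemd-based systems, "reboot" is normally just a symlink
    # to systemctl, which performs an orderly shutdown of all units:
    ls -l /usr/sbin/reboot
    # typical output: /usr/sbin/reboot -> ../bin/systemctl

    # Orderly reboot: stops units (including pacemaker/corosync, and
    # thus resources) in reverse dependency order before the OS stops:
    systemctl reboot

    # For comparison, this skips stopping units and just kills all
    # remaining processes, so resources are never stopped cleanly:
    systemctl reboot --force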
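
Regarding the unit-file override you mention at the end of your
message: if it does come to that, the usual mechanism is a systemd
drop-in rather than editing the packaged unit. Purely as an untested
sketch (the directive and the 10-minute value are illustrative, not
packaged defaults):

    # "systemctl edit corosync.service" creates (and reloads) a
    # drop-in at /etc/systemd/system/corosync.service.d/override.conf
    [Service]
    # Allow more time for Pacemaker to demote and stop resources
    # before systemd gives up on corosync during shutdown:
    TimeoutStopSec=10min

That said, I'd first confirm how the node is actually being rebooted,
since an orderly "systemctl reboot" should already stop the cluster
before the OS goes down.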

> On EL7, this scenario caused the cluster to shut itself down on the
> node before the OS shutdown completed, and the DB resource was
> stopped/shut down before the OS stopped. On EL9, this is not the
> case: the DB resource is not stopped before the OS shutdown
> completes. This leads to errors being thrown when the cluster is
> started back up on the rebooted node, similar to the following:
> 
> * pgsql probe on mynode returned 'error' (Instance "pgsql"
> controldata indicates a running secondary instance, the instance has
> probably crashed)
> 
> While this is not too serious for a standby DB instance, as the
> cluster is able to recover it back to the standby/Unpromoted state,
> if you reboot the Primary/Promoted DB node, the cluster is not able
> to recover it (because that DB still thinks it’s a primary), and the
> node is fenced.
> 
> Is this intended behavior for the versions of pacemaker/corosync
> that I’m running, or a regression? It may be possible to put an
> override into the systemd unit file for corosync to force the
> cluster to shut down before the OS stops, but I’d rather not do that
> if there’s a better way to handle this recovery scenario.
> 
> Thanks for any advice,
> 
> Larry
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/