Re: [ClusterLabs] PostgreSQL HA on EL9

2023-09-13 Thread Larry G. Mills via Users
> 
> On my RHEL 9 test cluster, both "reboot" and "systemctl reboot" wait
> for the cluster to stop everything.
> 
> I think in some environments "reboot" is equivalent to "systemctl
> reboot --force" (kill all processes immediately), so maybe see if
> "systemctl reboot" is better.
> 
> >
> > On EL7, this scenario caused the cluster to shut itself down on the
> > node before the OS shutdown completed, and the DB resource was
> > stopped/shutdown before the OS stopped.  On EL9, this is not the
> > case, the DB resource is not stopped before the OS shutdown
> > completes.  This leads to errors being thrown when the cluster is
> > started back up on the rebooted node similar to the following:
> >

Ken,

Thanks for the reply - it's interesting that RHEL 9 behaves as expected while 
AL9 seemingly doesn't.  I did try shutting down via "systemctl reboot", but the 
cluster and resources were still not stopped cleanly before the OS stopped.  In 
fact, on AL9.2 the "shutdown" and "reboot" commands are just symlinks to 
systemctl, which explains why the behavior is the same.
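
In case anyone wants to check that on their own box (a quick sketch; paths
assume the usual EL9 layout):

  # both should resolve to /usr/bin/systemctl if they really are symlinks
  readlink -f /usr/sbin/reboot /usr/sbin/shutdown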

Just as a point of reference, my systemd version is: systemd.x86_64 
252-14.el9_2.3

Larry


Re: [ClusterLabs] PostgreSQL HA on EL9

2023-09-13 Thread Ken Gaillot
On Wed, 2023-09-13 at 16:45 +, Larry G. Mills via Users wrote:
> Hello Pacemaker community,
>  
> I have several two-node postgres 14 clusters that I am migrating from
> EL7 (Scientific Linux 7) to EL9 (AlmaLinux 9.2).
>  
> My configuration:
>  
> Cluster size: two nodes
> Postgres version: 14
> Corosync version: 3.1.7-1.el9  
> Pacemaker version: 2.1.5-9.el9_2
> pcs version: 0.11.4-7.el9_2
>  
> The migration has mostly gone smoothly, but I did notice one non-
> trivial change in recovery behavior between EL7 and EL9.  The
> recovery scenario is:
>  
> With the cluster running normally with one primary DB (i.e. Promoted)
> and one standby (i.e. Unpromoted), reboot one of the cluster nodes
> without first shutting down the cluster on that node.  The reboot is
> a “clean” system shutdown done via either the “reboot” or “shutdown”
> OS commands.

On my RHEL 9 test cluster, both "reboot" and "systemctl reboot" wait
for the cluster to stop everything.

I think in some environments "reboot" is equivalent to "systemctl
reboot --force" (kill all processes immediately), so maybe see if
"systemctl reboot" is better.

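For anyone comparing the two, the difference (as I understand systemctl's
behavior) is roughly:

  # graceful: stop units (pacemaker/corosync included) in dependency order
  systemctl reboot

  # forced: skip stopping units, terminate all processes immediately
  systemctl reboot --force
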
>  
> On EL7, this scenario caused the cluster to shut itself down on the
> node before the OS shutdown completed, and the DB resource was
> stopped/shut down before the OS stopped.  On EL9, this is not the
> case: the DB resource is not stopped before the OS shutdown
> completes.  This leads to errors similar to the following being
> thrown when the cluster is started back up on the rebooted node:
> 
>   * pgsql probe on mynode returned 'error' (Instance "pgsql"
> controldata indicates a running secondary instance, the instance has
> probably crashed)
>  
> This is not too serious for a standby DB instance, since the cluster
> is able to recover it back to the standby/Unpromoted state.  If you
> reboot the Primary/Promoted DB node, however, the cluster is not
> able to recover it (because that DB still thinks it's a primary),
> and the node is fenced.
>  
> Is this the intended behavior for the versions of pacemaker/corosync
> that I'm running, or a regression?  It may be possible to put an
> override into the systemd unit file for corosync to force the
> cluster to shut down before the OS stops, but I'd rather not do that
> if there's a better way to handle this recovery scenario.
>  
> Thanks for any advice,
>  
> Larry
-- 
Ken Gaillot 



[ClusterLabs] PostgreSQL HA on EL9

2023-09-13 Thread Larry G. Mills via Users
Hello Pacemaker community,

I have several two-node postgres 14 clusters that I am migrating from EL7 
(Scientific Linux 7) to EL9 (AlmaLinux 9.2).

My configuration:

Cluster size: two nodes
Postgres version: 14
Corosync version: 3.1.7-1.el9
Pacemaker version: 2.1.5-9.el9_2
pcs version: 0.11.4-7.el9_2

The migration has mostly gone smoothly, but I did notice one non-trivial change 
in recovery behavior between EL7 and EL9.  The recovery scenario is:

With the cluster running normally with one primary DB (i.e. Promoted) and one 
standby (i.e. Unpromoted), reboot one of the cluster nodes without first 
shutting down the cluster on that node.  The reboot is a "clean" system 
shutdown done via either the "reboot" or "shutdown" OS commands.

On EL7, this scenario caused the cluster to shut itself down on the node before 
the OS shutdown completed, and the DB resource was stopped/shut down before the 
OS stopped.  On EL9, this is not the case: the DB resource is not stopped 
before the OS shutdown completes.  This leads to errors similar to the 
following being thrown when the cluster is started back up on the rebooted node:

  * pgsql probe on mynode returned 'error' (Instance "pgsql" controldata 
indicates a running secondary instance, the instance has probably crashed)

This is not too serious for a standby DB instance, since the cluster is able to 
recover it back to the standby/Unpromoted state.  If you reboot the 
Primary/Promoted DB node, however, the cluster is not able to recover it 
(because that DB still thinks it's a primary), and the node is fenced.
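
In case it's useful, a quick way to see what state the instance was left in
after such a reboot (a sketch; "$PGDATA" stands for whatever data directory the
resource manages):

  # a primary that was killed rather than stopped cleanly still reports
  # "in production"; a cleanly stopped standby reports "shut down in recovery"
  pg_controldata "$PGDATA" | grep 'Database cluster state'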

Is this the intended behavior for the versions of pacemaker/corosync that I'm 
running, or a regression?  It may be possible to put an override into the 
systemd unit file for corosync to force the cluster to shut down before the OS 
stops, but I'd rather not do that if there's a better way to handle this 
recovery scenario.
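
For illustration, the kind of override I have in mind would be a small drop-in
(an untested sketch - the file name, ordering target and timeout value are
assumptions, not a verified fix):

  # /etc/systemd/system/corosync.service.d/local-shutdown.conf
  [Unit]
  # order corosync (and pacemaker, which Requires= it) to stop before the
  # network is torn down during shutdown
  After=network-online.target

  [Service]
  # give the cluster time to stop resources (e.g. the DB) cleanly
  TimeoutStopSec=600

followed by a "systemctl daemon-reload".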

Thanks for any advice,

Larry


Re: [ClusterLabs] MySQL cluster with auto failover

2023-09-13 Thread Ken Gaillot
On Tue, 2023-09-12 at 10:28 +0200, Damiano Giuliani wrote:
> thanks Ken,
> 
> could you point me in the right direction for a guide or some already
> working configuration?
> 
> Thanks
> 
> Damiano

Nothing specific to galera, just the usual Pacemaker Explained
documentation about clones.

There are some regression tests in the code base that include galera
resources. Some use clones and others bundles (containerized). For
example:

https://github.com/ClusterLabs/pacemaker/blob/main/cts/scheduler/xml/unrunnable-2.xml
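
As a rough idea of how such a clone could be defined with pcs 0.11 (a minimal
sketch only - the agent parameters, node names and timeouts are placeholders
that would need tuning for a real galera cluster):

  pcs resource create galera ocf:heartbeat:galera \
      wsrep_cluster_address="gcomm://node1,node2,node3" \
      enable_creation=true \
      op promote timeout=300s on-fail=block \
      promotable promoted-max=3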


> 
> > On Mon, 11 Sep 2023 at 16:26, Ken Gaillot <kgail...@redhat.com>
> > wrote:
> > On Thu, 2023-09-07 at 10:27 +0100, Antony Stone wrote:
> > > On Wednesday 06 September 2023 at 17:01:24, Damiano Giuliani
> > wrote:
> > > 
> > > > Everything is clear now.
> > > > So the point is to use pacemaker and create the floating vip and
> > > > bind it to sqlproxy to health check and route the traffic to the
> > > > available and healthy galera nodes.
> > > 
> > > Good summary.
> > > 
> > > > Could it be useful to let pacemaker also manage the galera services?
> > > 
> > > No; MySQL / Galera needs to be running on all nodes all the
> > > time.  Pacemaker 
> > > is for managing resources which move between nodes.
> > 
> > It's still helpful to configure galera as a clone in the cluster.
> > That way, Pacemaker can monitor it and restart it on errors, it
> > will respect things like maintenance mode and standby, and it can
> > be used in ordering constraints with other resources, as well as
> > advanced features such as node utilization.
> > 
> > > 
> > > If you want something that ensures processes are running on
> > > machines, irrespective of where the floating IP is, look at
> > > monit - it's very simple, easy to configure, and knows how to
> > > manage resources which should run all the time.
> > > 
> > > > Do you have any guide that puts all of this together?
> > > 
> > > No; I've largely made this stuff up myself as I've needed it.
> > > 
> > > 
> > > Antony.
-- 
Ken Gaillot 



Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-13 Thread Jehan-Guillaume de Rorthais via Users
On Wed, 13 Sep 2023 17:32:01 +0200
lejeczek via Users  wrote:

> On 08/09/2023 17:29, Jehan-Guillaume de Rorthais wrote:
> > On Fri, 8 Sep 2023 16:52:53 +0200
> > lejeczek via Users  wrote:
> >  
> >> Hi guys.
> >>
> >> Before I start fiddling and break things I wonder if
> >> somebody knows: can pgSQL work with wal_level = archive for PAF?
> >> Or, a more general question which pertains to wal_level - can
> >> _barman_ be used with pgSQL "under" PAF?
> > PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so
> > it can have hot standbys where it can connect and query their status.
> >
> > Wal level "replica" includes the archive level, so you can set up archiving.
> >
> > Of course you can use barman or any other tools to manage your PITR Backups,
> > even when Pacemaker/PAF is looking at your instances. This is even the very
> > first step you should focus on during your journey to HA.
> >
> > Regards,  
> and with _barman_ specifically - is one method preferred or
> recommended over the other: streaming vs rsync - for/with PAF?

PAF doesn't need PITR, nor does it have a preferred method for it. The two are
not related.

The PITR procedure and tooling you are setting up are there for disaster
recovery (your PRA, i.e. disaster recovery plan), not for continuity of service
(your PCA, i.e. business continuity plan). So feel free to choose the tools
that help you achieve **YOUR** RTO/RPO targets in case of disaster.

The only vaguely related subject between PAF and your PITR tooling is how fast
and easily you'll be able to set up a standby from backups if needed.

Just avoid replication slots if possible: if something goes wrong, slots must
keep WALs around and they can quickly fill your filesystem. If you must use
them, set max_slot_wal_keep_size.
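
Put together, a minimal postgresql.conf sketch of the pieces mentioned in this
thread (the barman host/server names and the size are placeholders, not
recommendations):

  wal_level = replica              # what PAF needs; also covers archiving
  archive_mode = on
  archive_command = 'barman-wal-archive backup-host main %p'   # needs barman-cli
  max_slot_wal_keep_size = 10GB    # only matters if you do end up using slots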

++


Re: [ClusterLabs] PAF / PGSQLMS and wal_level ?

2023-09-13 Thread lejeczek via Users




On 08/09/2023 17:29, Jehan-Guillaume de Rorthais wrote:
> On Fri, 8 Sep 2023 16:52:53 +0200
> lejeczek via Users  wrote:
>
>> Hi guys.
>>
>> Before I start fiddling and break things I wonder if
>> somebody knows: can pgSQL work with wal_level = archive for PAF?
>> Or, a more general question which pertains to wal_level - can
>> _barman_ be used with pgSQL "under" PAF?
>
> PAF needs "wal_level = replica" (or "hot_standby" on very old versions) so it
> can have hot standbys where it can connect and query their status.
>
> Wal level "replica" includes the archive level, so you can set up archiving.
>
> Of course you can use barman or any other tools to manage your PITR backups,
> even when Pacemaker/PAF is looking at your instances. This is even the very
> first step you should focus on during your journey to HA.
>
> Regards,

and with _barman_ specifically - is one method preferred or
recommended over the other: streaming vs rsync - for/with PAF?

many thanks, L.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/