On Tue, Jul 8, 2025 at 4:03 PM shveta malik <shveta.ma...@gmail.com> wrote:
>
> On Mon, Jul 7, 2025 at 7:00 PM Ashutosh Bapat
> <ashutosh.bapat....@gmail.com> wrote:
> >
> > On Mon, Jul 7, 2025 at 9:53 AM shveta malik <shveta.ma...@gmail.com> wrote:
> > >
> > > Thanks for the patch.
> > >
> > > > I couldn't figure out whether the query on primary to fetch all the
> > > > slots to be synchronized should filter based on invalidation_reason
> > > > and conflicting or not. According to synchronize_slots(), it seems
> > > > that we retain invalidated slots on standby when failover = true and
> > > > they would remain with synced = true on standby. Is that right?
> > > >
> > >
> > > Yes, that’s correct. We sync the invalidation status of replication
> > > slots from the primary to the standby and then stop synchronizing any
> > > slots that have been marked as invalidated, retaining synced flag as
> > > true. IMO, there's no need to filter out conflicting slots on the
> > > primary, because instead of excluding them there and showing
> > > everything as failover-ready on the standby, the correct approach is
> > > to reflect the actual state on standby.This means conflicting slots
> > > will appear as non-failover-ready on the standby. That’s why Step 3
> > > also considers conflicting flag in its evaluation.
> >
> > Thanks for the explanation. WFM.
> >
> > If a slot is invalidated because RS_INVAL_WAL_REMOVED, conflicting =
> > false, but the slot is not useful standby. But then it's not useful on
> > primary as well. Is that why we are not including (invalidation_reason
> > IS NOT NULL) condition along in (synced AND NOT temporary AND NOT
> > conflicting)?
>
> Thanks for bringing it up. Even if the slot is not useful on the
> primary node as well, we shall still show failover-ready as false on
> standby. We should modify the query of step3 to check
> 'invalidation_reason IS NULL' instead of  'NOT conflicting'. That will
> cover all the cases where the slot  is invalidated and thus not
> failover ready.

Thanks for the confirmation.

>
> > >
> > > 1)
> > > +/* primary # */ SELECT array_agg(quote_literal(r.slot_name)) AS slots
> > > +               FROM pg_replication_slots r
> > > +               WHERE r.failover AND NOT r.temporary;
> > >
> > > On primary, there is no  need to check temporary-status. We do not
> > > allow setting failover as true for temporary slots.
> >
> > Why does synchronize_slots() has it in its query?
> >
>
> It is not needed but no harm in maintaining it.
> If we want documents to be in sync with code, we can have 'not
> temporary' check in doc as well.
>

I think it's better to keep the code and the doc in sync otherwise we
developers would get confused.

> > >
> > > 2)
> > > Although not directly related to the concerns addressed in the given
> > > patch, I think it would be helpful to add a note in the original doc
> > > stating that Steps 1 and 2 should be executed on each subscriber node
> > > that will be served by the standby after failover.
> >
> > There's a bit of semantic repeatition here since an earlier paragraph
> > mentions a given subscriber. But I think overall it's better to be
> > more clear than being less clear.
> > >
> > > I have attached a top-up patch with the above changes and a few more
> > > trivial changes. Please include it if you find it okay.
> >
> > Thanks. Included. I have made a few edits and included them in the
> > attached patch.
> >
>
> Thanks. The existing changes (originally targeted in this patch) look
> good to me.
>
> I have attached a top-up patch for step-3 correction. Please include
> if you find it okay to be fixed in the same patch, otherwise we can
> handle it separately.

I have split your top up patch into 2 - one related to the document
change being the subject of this thread and the other for fixing the
query. Committer may squash the patch, if they think so.



--
Best Wishes,
Ashutosh Bapat
From fd0b6dca15b8455597ce4e7ae7e6802335e99b69 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat....@gmail.com>
Date: Fri, 4 Jul 2025 19:35:47 +0530
Subject: [PATCH 1/2] Clarify logical replication failover document

Clarify that the existing steps in logical replication failover section
are meant for a given subscriber. Also clarify that the first two steps
are for a PostgreSQL subscriber. Add a separate paragraph to specify the
query to be used to make sure that all the replication slots have been
synchronized to the required standby in case of a planned failover.

Author: Ashutosh Bapat <ashutosh.bapat....@gmail.com>
Author: Shveta Malik <shveta.ma...@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5uiZ-fF159=jwbwpmbjzezdtmctbn+hd4mrurlcg2u...@mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml | 40 ++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index f317ed9c50e..7b3558ff248 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -709,8 +709,8 @@ HINT:  To initiate replication, you must manually create the replication slot, e
   </para>
 
   <para>
-   To confirm that the standby server is indeed ready for failover, follow these
-   steps to verify that all necessary logical replication slots have been
+   To confirm that the standby server is indeed ready for failover for a given subscriber, follow these
+   steps to verify that all the logical replication slots required by that subscriber have been
    synchronized to the standby server:
   </para>
 
@@ -782,10 +782,42 @@ HINT:  To initiate replication, you must manually create the replication slot, e
   <para>
    If all the slots are present on the standby server and the result
    (<literal>failover_ready</literal>) of the above SQL query is true, then
-   existing subscriptions can continue subscribing to publications now on the
-   new primary server.
+   existing subscriptions can continue subscribing to publications on the new
+   primary server.
+  </para>
+
+  <para>
+   The first two steps in the above procedure are meant for a
+   <productname>PostgreSQL</productname> subscriber. It is recommended to run
+   these steps on each subscriber node, that will be served by the designated
+   standby after failover, to obtain the complete list of replication
+   slots. This list can then be verified in Step 3 to ensure failover readiness.
+   Non-<productname>PostgreSQL</productname> subscribers, on the other hand, may
+   use their own methods to identify the replication slots used by their
+   respective subscriptions.
+  </para>
+
+  <para>
+   In some cases, such as during a planned failover, it is necessary to confirm
+   that all subscribers, whether <productname>PostgreSQL</productname> or
+   non-<productname>PostgreSQL</productname>, will be able to continue
+   replication after failover to a given standby server. In such cases, use the
+   following SQL, instead of performing the first two steps above, to identify
+   which replication slots on the primary need to be synced to the standby that
+   is intended for promotion. This query returns the relevant replication slots
+   associated with all the failover-enabled subscriptions.
   </para>
 
+   <para>
+<programlisting>
+/* primary # */ SELECT array_agg(quote_literal(r.slot_name)) AS slots
+               FROM pg_replication_slots r
+               WHERE r.failover AND NOT r.temporary;
+ slots
+-------
+ {'sub1','sub2','sub3', 'pg_16394_sync_16385_7394666715149055164'}
+(1 row)
+</programlisting></para>
  </sect1>
 
  <sect1 id="logical-replication-row-filter">

base-commit: a27893df45ec5d8c657899202e9cf0b9a816fe2f
-- 
2.34.1

From 5b541b0abb182ce029073ce4a2398bba96e7c3d5 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat <ashutosh.bapat....@gmail.com>
Date: Tue, 8 Jul 2025 19:17:44 +0530
Subject: [PATCH 2/2] Correction in logical failover document standby query

The logical failover documenation mentions a query to be run on the
standby to make sure that all the failover replication slots have been
synced. That query may report failover_ready as true for an invalidated
replication slot if the replication slot didn't conflict with recovery.
Fix the query to check NULL invalidation_reason instead of conflicting
column.

Author: Shveta Malik
Reviewed by: Ashutosh Bapat
Discussion: https://www.postgresql.org/message-id/CAJpy0uCNfu=t_l4o56xqed0fautuug7nyquoqhpfgansugt...@mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 7b3558ff248..e26f7f59d4a 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -764,7 +764,7 @@ HINT:  To initiate replication, you must manually create the replication slot, e
      Check that the logical replication slots identified above exist on
      the standby server and are ready for failover.
 <programlisting>
-/* standby # */ SELECT slot_name, (synced AND NOT temporary AND NOT conflicting) AS failover_ready
+/* standby # */ SELECT slot_name, (synced AND NOT temporary AND invalidation_reason IS NULL) AS failover_ready
                FROM pg_replication_slots
                WHERE slot_name IN
                    ('sub1','sub2','sub3', 'pg_16394_sync_16385_7394666715149055164');
-- 
2.34.1

Reply via email to