On Thu, Feb 26, 2026 at 4:16 PM SATYANARAYANA NARLAPURAM
<[email protected]> wrote:
>
> Hi Ashutosh,
>
> On Thu, Feb 26, 2026 at 1:11 AM Ashutosh Sharma <[email protected]> wrote:
>>
>> Hi,
>>
>> On Thu, Feb 26, 2026 at 2:15 PM shveta malik <[email protected]> wrote:
>> >
>> > On Thu, Feb 26, 2026 at 1:54 PM SATYANARAYANA NARLAPURAM
>> > <[email protected]> wrote:
>> > >
>> > > Hi Ashutosh,
>> > >
>> > > On Wed, Feb 25, 2026 at 11:42 PM Ashutosh Sharma <[email protected]> 
>> > > wrote:
>> > >>
>> > >>
>> > >> I don't think we should be comparing "synchronous_standby_names" with
>> > >> "synchronized_standby_slots", even though they appear similar in
>> > >> purpose. All values listed in synchronous_standby_names represent
>> > >> synchronous standbys exclusively, whereas synchronized_standby_slots
>> > >> can hold values for both synchronous and asynchronous standbys. In
>> > >> other words, every server referenced by synchronous_standby_names is
>> > >> of the same type, but that may not be the case with
>> > >> synchronized_standby_slots.
>> > >>
>> > >> If a GUC can hold values of different types (sync vs. async), does it
>> > >> really make sense to use a qualifier like ANY 1 (val1, val2) when val1
>> > >> and val2 are different in nature? For example, suppose val1 is a
>> > >> synchronous standby and val2 is an asynchronous standby, and we
>> > >> configure ANY 1 (val1, val2). It's possible for val2 to get ahead of
>> > >> val1 in terms of replication progress, which in turn could mean the
>> > >> logical replica is also ahead of val1. So if we were to fail over to
>> > >> val1 (since it's the only synchronous standby), we will not be able to
>> > >> use the existing logical replication setup.
>> > >
>> > >
>> > > If the failover orchestrator cannot ensure that standby1 does not
>> > > receive the quorum-committed WAL (from the archive or from standby2),
>> > > then the setting ANY 1 (val1, val2) is invalid. This setup also has
>> > > issues because, in your scenario, standby2 is ahead of the new
>> > > primary (standby1) and now needs to be rewound to get back in sync
>> > > with it. Additionally, it allowed readers to read data that was lost
>> > > at the end of the failover. We ideally need a mechanism to not send
>> > > WAL to async replicas before the sync replicas commit (honoring the
>> > > synchronous_standby_names GUC), similar to
>> > > synchronized_standby_slots. That could be a thread of its own.
>> >
>> >
>> > +1 on the overall idea of the patch.
>> > I understand the concern raised above that one of the standbys in the
>> > quorum (synchronized_standby_slots) might lag behind the logical
>> > replica, and a user could potentially fail over to such a standby. But
>> > I also agree with Amit that configuring failover correctly is
>> > ultimately the responsibility of the failover solution, and the
>> > instructions in the docs should be followed before deciding whether a
>> > standby is failover-ready.
>> >
>> > As suggested in [1], IMO, it is a reasonably good idea for
>> > 'synchronized_standby_slots' to DEFAULT to the value of
>> > 'synchronous_standby_names'. That way, even if the user forgot to
>> > configure 'synchronized_standby_slots' explicitly, we would still have
>> > reasonable protection in place. At the same time, if a user
>> > intentionally chooses not to configure it, a NULL/NONE value should
>> > remain a valid option.
>> >
>>
>> AFAIU, not all names listed in "synchronous_standby_names" are
>> necessarily synchronous standbys. Tools like pg_receivewal, for
>> example, can establish a replication connection to the primary and
>> appear in that list. Therefore, deriving "synchronized_standby_slots"
>> from "synchronous_standby_names" when it is not set by the user would
>> cause logical slots to be synchronized to whatever nodes those names
>> represent, including a host running pg_receivewal, which is certainly
>> not something the user would have intended. So I feel this might not
>> be a good choice.
>
>
> Agreed, it is not a good idea to have synchronized_standby_slots default
> to synchronous_standby_names, because application names and slot names
> are different, as stated.
>

Yes, agreed. Sorry, I missed this point earlier.

> Submitting a new version of the patch based on Satya's earlier work - [1].
>
> Please take a look and let us know your thoughts.


Had a look at the patch. A few concerns:

1)
StandbySlotsHaveCaughtup:

+ * If a slot listed in synchronized_standby_slots is not found,
+ * report an error.

--

+ * If a slot is not physical, report error.

These comments are misleading, as we may or may not report an ERROR
depending on the elevel passed by the caller.
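
To illustrate (a hand-written sketch, not the actual patch code), I
assume the reporting path looks roughly like the below, so with, say,
elevel = WARNING, nothing is actually reported as an error:

    /*
     * Sketch only: elevel comes from the caller, so whether this raises
     * an ERROR or merely logs a WARNING is the caller's decision.
     */
    ereport(elevel,
            errcode(ERRCODE_INVALID_PARAMETER_VALUE),
            errmsg("replication slot \"%s\" specified in parameter \"%s\" does not exist",
                   name, "synchronized_standby_slots"));

So perhaps the comments should say "report a message at elevel" or
similar.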

2)
It seems to me (not tested yet) that even in priority-based and
quorum-based configurations, once we have found our N slots, we will
still end up emitting WARNING messages for invalidated slots, missing
slots, etc., such as:

a)
replication slot \"%s\" specified in parameter \"%s\" does not exist
Logical replication is waiting on the standby associated with replication slot.

b)
cannot specify logical replication slot \"%s\" in parameter
Logical replication is waiting for correction on replication slot.

c)
physical replication slot \"%s\" specified in parameter \"%s\" has
been invalidated
Logical replication is waiting on the standby associated with replication slot.

These messages may give the impression that logical replication is
actually waiting, even though it might already be progressing normally
because the required N slots have been found.

OTOH, if we suppress these messages, there could be cases where we
fail to find the required N valid slots and logical replication
genuinely ends up waiting, but the user would receive no indication of
the problem.

One thing is for sure: we need to emit such messages when
'wait_for_all' is true, but for the rest, we cannot decide inline.

So there are two options:
a) Either we change the DETAIL slightly with each such report to say
the below, or anything better:

DETAIL: logical replication may wait if the required number of standby
slots is not available.

b) Or we collect the slot names and emit the message at the end, only
'if (caught_up_slot_num < required)'. Something like the below (see
also the sketch that follows):

WARNING: Some replication slots specified in parameter "%s" are not
valid: invalidated (slot1, slot2), logical (slot3), missing (slot4).
DETAIL: Logical replication is waiting because the required number of
standby slots is not available.
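
To make (b) concrete, here is a rough, untested sketch; the variable
and helper names (standby_slot_names_list, wait_for_lsn, required,
etc.) are my approximations of the existing code, not exact:

    List       *missing = NIL;
    List       *invalidated = NIL;
    List       *logical = NIL;
    int         caught_up_slot_num = 0;
    ListCell   *lc;

    foreach(lc, standby_slot_names_list)
    {
        char       *name = (char *) lfirst(lc);
        ReplicationSlot *slot = SearchNamedReplicationSlot(name, true);

        /* Collect the problematic slots instead of reporting inline. */
        if (slot == NULL)
            missing = lappend(missing, name);
        else if (SlotIsLogical(slot))
            logical = lappend(logical, name);
        else if (slot->data.invalidated != RS_INVAL_NONE)
            invalidated = lappend(invalidated, name);
        else if (slot->data.restart_lsn >= wait_for_lsn)
            caught_up_slot_num++;
    }

    /*
     * Report once, and only if we could not find the required number of
     * caught-up slots; the collected lists can be flattened into the
     * message text.
     */
    if (caught_up_slot_num < required)
        ereport(elevel,
                errmsg("some replication slots specified in parameter \"%s\" are not valid",
                       "synchronized_standby_slots"),
                errdetail("Logical replication is waiting because the required number of standby slots is not available."));

This would also avoid the per-slot WARNING noise when enough slots
have already caught up.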


Thoughts?

3)
If elevel is ERROR, do we want to error out on the first occurrence of
an invalid/missing slot? Shouldn't it look for the first N valid slots
in the priority and quorum cases, and only then decide to ERROR?
Currently it seems it will error out immediately. So, if we go with
solution 2b, this can be resolved as well.

~~

I have not tested the patch, so let me know if I have misunderstood the logic.

thanks
Shveta

