Re: Improve pg_sync_replication_slots() to wait for primary to advance

shveta malik Tue, 10 Feb 2026 01:34:17 -0800

On Tue, Feb 10, 2026 at 1:25 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Tuesday, February 10, 2026 3:02 PM shveta malik <[email protected]> 
> wrote:
> >
> > On Mon, Feb 9, 2026 at 11:36 AM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> > >
> > > Adjusted the comments as suggested.
> > >
> > > In addition to addressing the comments, I revisited the recently
> > > updated slotsync code and noticed opportunities to simplify some
> > > parameters, checks, and codes. This will also facilitate the improvement 
> > > in
> > v2-0001 coding.
> > >
> > > * Previously, certain function parameters(found_consistent_snapshot,
> > > remote_slot_precedes of update_local_synced_slot()) were used to store
> > > the reason for slot synchronization being skipped. However, now that a
> > > slot property serves this purpose, we can simplify the code by
> > > eliminating those redundant parameters and using the slot's property to
> > perform the same check.
> > >
> > > * The slot synchronization is skipped if the required WAL has not been
> > > received and flushed. Previously, this check[1] was performed in two
> > separate code paths.
> > > Such duplication can lead to coding errors if changes are made in one
> > > location without updating the other, as exemplified by the issue fixed in
> > commit 3df4df5.
> > > This commit consolidates the check into a single location to eliminate
> > > redundancies and reduce the potential for future errors.
> > >
> > > To address these points, I have created two additional patches:
> > > V3-0001 for the first point and V3-0002 for the second. V3-0003
> > > contains the current improvement being discussed, and it's also simplified
> > thanks to the preceding patches.
> > >
> > I like the idea of both the new patches. Please find a few trivial comments:
>
> Thanks for the comments.
>
> >
> > patch002:
> > 1)
> > Earlier at both the places where we were updating
> > 'SS_SKIP_WAL_NOT_FLUSHED', we were returning slot_updated as false,
> > now, we might end up returning it as true (specially at second occurrence). 
> > Is
> > this intentional?
>
> Yes. I think it's OK in the second occurrence because we did create a new temp
> slot and give some initial value for the slot. I think it's similar to
> SS_SKIP_WAL_OR_ROWS_REMOVED and SS_SKIP_NO_CONSISTENT_SNAPSHOT where we also
> return slot_updated=true in case of initial sync.
>
> >
> > 2)
> > In update_and_persist_local_synced_slot(), we can reach this even when
> > wal_not_flushed, so we shall to update comment:
> >         if (slot->slotsync_skip_reason != SS_SKIP_NONE)
> >         {
> >                 /*
> >                  * We reach here when the remote slot didn't catch up to 
> > locally
> >                  * reserved position, or it cannot reach the consistent 
> > point from the
> >                  * restart_lsn.
> > ....
> > */
>
> Updated.
>
> >
> > Patch003:
> > 3)
> > + if (slotsync_pending && slot->slotsync_skip_reason != SS_SKIP_NONE)
> > + *slotsync_pending = true;
> >
> > Here shall we ensure by a sanity check that slotsync_skip_reason !=
> > SS_SKIP_INVALID?
>
> Added an Assert for it.
>
> > And please bring back the comment as well, which was
> > there in an earlier patch which stated the reason for not including
> > 'SS_SKIP_INVALID' here.
>
> After rethinking, I chose to add the comments atop of file
> along with other related comments.
>


Thanks for the patch.

+ * Note that we do not wait and retry if the local slot has been invalidated.
+ * In such cases, the corresponding remote slot on the primary is likely
+ * invalidated as well. Even if only the local slot is invalidated, simply
+ * retrying synchronization won't suffice, as it requires further user actions
+ * to verify the server configuration, drop the invalidated slot.

On thinking more, I realized that if the local slot is invalidated
alone while the remote-slot is not, we do not wait for the user to
drop such an invalidated slot. Instead slot-sync will drop it
internally. See comments atop drop_local_obsolete_slots(). This makes
me wonder whether such a case, where only the local slot is
invalidated, should also set slotsync_pending = true, since there is a
good chance it will get synchronized in subsequent runs. OTOH, if we
do not wait for such a slot, we could end up in a situation where the
slot (remote one) is valid pre-failover but is invalid (synced one)
post-failover, even after running the API immediately before
switchover. Thoughts?

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

Reply via email to