On Tue, Feb 10, 2026 at 1:25 PM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > On Tuesday, February 10, 2026 3:02 PM shveta malik <[email protected]> > wrote: > > > > On Mon, Feb 9, 2026 at 11:36 AM Zhijie Hou (Fujitsu) > > <[email protected]> wrote: > > > > > > Adjusted the comments as suggested. > > > > > > In addition to addressing the comments, I revisited the recently > > > updated slotsync code and noticed opportunities to simplify some > > > parameters, checks, and codes. This will also facilitate the improvement > > > in > > v2-0001 coding. > > > > > > * Previously, certain function parameters(found_consistent_snapshot, > > > remote_slot_precedes of update_local_synced_slot()) were used to store > > > the reason for slot synchronization being skipped. However, now that a > > > slot property serves this purpose, we can simplify the code by > > > eliminating those redundant parameters and using the slot's property to > > perform the same check. > > > > > > * The slot synchronization is skipped if the required WAL has not been > > > received and flushed. Previously, this check[1] was performed in two > > separate code paths. > > > Such duplication can lead to coding errors if changes are made in one > > > location without updating the other, as exemplified by the issue fixed in > > commit 3df4df5. > > > This commit consolidates the check into a single location to eliminate > > > redundancies and reduce the potential for future errors. > > > > > > To address these points, I have created two additional patches: > > > V3-0001 for the first point and V3-0002 for the second. V3-0003 > > > contains the current improvement being discussed, and it's also simplified > > thanks to the preceding patches. > > > > > I like the idea of both the new patches. Please find a few trivial comments: > > Thanks for the comments. > > > > > patch002: > > 1) > > Earlier at both the places where we were updating > > 'SS_SKIP_WAL_NOT_FLUSHED', we were returning slot_updated as false, > > now, we might end up returning it as true (specially at second occurrence). > > Is > > this intentional? > > Yes. I think it's OK in the second occurrence because we did create a new temp > slot and give some initial value for the slot. I think it's similar to > SS_SKIP_WAL_OR_ROWS_REMOVED and SS_SKIP_NO_CONSISTENT_SNAPSHOT where we also > return slot_updated=true in case of initial sync. > > > > > 2) > > In update_and_persist_local_synced_slot(), we can reach this even when > > wal_not_flushed, so we shall to update comment: > > if (slot->slotsync_skip_reason != SS_SKIP_NONE) > > { > > /* > > * We reach here when the remote slot didn't catch up to > > locally > > * reserved position, or it cannot reach the consistent > > point from the > > * restart_lsn. > > .... > > */ > > Updated. > > > > > Patch003: > > 3) > > + if (slotsync_pending && slot->slotsync_skip_reason != SS_SKIP_NONE) > > + *slotsync_pending = true; > > > > Here shall we ensure by a sanity check that slotsync_skip_reason != > > SS_SKIP_INVALID? > > Added an Assert for it. > > > And please bring back the comment as well, which was > > there in an earlier patch which stated the reason for not including > > 'SS_SKIP_INVALID' here. > > After rethinking, I chose to add the comments atop of file > along with other related comments. >
Thanks for the patch. + * Note that we do not wait and retry if the local slot has been invalidated. + * In such cases, the corresponding remote slot on the primary is likely + * invalidated as well. Even if only the local slot is invalidated, simply + * retrying synchronization won't suffice, as it requires further user actions + * to verify the server configuration, drop the invalidated slot. On thinking more, I realized that if the local slot is invalidated alone while the remote-slot is not, we do not wait for the user to drop such an invalidated slot. Instead slot-sync will drop it internally. See comments atop drop_local_obsolete_slots(). This makes me wonder whether such a case, where only the local slot is invalidated, should also set slotsync_pending = true, since there is a good chance it will get synchronized in subsequent runs. OTOH, if we do not wait for such a slot, we could end up in a situation where the slot (remote one) is valid pre-failover but is invalid (synced one) post-failover, even after running the API immediately before switchover. Thoughts? thanks Shveta
