On Tue, May 19, 2026 at 8:30 PM Xuneng Zhou <[email protected]> wrote: > > Hi, > > On Tue, May 19, 2026 at 1:00 PM Alexander Lakhin <[email protected]> wrote: > > > > Hello Alexander and Xuneng, > > > > 06.04.2026 22:49, Alexander Korotkov wrote: > > > > Thank you, I've pushed your version of patchset. I made two minor > > corrections for patch #2: mention default mode value in the header > > comment, and fallback to polling on has_wal_read_bug sparc64+ext4 bug. > > > > > > I discovered a new test failure, that is apparently caused by new > > wait_for_catchup() implementation [1]: > > [06:20:23.110](1.069s) not ok 8 - check that the slot state changes to > > "extended" > > [06:20:23.110](0.001s) # Failed test 'check that the slot state changes > > to "extended"' > > # at > > /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl > > line 140. > > [06:20:23.111](0.000s) # got: 'unreserved' > > # expected: 'extended' > > [06:20:23.231](0.120s) not ok 9 - check that the slot state changes to > > "unreserved" > > [06:20:23.231](0.000s) # Failed test 'check that the slot state changes > > to "unreserved"' > > # at > > /Users/ec2-user/bf/goldfish/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl > > line 152. > > [06:20:23.231](0.000s) # got: 'lost|' > > # expected: 'unreserved|t' > > > > I've managed to reproduce such failures with: > > diff --git a/src/backend/replication/walreceiver.c > > b/src/backend/replication/walreceiver.c > > index 07eac07b9ce..493ce92674e 100644 > > --- a/src/backend/replication/walreceiver.c > > +++ b/src/backend/replication/walreceiver.c > > @@ -1143,2 +1143,3 @@ XLogWalRcvSendReply(bool force, bool requestReply, > > bool checkApply) > > > > +pg_usleep(10000); > > /* Get current timestamp. */ > > diff --git a/src/backend/replication/walsender.c > > b/src/backend/replication/walsender.c > > index 04aa770d981..19cda3a6b51 100644 > > --- a/src/backend/replication/walsender.c > > +++ b/src/backend/replication/walsender.c > > @@ -2521,2 +2521,3 @@ ProcessStandbyReplyMessage(void) > > > > +pg_usleep(100000); > > /* the caller already consumed the msgtype byte */ > > > > Concretely, a loop: > > for i in {1..100}; do echo "ITERATION $i"; PROVE_TESTS="t/019*" make -s > > check -C src/test/recovery/ || break; done > > failed for me on iterations 2, 1, 7: > > ITERATION 7 > > # +++ tap check in src/test/recovery +++ > > t/019_replslot_limit.pl .. 8/? > > # Failed test 'check that the slot state changes to "extended"' > > # at t/019_replslot_limit.pl line 140. > > # got: 'unreserved' > > # expected: 'extended' > > t/019_replslot_limit.pl .. 26/? # Looks like you failed 1 test of 26. > > t/019_replslot_limit.pl .. Dubious, test returned 1 (wstat 256, 0x100) > > Failed 1/26 subtests > > > > With "WAIT FOR LSN" in wait_for_catchup() disabled, 100 iterations > > passed. > > > > Having extra logging added, I could see the key difference. > > Failed run: > > 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl > > LOG: !!!GetWALAvailability| targetLSN: 0/016C0000, targetSeg: 22, > > oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 22 > > 2026-05-19 22:01:37.968 EEST client backend[3632148] 019_replslot_limit.pl > > STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = > > 'rep1' > > vs > > Successful run: > > 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl > > LOG: !!!GetWALAvailability| targetLSN: 0/01700000, targetSeg: 23, > > oldestSlotSeg: 23, oldestSegMaxWalSize: 24, oldestSeg: 23 > > 2026-05-19 22:04:18.102 EEST client backend[3633761] 019_replslot_limit.pl > > STATEMENT: SELECT wal_status FROM pg_replication_slots WHERE slot_name = > > 'rep1' > > > > That is, with WAIT FOR LSN, primary in this test may advance > > slot->data.restart_lsn to the expected position after wait_for_catchup() > > returns. > > > > [1] > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=goldfish&dt=2026-05-13%2006%3A15%3A03 > > Thanks for reporting this issue. > > I think this is related to the semantic change made earlier: > wait_for_catchup() now returns once the standby itself has reached the > target LSN, rather than waiting until the primary observes that > position via pg_stat_replication. > > As a result, the primary may not yet have processed the standby > feedback needed to advance the slot's restart_lsn when > wait_for_catchup() returns. > > Actually, I was aware of this semantic change and previously thought > it might be harmless. But this failure appears to disprove that. I'll > prepare a patch to fix this shortly.
After some consideration, 019_replslot_limit.pl appears to the better place to place the fix rather than by restoring the old primary-side polling barrier in wait_for_catchup(). The new wait_for_catchup() behavior is closer to its natural semantics: for replay/write/flush modes, it waits until the standby itself has reached the requested LSN. The old implementation used pg_stat_replication on the primary, which also implied that the primary had received and processed standby feedback. That was a useful side effect for this test, but it is not required by most callers. 019_replslot_limit.pl is different because it checks primary-side slot state. For a physical slot, restart_lsn advances only after the primary's walsender processes standby feedback. So the test needs an extra condition beyond ordinary standby catchup. The patch makes that dependency explicit: wait for the standby to replay the target LSN, then wait for the slot's restart_lsn on the primary to pass the same LSN. This keeps wait_for_catchup() focused on standby catchup while making the slot-specific synchronization visible in the test that needs it. -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
0001-Stabilize-replslot-limit-test-after-standby-catchup.patch
Description: Binary data
