On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>> The test gets the right PIDs, as the logs showed:
>> ok 17 - have walsender pid 12663
>> ok 18 - have walreceiver pid 12662
>
> As I understood, Horiguchi-san's point isn't that the PIDs might be
> wrong -- the point is to make sure that the process is in state T before
> moving on to the next step in the test.
I have spent more time on this issue, as it seems that I am the only
one with an environment able to reproduce it (Big Sur 11.6).  As far as
I can see, the states of the WAL sender and receiver are fine: after
adding some extra debugging with ps called from the test itself, I have
checked that they are SIGSTOP'd and SIGCONT'd as expected when a
failure shows up.

A passing run goes through the following sequence:
- Start a checkpoint.
- Send SIGSTOP to the WAL sender and receiver.
- Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal).
- Check that the WAL sender is stopped.
- Send SIGCONT to the WAL sender.
- Invalidate the slot; the checkpoint completes.
- Check the state of pg_replication_slots.
- Check that the slot invalidation shows up in the logs.

A failing run starts a checkpoint but never completes it by the time we
reach the check on pg_replication_slots, so the test remains stuck and
the slot is never invalidated.  I have played a bit with the test,
moving the location of the check "slot invalidation logged" that
watches the logs, and it fails to find the slot invalidation, as a
result of the checkpoint not finishing.  To keep the instance around
for debugging, I have launched an extra checkpoint just after the
SIGCONT sent to the WAL sender.
It remains stuck as an effect of the first one:
     kill 'CONT', $senderpid;
+    $node_primary3->safe_psql('postgres', 'checkpoint;');

With that, I am able to grab the checkpointer of primary3 to see where
it is waiting:
  * frame #0: 0x00007fff204f8c4a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000105a81a43 postgres`WaitEventSetWaitBlock(set=0x00007fb87f008748, cur_timeout=-1, occurred_events=0x00007ffeea765400, nevents=1) at latch.c:1601:7
    frame #2: 0x0000000105a80fd0 postgres`WaitEventSetWait(set=0x00007fb87f008748, timeout=-1, occurred_events=0x00007ffeea765400, nevents=1, wait_event_info=134217769) at latch.c:1396:8
    frame #3: 0x0000000105a80b46 postgres`WaitLatch(latch=0x00000001069ae7a4, wakeEvents=33, timeout=-1, wait_event_info=134217769) at latch.c:473:6
    frame #4: 0x0000000105a97011 postgres`ConditionVariableTimedSleep(cv=0x00000001069d8860, timeout=-1, wait_event_info=134217769) at condition_variable.c:163:10
    frame #5: 0x0000000105a96f32 postgres`ConditionVariableSleep(cv=0x00000001069d8860, wait_event_info=134217769) at condition_variable.c:100:9
    frame #6: 0x0000000105a299cf postgres`InvalidatePossiblyObsoleteSlot(s=0x00000001069d8780, oldestLSN=8388608, invalidated=0x00007ffeea76559f) at slot.c:1264:4
    frame #7: 0x0000000105a296bd postgres`InvalidateObsoleteReplicationSlots(oldestSegno=8) at slot.c:1333:7
    frame #8: 0x00000001055edbe6 postgres`CreateCheckPoint(flags=192) at xlog.c:9275:6
    frame #9: 0x00000001059b753d postgres`CheckpointerMain at checkpointer.c:448:5
    frame #10: 0x00000001059b470d postgres`AuxiliaryProcessMain(auxtype=CheckpointerProcess) at auxprocess.c:153:4
    frame #11: 0x00000001059c8912 postgres`StartChildProcess(type=CheckpointerProcess) at postmaster.c:5498:3
    frame #12: 0x00000001059c68fe postgres`PostmasterMain(argc=4, argv=0x00007fb87e505400) at postmaster.c:1458:21
    frame #13: 0x000000010589e1bf postgres`main(argc=4, argv=0x00007fb87e505400) at main.c:198:3
    frame #14: 0x00007fff20544f3d libdyld.dylib`start + 1

So there is really
something fishy here IMO, something other than just a test mis-design,
and it looks like a race condition, perhaps around
InvalidateObsoleteReplicationSlots().  Thoughts?
--
Michael