On Mon, Sep 20, 2021 at 09:38:29AM -0300, Alvaro Herrera wrote:
> On 2021-Sep-20, Michael Paquier wrote:
>> The test gets the right PIDs, as the logs showed:
>> ok 17 - have walsender pid 12663
>> ok 18 - have walreceiver pid 12662
> 
> As I understood, Horiguchi-san's point isn't that the PIDs might be
> wrong -- the point is to make sure that the process is in state T before
> moving on to the next step in the test.

I have spent more time on this issue, as it looks like I am the only
one with an environment able to reproduce it (Big Sur 11.6).

As far as I can see, the states of the WAL sender and receiver are
fine: after adding some extra debugging with ps called from the test
itself, I have checked that they are correctly SIGSTOP'd and SIGCONT'd
when a failure shows up.
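
For the archives, the extra debugging was along these lines (a rough
sketch, not the exact diff I used; it relies on a BSD-style ps where a
state of "T" means stopped):

# Hypothetical helper returning the one-letter process state reported
# by ps ("T" means stopped by SIGSTOP on macOS/BSD).
sub process_state
{
    my ($pid) = @_;
    my $stat = `ps -o stat= -p $pid`;
    $stat =~ s/^\s+|\s+$//g;
    return substr($stat, 0, 1);
}

kill 'STOP', $senderpid, $receiverpid;
note "walsender state: " . process_state($senderpid);
note "walreceiver state: " . process_state($receiverpid);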

A sequence that passes goes through the following steps (a rough Perl
sketch of the core of it follows the list):
- Start checkpoint.
- SIGSTOP sent to WAL sender and receiver.
- Advance WAL (CREATE TABLE, DROP TABLE and pg_switch_wal).
- Check that the WAL sender is stopped.
- SIGCONT on WAL sender.
- Invalidate the slot, checkpoint completes.
- Check state of pg_replication_slots.
- Check that slot invalidation happened in the logs.
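
For reference, the middle of that sequence looks roughly like this in
the test (simplified; $node_primary3 and $senderpid are as in the diff
quoted below, $receiverpid is assumed, and the SQL is paraphrased):

# Freeze both ends of replication so the slot's restart_lsn cannot move.
kill 'STOP', $senderpid, $receiverpid;

# Advance WAL beyond what the slot is allowed to retain, then switch
# to a new segment.
$node_primary3->safe_psql('postgres',
    "CREATE TABLE t (); DROP TABLE t; SELECT pg_switch_wal();");

# Release the WAL sender so the checkpointer can invalidate the slot.
kill 'CONT', $senderpid;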

In a failed sequence, the checkpoint starts but never completes, so the
slot is never invalidated and the test remains stuck at the check on
pg_replication_slots.  I have also played a bit with the test and moved
around the "slot invalidation logged" check that watches the logs; it
fails to find the slot invalidation as well, again because the
checkpoint never finishes.
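
For reference, that check is essentially a poll on pg_replication_slots
until the slot shows up as invalidated, something like this (a sketch;
the slot name 'rep3' is a placeholder):

# Never satisfied in the failing runs, because the checkpoint that
# would invalidate the slot is stuck.
$node_primary3->poll_query_until('postgres',
    "SELECT wal_status = 'lost' FROM pg_replication_slots WHERE slot_name = 'rep3'")
    or die "timed out waiting for slot invalidation";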

To keep the instance around for debugging, I have just launched an
extra checkpoint after the SIGCONT sent to the WAL sender.  This one
gets stuck behind the first, unfinished checkpoint:
 kill 'CONT', $senderpid;
+$node_primary3->safe_psql('postgres', 'checkpoint;');

With that, I am able to attach to the checkpointer of primary3 and see
where it is waiting:
  * frame #0: 0x00007fff204f8c4a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x0000000105a81a43 postgres`WaitEventSetWaitBlock(set=0x00007fb87f008748, cur_timeout=-1, occurred_events=0x00007ffeea765400, nevents=1) at latch.c:1601:7
    frame #2: 0x0000000105a80fd0 postgres`WaitEventSetWait(set=0x00007fb87f008748, timeout=-1, occurred_events=0x00007ffeea765400, nevents=1, wait_event_info=134217769) at latch.c:1396:8
    frame #3: 0x0000000105a80b46 postgres`WaitLatch(latch=0x00000001069ae7a4, wakeEvents=33, timeout=-1, wait_event_info=134217769) at latch.c:473:6
    frame #4: 0x0000000105a97011 postgres`ConditionVariableTimedSleep(cv=0x00000001069d8860, timeout=-1, wait_event_info=134217769) at condition_variable.c:163:10
    frame #5: 0x0000000105a96f32 postgres`ConditionVariableSleep(cv=0x00000001069d8860, wait_event_info=134217769) at condition_variable.c:100:9
    frame #6: 0x0000000105a299cf postgres`InvalidatePossiblyObsoleteSlot(s=0x00000001069d8780, oldestLSN=8388608, invalidated=0x00007ffeea76559f) at slot.c:1264:4
    frame #7: 0x0000000105a296bd postgres`InvalidateObsoleteReplicationSlots(oldestSegno=8) at slot.c:1333:7
    frame #8: 0x00000001055edbe6 postgres`CreateCheckPoint(flags=192) at xlog.c:9275:6
    frame #9: 0x00000001059b753d postgres`CheckpointerMain at checkpointer.c:448:5
    frame #10: 0x00000001059b470d postgres`AuxiliaryProcessMain(auxtype=CheckpointerProcess) at auxprocess.c:153:4
    frame #11: 0x00000001059c8912 postgres`StartChildProcess(type=CheckpointerProcess) at postmaster.c:5498:3
    frame #12: 0x00000001059c68fe postgres`PostmasterMain(argc=4, argv=0x00007fb87e505400) at postmaster.c:1458:21
    frame #13: 0x000000010589e1bf postgres`main(argc=4, argv=0x00007fb87e505400) at main.c:198:3
    frame #14: 0x00007fff20544f3d libdyld.dylib`start + 1

So there is really something fishy here IMO, more than just a test
mis-design; it looks like a race condition, perhaps around
InvalidateObsoleteReplicationSlots().

Thoughts?
--
Michael
