At Wed, 5 Apr 2023 11:55:14 -0700, Andres Freund <and...@anarazel.de> wrote in 
> Hi,
> 
> On 2023-04-05 11:48:53 -0700, Andres Freund wrote:
> > Note that a checkpoint started at "17:50:23.787", but didn't finish before the
> > database was shut down. As far as I can tell, this can not be caused by
> > checkpoint_timeout, because by the time we get to invalidating replication
> > slots, we already did CheckPointBuffers(), and that's the only thing that
> > delays based on checkpoint_timeout.
> > 
> > ISTM that this indicates that checkpointer got stuck after signalling
> > 344783.
> > 
> > Do you see any other explanation?
> 
> This all sounded vaguely familiar. After a bit of digging I found this:
> 
> https://postgr.es/m/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de
> 
> Which seems like it plausibly explains the failed test?
As I understand it, ConditionVariableSleep() can wake up spuriously, and
ReplicationSlotControlLock doesn't prevent slot release. So I can imagine a
situation where that blocking might happen: if the call
ConditionVariableSleep(&s->active_cv) wakes up unexpectedly because the latch
was set for some reason other than the CV broadcast, and the target process
releases the slot between the fetch of active_pid in the loop and the
following call to ConditionVariablePrepareToSleep(), then the CV broadcast
triggered by the slot release is missed.

If that's the case, we'll need to check active_pid again after calling
ConditionVariablePrepareToSleep(). Does this make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
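To illustrate the point, here is a simplified, self-contained model of the
race and of the proposed fix. It uses Python threading, and the
ConditionVariable class, the slot dict, and wait_for_release() are
hypothetical stand-ins mimicking the prepare/sleep/broadcast protocol, not
PostgreSQL's actual API; a broadcast only wakes waiters that have already
"prepared", which is why the condition must be re-checked after preparing:

```python
import threading

class ConditionVariable:
    """Toy model of a lockless CV: a broadcast reaches only waiters
    that registered (prepared) before the broadcast happened."""
    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = []

    def prepare_to_sleep(self):
        ev = threading.Event()
        with self._lock:
            self._waiters.append(ev)
        return ev

    def sleep(self, ev, timeout=1.0):
        # The timeout stands in for spurious wakeups (e.g. latch set
        # for an unrelated reason).
        ev.wait(timeout)

    def cancel_sleep(self, ev):
        with self._lock:
            if ev in self._waiters:
                self._waiters.remove(ev)

    def broadcast(self):
        with self._lock:
            for ev in self._waiters:
                ev.set()
            self._waiters.clear()

# Hypothetical slot state, loosely mirroring active_pid.
slot = {"active_pid": 12345}
cv = ConditionVariable()

def wait_for_release():
    while True:
        if slot["active_pid"] == 0:
            return
        ev = cv.prepare_to_sleep()
        # The fix under discussion: re-check *after* preparing, so a
        # release that happened between the first check and the
        # registration cannot leave us sleeping on a missed broadcast.
        if slot["active_pid"] == 0:
            cv.cancel_sleep(ev)
            return
        cv.sleep(ev)
        cv.cancel_sleep(ev)

def release_slot():
    slot["active_pid"] = 0
    cv.broadcast()

t = threading.Thread(target=wait_for_release)
t.start()
release_slot()
t.join(timeout=5)
print("waiter observed slot release:", not t.is_alive())
```

Without the second check, an interleaving where release_slot() runs entirely
between the first active_pid test and prepare_to_sleep() would leave the
waiter sleeping until a spurious wakeup, which matches the hang described
above.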