On Wed, 22 Apr 2026 at 21:05, Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]

Yeah, that was a misidentification of the exact race that caused the issue.

On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <[email protected]> wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> >     /*
> >      * We start postmaster children with signals blocked. This allows them to
> >      * install their own handlers before unblocking, to avoid races where they
> >      * might run the postmaster's handler and miss an important control
> >      * signal. With more analysis this could potentially be relaxed.
> >      */
> >     sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:

Ah, that'd be it indeed. Thanks!

> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation. However,
> now that we use signal barriers for online wal_level changes and
> checksum status updates, this race condition is likely to be
> encountered more frequently.

Yes, I think the boot process with the xlog_logical_info barrier is more
likely to hit this issue; as indicated by two known detected cases in
various CI jobs; though it could also be that the lockup of the new
barrier is just exceptionally bad for system stability.

As for the patches: v1-0001 -- LGTM.
0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems, which need to process
shared state changes by responding to procsignals; see attached.
smgr's procsignal doesn't really depend on shared memory state, so
I've kept that out of my patch.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch
Description: Binary data
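
For readers following the signal-masking argument quoted above, here is a minimal standalone POSIX sketch of why installing a handler late is harmless as long as the signal is still blocked: a SIGUSR1 sent during that window stays pending and is delivered once it is unblocked. This is not PostgreSQL code. The names handle_usr1 and blocked_signal_survives_late_handler are hypothetical, and the sketch assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_usr1 = 0;

/* Hypothetical handler: only records that SIGUSR1 was delivered. */
static void
handle_usr1(int signo)
{
    (void) signo;
    got_usr1 = 1;
}

/*
 * Returns 1 if a SIGUSR1 sent while the signal was blocked (and before any
 * handler was installed) stays pending and reaches the late-installed
 * handler after unblocking; returns 0 otherwise.
 */
int
blocked_signal_survives_late_handler(void)
{
    sigset_t    block, saved, pending;
    struct sigaction sa, old_sa;
    int         rc, was_pending;

    got_usr1 = 0;

    /* 1. Block SIGUSR1, mimicking a child that starts with signals blocked. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    rc = sigprocmask(SIG_BLOCK, &block, &saved);
    assert(rc == 0);

    /* 2. The signal arrives before any handler has been installed. */
    rc = kill(getpid(), SIGUSR1);
    assert(rc == 0);

    /* It must be pending now, and the handler must not have run. */
    sigemptyset(&pending);
    rc = sigpending(&pending);
    assert(rc == 0);
    was_pending = (sigismember(&pending, SIGUSR1) == 1) && (got_usr1 == 0);

    /* 3. Install the handler late, while the signal is still blocked. */
    sa.sa_handler = handle_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGUSR1, &sa, &old_sa);
    assert(rc == 0);

    /* 4. Unblock: the pending signal is delivered to the new handler. */
    rc = sigprocmask(SIG_UNBLOCK, &block, NULL);
    assert(rc == 0);

    /* Restore the caller's handler and signal mask. */
    rc = sigaction(SIGUSR1, &old_sa, NULL);
    assert(rc == 0);
    rc = sigprocmask(SIG_SETMASK, &saved, NULL);
    assert(rc == 0);

    return was_pending && got_usr1 == 1;
}
```

On a POSIX system the function should return 1. The sketch covers only the ordering of handler installation relative to unblocking; it does not model the procsignal slot or barrier state discussed in the thread.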
