On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <[email protected]> wrote: > > On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <[email protected]> wrote: > > > > Dear Sawada-san, > > > > 01.05.2026 01:08, Masahiko Sawada wrote: > > > > On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <[email protected]> > > wrote: > > > > I was wondering why is that failure the only one of this kind on buildfarm > > (in last two years, at least), so I've tried to reproduce it on > > REL_18_STABLE... and failed. > > > > Then I've bisected it on the master branch and found (your) commit that > > introduced this behavior: 67c20979c from 2025-12-23. > > > > I've confirmed that this race condition issue is present from v15 to > > the master. In v14, we have the procsignal barrier code but don't use > > it anywhere. In v18 or older, it could happen when executing DROP > > DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen > > in more cases as we're using procsignal barrier more places. In any > > case, if a process emits a signal barrier when another process is > > between the initialization of slot->pss_barrierGeneration and > > slot->pss_pid initialization, the subsequent > > WaitForProcSignalBarrier() ends up waiting for that process forever. > > So I think the patch should be backpatched to v15. Please review these > > patches. 
> > > > > > Yes, you're right -- it's not reproduced on REL_18_STABLE with > > test_oat_hooks, which simply starts postgres node (as many other tests), > > but when I tried the full test suite with the sleep inserted before > > setting pss_pid, I discovered the following vulnerable tests: > > > > 030_stats_cleanup_replica_standby.log > > 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with > > PID 2086578 to accept ProcSignalBarrier > > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for > > Database/DROP: dir 1663/16393 > > > > 033_replay_tsp_drops_standby2_FILE_COPY.log > > 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with > > PID 2030901 to accept ProcSignalBarrier > > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for > > Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389 > > > > 040_standby_failover_slots_sync_publisher.log > > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl > > LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier > > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl > > STATEMENT: DROP DATABASE slotsync_test_db; > > > > 002_compare_backups_pitr1.log > > 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with > > PID 1829396 to accept ProcSignalBarrier > > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for > > Database/DROP: dir 1663/16414 > > > > I've tried my repro with 033_replay_tsp_drops and it really fails on > > REL_15_STABLE..master and doesn't fail on REL_14_STABLE. > > > > FYI I found that we had a similar report[1] last year, I'm not sure > > it hit the exact same issue, though. 
> > > > Regards, > > > > [1] > > https://www.postgresql.org/message-id/cagqgydtavkg3dbtebtyxzlm48jmzr2bcvteybswlv5hvwsb...@mail.gmail.com > > > > > > Yeah, and probably this one: > > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru > > > > By the way, mamba produced the same failure just yesterday: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39 > > > > # Running: pg_ctl --wait --pgdata > > /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata > > --log > > /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log > > --options --cluster-name=primary start > > waiting for server to > > start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... > > stopped waiting > > pg_ctl: server did not start in time > > 004_restart_primary.log > > 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with > > PID 11506 to accept ProcSignalBarrier > > ... > > 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend > > with PID 11506 to accept ProcSignalBarrier > > > > The proposed patches make the test pass reliably for me in all affected > > branches. Thank you for working on this! > > > > Thank you for checking this issue on stable branches too! 
> > Considering that this issue is not very visible in practice and we're > going to release new minor versions next week, I'm planning to push > these fixes to master and backbranches after the minor releases. That > way, we can fix the issue on the master relatively soon and have > enough time to verify that fix works well on backbranches. >
While reviewing the patches, I realized that it would be better to use
pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
pg_memory_barrier() where available. I've updated the patches for master
and v18, and slightly revised the commit messages.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
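For readers comparing the two forms mentioned above, here is a rough C11 analogue. It is only a sketch under assumptions: DemoSlot, demo_global_generation, and both function names are hypothetical stand-ins, and <stdatomic.h> is used in place of PostgreSQL's atomics API, so the mapping to pg_atomic_write_u32() + pg_memory_barrier() versus pg_atomic_write_membarrier_u32() is approximate rather than exact.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a slot's pss_pid; not the real struct. */
typedef struct DemoSlot
{
	_Atomic uint32_t pid;		/* 0 means "slot not published yet" */
} DemoSlot;

/* Hypothetical stand-in for the shared psh_barrierGeneration counter. */
static _Atomic uint64_t demo_global_generation = 1;

/*
 * Form 1: a relaxed store followed by an explicit full fence, roughly
 * analogous to a plain write plus a separate memory barrier.
 */
uint64_t
publish_with_fence(DemoSlot *slot, uint32_t pid)
{
	assert(pid != 0);			/* 0 is reserved for "unused" */
	atomic_store_explicit(&slot->pid, pid, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&demo_global_generation,
								memory_order_relaxed);
}

/*
 * Form 2: a single sequentially consistent store, assumed here to play
 * the role of a write that carries its own full-barrier semantics.
 */
uint64_t
publish_with_ordered_store(DemoSlot *slot, uint32_t pid)
{
	assert(pid != 0);
	atomic_store(&slot->pid, pid);
	return atomic_load(&demo_global_generation);
}
```

Both forms are meant to keep the PID store from being reordered after the generation load. The single-call form mainly removes the chance that the separate barrier is later dropped or moved away from the store.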
From b7606bea5ad7564b73ea4a2575f547113e532018 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

Fix this by publishing pss_pid before reading psh_barrierGeneration,
with a memory barrier in between so that the store to pss_pid is
ordered before the load. A concurrent EmitProcSignalBarrier() then
either observes the published PID and signals this slot, or completes
its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were
introduced to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.

This issue was also reported by buildfarm member flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index d6857f5a8bb..50b3cb2fd7b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(void)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is ordered before the subsequent load of
+	 * psh_barrierGeneration.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(void)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0
From 4979dfae9f8638627e5fb79cb0079e00883fd761 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

Fix this by publishing pss_pid before reading psh_barrierGeneration,
with a memory barrier in between so that the store to pss_pid is
ordered before the load. A concurrent EmitProcSignalBarrier() then
either observes the published PID and signals this slot, or completes
its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were
introduced to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.

This issue was also reported by buildfarm member flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 21a9fc0fdd2..f710815d9ec 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(int pss_idx)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is ordered before the subsequent load of
+	 * psh_barrierGeneration.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(int pss_idx)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0
From 144ace5abf197b4435d9aa1e7525614c0a8ae70f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

Fix this by publishing pss_pid before reading psh_barrierGeneration,
with a memory barrier in between so that the store to pss_pid is
ordered before the load. A concurrent EmitProcSignalBarrier() then
either observes the published PID and signals this slot, or completes
its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were
introduced to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.

This issue was also reported by buildfarm member flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 264e4c22ca6..1397f65f67b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -188,6 +188,15 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is ordered before the subsequent load of
+	 * psh_barrierGeneration.
+	 */
+	pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -207,7 +216,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
 	slot->pss_cancel_key_len = cancel_key_len;
-	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
 
 	SpinLockRelease(&slot->pss_mutex);
 
-- 
2.54.0
From 8b303ee35ad640299c5706bceb401a2706a5be2f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

Fix this by publishing pss_pid before reading psh_barrierGeneration,
with a memory barrier in between so that the store to pss_pid is
ordered before the load. A concurrent EmitProcSignalBarrier() then
either observes the published PID and signals this slot, or completes
its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were
introduced to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.

This issue was also reported by buildfarm member flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18d..9dfe000353d 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -176,6 +176,16 @@ ProcSignalInit(int pss_idx)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is ordered before the subsequent load of
+	 * psh_barrierGeneration.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -193,9 +203,6 @@ ProcSignalInit(int pss_idx)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0
From 921c2e145f081c6acc05e6da2f0d14ac747d2cf0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

Fix this by publishing pss_pid before reading psh_barrierGeneration,
with a memory barrier in between so that the store to pss_pid is
ordered before the load. A concurrent EmitProcSignalBarrier() then
either observes the published PID and signals this slot, or completes
its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were
introduced to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.

This issue was also reported by buildfarm member flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 05d99b452c3..e7c9da2b940 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -185,6 +185,15 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is ordered before the subsequent load of
+	 * psh_barrierGeneration.
+	 */
+	pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -204,7 +213,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
 	slot->pss_cancel_key_len = cancel_key_len;
-	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
 
 	SpinLockRelease(&slot->pss_mutex);
 
-- 
2.54.0
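
The "either observes the published PID, or completes its generation increment before we load it" argument in the commit messages can be checked with a small exhaustive model. This is a sketch under strong assumptions, not PostgreSQL code: every step is treated as atomic and sequentially consistent (which is what the added barrier is meant to guarantee for this store/load pair), the emitter is reduced to two steps (bump the generation, then check the PID), and a signaled process is assumed to always catch up. All type and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Steps of the modelled ProcSignalInit() side. */
typedef enum InitStep
{
	INIT_READ_GEN,				/* local copy = global generation */
	INIT_WRITE_GEN,				/* slot generation = local copy */
	INIT_WRITE_PID				/* publish the PID in the slot */
} InitStep;

/* Hypothetical, simplified shared state; not the real ProcSignal structs. */
typedef struct Model
{
	uint64_t	global_gen;		/* stands in for psh_barrierGeneration */
	uint64_t	slot_gen;		/* stands in for pss_barrierGeneration */
	uint32_t	slot_pid;		/* stands in for pss_pid, 0 = unused */
	uint64_t	local_gen;		/* initializer's private copy */
	uint64_t	emitted_gen;	/* generation the waiter waits for */
	bool		signaled;		/* did the emitter see a nonzero PID? */
} Model;

/* Ordering before the fix, and ordering after it. */
const InitStep OLD_ORDER[3] = {INIT_READ_GEN, INIT_WRITE_GEN, INIT_WRITE_PID};
const InitStep FIXED_ORDER[3] = {INIT_WRITE_PID, INIT_READ_GEN, INIT_WRITE_GEN};

static int
bit_count(unsigned v)
{
	int			n = 0;

	for (; v != 0; v &= v - 1)
		n++;
	return n;
}

static void
init_step(Model *m, InitStep step)
{
	switch (step)
	{
		case INIT_READ_GEN:
			m->local_gen = m->global_gen;
			break;
		case INIT_WRITE_GEN:
			m->slot_gen = m->local_gen;
			break;
		case INIT_WRITE_PID:
			m->slot_pid = 4242;	/* arbitrary nonzero PID */
			break;
	}
}

static void
emit_step(Model *m, int n)
{
	if (n == 0)
		m->emitted_gen = ++m->global_gen;	/* bump the generation */
	else
		m->signaled = (m->slot_pid != 0);	/* slots without a PID are skipped */
}

/* Count the interleavings that leave the waiter stuck forever. */
int
count_hangs(const InitStep order[3])
{
	int			hangs = 0;

	/* Each 5-bit mask with exactly two bits set is one interleaving. */
	for (unsigned mask = 0; mask < 32; mask++)
	{
		/* Modelling choice: an unused slot counts as already caught up. */
		Model		m = {.global_gen = 1, .slot_gen = UINT64_MAX};
		int			i = 0;
		int			e = 0;

		if (bit_count(mask) != 2)
			continue;
		for (int pos = 0; pos < 5; pos++)
		{
			if (mask & (1u << pos))
				emit_step(&m, e++);
			else
				init_step(&m, order[i++]);
		}
		assert(i == 3 && e == 2);

		/* Stuck: slot is behind the emitted generation and nobody signaled it. */
		if (!m.signaled && m.slot_gen < m.emitted_gen)
			hangs++;
	}
	return hangs;
}
```

Under these assumptions, count_hangs(OLD_ORDER) finds 3 stuck interleavings out of 10, while count_hangs(FIXED_ORDER) finds none. The model says nothing about real hardware reordering; it only illustrates why the fixed ordering closes the window once the store and the load are kept in program order.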
