Hi,

On 2026-02-13 19:11:52 +0200, Heikki Linnakangas wrote:
> On 29/01/2026 03:05, Heikki Linnakangas wrote:
> > Here's a new rebased and massively re-worked patch.
> >
> > The patches are split differently than in the previous version:
> > - Patches 0001-0005 are just refactoring around recovery conflicts which
> > I posted on a separate thread [1]. Please review and comment there.
> > - Patches 0006 and 0007 are small cleanups that could be applied early
> > (after updating the docs, as noted in TODO comment there).
> > - All the interesting bits for this thread are in the last, massive patch.
> >
> > To review this, I suggest starting from the new src/backend/ipc/
> > README.md file. It gives a good overview of the mechanism (or if it
> > doesn't, that's valuable feedback :-) ). It also contains a bunch of
> > Open Questions at the bottom; I'd love to hear opinions and ideas on
> > those.
>
> New version attached. Lots of little cleanups, and a few more notable
> changes:
>
> - Switched to 64-bit interrupt masks. This gives more headroom for
> extensions to reserve their own custom interrupts

I don't think that's quite legal as done here - we can't rely on 64-bit atomics
inside signal handlers, because they may be emulated with a spinlock. That in
turn means that interrupting oneself while modifying a 64-bit atomic could
result in a self-deadlock.

It may be possible to make the 64-bit atomics emulation signal safe, but I
think the complexity and the overhead would likely make that problematic.

I guess I'd just emulate it by splitting the interrupt mask into two? Perhaps
with a special interrupt bit set in the first atomic indicating that the
second word has pending interrupts?
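For illustration, a minimal sketch of that split (all names invented; in the tree this would use pg_atomic_* rather than C11 atomics, and bit 31 of the low word is sacrificed as the "look at the high word" flag, leaving 63 usable bits):

```c
#include <stdatomic.h>
#include <stdint.h>

#define INTERRUPT_HI_PENDING	(UINT32_C(1) << 31)

static _Atomic uint32_t pending_lo;
static _Atomic uint32_t pending_hi;

/* async-signal-safe: only 32-bit atomic ORs */
static void
raise_interrupt64(uint64_t mask)
{
	uint32_t	hi = (uint32_t) (mask >> 32);
	uint32_t	lo = (uint32_t) mask;

	if (hi != 0)
	{
		atomic_fetch_or(&pending_hi, hi);
		lo |= INTERRUPT_HI_PENDING; /* tell readers to check the high word */
	}
	if (lo != 0)
		atomic_fetch_or(&pending_lo, lo);
}

/* consume-and-clear, as CHECK_FOR_INTERRUPTS() would */
static uint64_t
consume_interrupts64(void)
{
	uint32_t	lo = atomic_exchange(&pending_lo, 0);
	uint64_t	result = lo & ~INTERRUPT_HI_PENDING;

	if (lo & INTERRUPT_HI_PENDING)
		result |= (uint64_t) atomic_exchange(&pending_hi, 0) << 32;
	return result;
}
```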


If we force everyone to touch all the WaitLatch() calls, how about we move to
a saner unit for the timeout? For one, long is stupid, due to its
platform-dependent width. For another, a millisecond is too long a sleep time
for some things.

I wonder if we really need timeout handlers in the current form anymore. What
timeouts now do is to just raise an interrupt that is then reacted to
subsequently. Seems we should just change RegisterTimeout to accept the
interrupt it should trigger?
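A sketch of the shape that API change could take (names invented, the real timeout.c bookkeeping is much richer):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef int TimeoutId;
#define MAX_TIMEOUTS 32

static _Atomic uint64_t pending_interrupts;
static uint64_t timeout_interrupt[MAX_TIMEOUTS];

/* RegisterTimeout() variant: remember which interrupt this timeout raises */
static void
register_timeout_interrupt(TimeoutId id, uint64_t interrupt_bit)
{
	timeout_interrupt[id] = interrupt_bit;
}

/*
 * What the shared SIGALRM handler would do when timeout 'id' expires: no
 * user callback, just OR the registered bit into the pending mask
 * (async-signal-safe as long as the atomic OR itself is).
 */
static void
timeout_expired(TimeoutId id)
{
	atomic_fetch_or(&pending_interrupts, timeout_interrupt[id]);
}
```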



> - I mostly gave up on sending interrupts from postmaster to child processes.
> So postmaster still uses kill() to tell child processes to exit or do other
> things. There's a TODO section in the README about that, but this is fine
> for now.
>
> - Implemented a RequestAddinInterrupt() function for extensions to request
> interrupt bits. (TODO: use it in one of the example extensions, to show how
> to use it and to test that it works)



> From 3a81c457dc2b7daf89b8cdf4f058c1aaa52342da Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <[email protected]>
> Date: Fri, 13 Feb 2026 13:19:47 +0200
> Subject: [PATCH 1/5] Ignore SIGINT in walwriter and walsummarizer
>
> Previously, SIGINT was treated the same as SIGTERM in walwriter and
> walsummarizer. That decision goes back to when the walwriter process
> was introduced (commit ad4295728e04), and was later copied to
> walsummarizer. It was a pretty arbitrary decision back then, and we
> haven't adopted that convention in all the other processes that have
> been introduced later.
>
> Summary of how other processes respond to SIGINT:
> - Autovacuum launcher: Cancel the current iteration of launching
> - bgworker: Ignore (unless connected to a database)
> - checkpointer: Request shutdown checkpoint
> - pgarch: Ignore
> - startup process: Ignore
> - walreceiver: Ignore
> - IO worker: die()

- bgwriter: Ignore



> IO workers are a notable exception in that they exit on SIGINT, and
> there's a documented reason for that: IO workers ignore SIGTERM, so
> SIGINT provides a way to manually kill them. (They do respond to
> SIGUSR2, though, like all the other processes that we don't want to
> exit immediately on SIGTERM on operating system shutdown.)
>
> To make this a little more consistent, ignore SIGINT in walwriter and
> walsummarizer. They have no "query" to cancel, and they react to
> SIGTERM just fine.

WFM.  Let's get this merged.


> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
> index 21de158adbb..b88ce440b1d 100644
> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -4285,13 +4285,6 @@ PostgresMain(const char *dbname, const char *username)
>               pqsignal(SIGUSR1, procsignal_sigusr1_handler);
>               pqsignal(SIGUSR2, SIG_IGN);
>               pqsignal(SIGFPE, FloatExceptionHandler);
> -
> -             /*
> -              * Reset some signals that are accepted by postmaster but not by
> -              * backend
> -              */
> -             pqsignal(SIGCHLD, SIG_DFL); /* system() requires this on some
> -                                          * platforms */
>       }
>
>       /* Early initialization */

For a moment I wondered if there's a problem for single user mode, which also
goes through this path: But I think not, because that will never have set the
SIGCHLD handler to a non-default value, and the default is ignore...


> From 63d1a57f4906a924c426def4e1a7f27a71611b28 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <[email protected]>
> Date: Tue, 20 Jan 2026 16:57:25 +0200
> Subject: [PATCH 3/5] Use standard die() handler for SIGTERM in bgworkers
> -/*
> - * Standard SIGTERM handler for background workers
> - */
> -static void
> -bgworker_die(SIGNAL_ARGS)
> -{
> -     sigprocmask(SIG_SETMASK, &BlockSig, NULL);
> -
> -     ereport(FATAL,
> -                     (errcode(ERRCODE_ADMIN_SHUTDOWN),
> -                      errmsg("terminating background worker \"%s\" due to administrator command",
> -                                     MyBgworkerEntry->bgw_type)));
> -}
> -
>  /*
>   * Main entry point for background worker processes.
>   */

Uh, huh. So we were defaulting to completely unsafe code in bgworkers all this
time?  This obviously can self-deadlock against memory allocations etc in the
interrupted code... Or cause confusion with the IO streams for stderr. Or ...

We really need some instrumentation that fails if we do allocations in signal
handlers etc.



> From c18137939a0d2561fedc247a3f3135355923ffa8 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <[email protected]>
> Date: Fri, 13 Feb 2026 14:23:35 +0200
> Subject: [PATCH 4/5] Refactor how some aux processes advertise their
>  ProcNumber
>
> This moves the responsibility of setting the
> ProcGlobal->walrewriterProc and checkpointerProc fields to
> InitAuxiliaryProcess. Also switch to the same pattern to advertise the
> autovacuum launcher's ProcNumber, replacing the ad hoc av_launcherpid
> field in shared memory. This can easily be extended to other aux
> processes in the future, if other processes need to find them.
>
> Switch to pg_atomic_uint32 for the fields. Seems easier to reason
> about than volatile pointers. There is some precedent for that, as we
> were already using pg_atomic_uint32 for the procArrayGroupFirst and
> clogGroupFirst fields, which also store ProcNumbers.

> TODO: could also replace WalRecv->procno with this
> ---
>  src/backend/access/transam/xlog.c     |  3 +--
>  src/backend/postmaster/autovacuum.c   | 19 +++++++++--------
>  src/backend/postmaster/checkpointer.c | 14 ++++---------
>  src/backend/postmaster/walwriter.c    |  6 ------
>  src/backend/storage/lmgr/proc.c       | 30 +++++++++++++++++++++++++--
>  src/include/storage/proc.h            |  7 ++++---
>  6 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 13ec6225b85..8e2e4659974 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2653,8 +2653,7 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
>
>       if (wakeup)
>       {
> -             volatile PROC_HDR *procglobal = ProcGlobal;
> -             ProcNumber      walwriterProc = procglobal->walwriterProc;
> +             ProcNumber      walwriterProc = pg_atomic_read_u32(&ProcGlobal->walwriterProc);
>
>               if (walwriterProc != INVALID_PROC_NUMBER)
>                       SetLatch(&GetPGProcByNumber(walwriterProc)->procLatch);

It's not even clear what the volatile is trying to achieve here. Sure, it
prevents the compiler from eliding the read, but a) there's not really a
chance of that happening and b) it doesn't actually include a memory barrier,
so both before and after this change the value can be outdated...

I think that's ok, the spinlock further up provides enough of a barrier, but
that made the volatile useless anyway.




> @@ -1547,8 +1543,14 @@ AutoVacWorkerMain(const void *startup_data, size_t startup_data_len)
>               on_shmem_exit(FreeWorkerInfo, 0);
>
>               /* wake up the launcher */
> -             if (AutoVacuumShmem->av_launcherpid != 0)
> -                     kill(AutoVacuumShmem->av_launcherpid, SIGUSR2);
> +             launcherProc = pg_atomic_read_u32(&ProcGlobal->avLauncherProc);
> +             if (launcherProc != INVALID_PROC_NUMBER)
> +             {
> +                     int                     pid = GetPGProcByNumber(launcherProc)->pid;
> +

Maybe we should wrap that combination into a helper? Although I guess we're
going to try to get rid of the signals later anyway...


> @@ -709,6 +710,14 @@ InitAuxiliaryProcess(void)
>        */
>       PGSemaphoreReset(MyProc->sem);
>
> +     /* Some aux processes are also advertised in ProcGlobal */
> +     if (MyBackendType == B_AUTOVAC_LAUNCHER)
> +             pg_atomic_write_u32(&ProcGlobal->avLauncherProc, MyProcNumber);
> +     if (MyBackendType == B_WAL_WRITER)
> +             pg_atomic_write_u32(&ProcGlobal->walwriterProc, MyProcNumber);
> +     if (MyBackendType == B_CHECKPOINTER)
> +             pg_atomic_write_u32(&ProcGlobal->checkpointerProc, MyProcNumber);
> +

This would look cleaner if we went with my proposal for a generic
ProcGlobal->auxProc[proctype] approach :)


> Now that there are separate bits for different things, it's possible
> to do e.g. WaitInterrupt(INTERRUPT_CONFIG_RELOAD |
> INTERRUPT_QUERY_CANCEL) to wait just for a config reload request or
> query cancellation, and not wake up on other events like a shutdown
> request. That's not very useful in practice, as CHECK_FOR_INTERRUPTS()
> still processes those interrupts if you call it, and you would want to
> process all the standard interrupts promptly anyway.

I think it could be useful. There are some cases where we wake up too often to
effectively do nothing, because we immediately clear the latch after being
woken up, which then leads to being signalled again. If we avoid really waking
up and processing/clearing interrupt bits, we'd not get signalled again.


> diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
> index 89e187425cc..6be5ca9c183 100644
> --- a/contrib/pg_prewarm/autoprewarm.c
> +++ b/contrib/pg_prewarm/autoprewarm.c
> @@ -30,17 +30,15 @@
>
>  #include "access/relation.h"
>  #include "access/xact.h"
> +#include "ipc/interrupt.h"
>  #include "pgstat.h"
>  #include "postmaster/bgworker.h"
> -#include "postmaster/interrupt.h"

I'd do this part in a separate patch. Just because this commit is quite large
and anything that we can merge earlier on is a win.


> @@ -170,10 +168,13 @@ autoprewarm_main(Datum main_arg)
>       bool            final_dump_allowed = true;
>       TimestampTz last_dump_time = 0;
>
> -     /* Establish signal handlers; once that's done, unblock signals. */
> -     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
> -     pqsignal(SIGHUP, SignalHandlerForConfigReload);
> -     pqsignal(SIGUSR1, procsignal_sigusr1_handler);
> +     /*
> +      * We will check for INTERRUPT_TERMINATE explicitly in the main loop, so
> +      * disable the normal interrupt handler.
> +      */
> +     DisableInterrupt(INTERRUPT_TERMINATE);
> +
> +     /* Default signal handlers are OK for us; unblock signals. */
>       BackgroundWorkerUnblockSignals();

I also wonder if the whole signal handler replacement bit would better be done
in a followup commit.


>       /* Periodically dump buffers until terminated. */
> -     while (!ShutdownRequestPending)
> +     while (!InterruptPending(INTERRUPT_TERMINATE))
>       {
> -             /* In case of a SIGHUP, just reload the configuration. */
> -             if (ConfigReloadPending)
> -             {
> -                     ConfigReloadPending = false;
> +             /* Check for standard interrupts and config reload */
> +             CHECK_FOR_INTERRUPTS();
> +             if (ConsumeInterrupt(INTERRUPT_CONFIG_RELOAD))
>                       ProcessConfigFile(PGC_SIGHUP);
> -             }

This is probably for later, but I wonder if we ought to handle the
pending-config-reload by reacting to it via CFI(), and, in the processes that
don't want to do so immediately, defer processing it by masking the interrupt
most of the time.

Part of the motivation is this:

The current handling in backend processes IMO is somewhat wrong: We don't
reload the config until the next command arrives - which means that if you
e.g. adjust the idle session timeout, the new value won't be applied until the
next command arrives.


> @@ -2035,8 +2035,11 @@ spgdoinsert(Relation index, SpGistState *state,
>                * pending, break out of the loop and deal with the situation below.
>                * Set result = false because we must restart the insertion if the
>                * interrupt isn't a query-cancel-or-die case.
> +              *
> +              * FIXME: CheckForInterruptsMask covers more than just query cancel
> +              * and die.  Could we be more precise here?
>                */
> -             if (INTERRUPTS_PENDING_CONDITION())
> +             if (INTERRUPTS_PENDING_CONDITION(CheckForInterruptsMask))
>               {
>                       result = false;
>                       break;

I'm not quite following - previously we looked at InterruptPending, now we
check if any interrupt in CheckForInterruptsMask is set. Neither is just query
cancel and die, right?


> @@ -4500,11 +4473,11 @@ CheckForStandbyTrigger(void)
>       if (LocalPromoteIsTriggered)
>               return true;
>
> -     if (IsPromoteSignaled() && CheckPromoteSignal())
> +     if (InterruptPending(INTERRUPT_CHECK_PROMOTE) && CheckPromoteSignal())
>       {
>               ereport(LOG, (errmsg("received promote request")));
>               RemovePromoteSignalFiles();
> -             ResetPromoteSignaled();
> +             ClearInterrupt(INTERRUPT_CHECK_PROMOTE);
>               SetPromoteIsTriggered();
>               return true;
>       }

I think we need to clear the pending INTERRUPT_CHECK_PROMOTE regardless of
CheckPromoteSignal() being set, otherwise we'll do this over and over. In the
case of WaitForWALToBecomeAvailable() I think it'd turn into a busy loop.
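A toy model of that fix, with simplified stand-ins for the real functions: consuming the bit unconditionally means a wakeup without a promote file on disk can't keep re-triggering.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTERRUPT_CHECK_PROMOTE (UINT32_C(1) << 0)

/* pretend an INTERRUPT_CHECK_PROMOTE arrived */
static uint32_t pending = INTERRUPT_CHECK_PROMOTE;

/* atomic test-and-clear in reality; plain code is enough for the model */
static bool
consume_interrupt(uint32_t bit)
{
	bool		was_set = (pending & bit) != 0;

	pending &= ~bit;
	return was_set;
}

static bool
check_promote_signal(void)
{
	return false;				/* no promote signal file exists */
}

static bool
check_for_standby_trigger(void)
{
	/* clear the bit first, regardless of whether the file exists */
	if (consume_interrupt(INTERRUPT_CHECK_PROMOTE) && check_promote_signal())
		return true;
	return false;
}
```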


> @@ -4542,7 +4515,10 @@ CheckPromoteSignal(void)
>  void
>  WakeupRecovery(void)
>  {
> -     SetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
> +     ProcNumber      startupProc = pg_atomic_read_u32(&ProcGlobal->startupProc);
> +
> +     if (startupProc != INVALID_PROC_NUMBER)
> +             SendInterrupt(INTERRUPT_WAL_ARRIVED, startupProc);
>  }
>
>  /*

> @@ -298,14 +298,13 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
>               LWLockRelease(WaitLSNLock);
>
>               /*
> -              * Set latches for processes whose waited LSNs have been reached.
> -              * Since SetLatch() is a time-consuming operation, we do this outside
> -              * of WaitLSNLock. This is safe because procLatch is never freed, so
> -              * at worst we may set a latch for the wrong process or for no process
> -              * at all, which is harmless.
> +              * Wake up processes whose waited LSNs have been reached.  Since
> +              * SendInterrupt is a time-consuming operation, we do this outside of
> +              * WaitLSNLock. (If the backend exits or is no longer waiting, we'll
> +              * send spurious wakeup, but that's harmless)
>                */
>               for (j = 0; j < numWakeUpProcs; j++)
> -                     SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
> +                     SendInterrupt(INTERRUPT_WAIT_WAKEUP, wakeUpProcs[j]);
>
>

Orthogonal: Kinda wondering if interrupts (or latches) are really the right
mechanism for this; it's not just the startup process that can be interested
in new WAL arriving.  Recovery readahead, logical decoding, wait-for-lsn,
eventually the WAL writer (once we fix the stupidity of walreceiver always
fsyncing)...

This is kinda a poor person's emulation of a condition variable.


> @@ -146,7 +146,7 @@ throttle(bbsink_throttle *sink, size_t increment)
>               (sink->throttling_counter / sink->throttling_sample);
>
>       /*
> -      * Since the latch could be set repeatedly because of concurrently WAL
> +      * Since the interrupt could be set repeatedly because of concurrently WAL
>        * activity, sleep in a loop to ensure enough time has passed.
>        */
>       for (;;)
> @@ -163,22 +163,16 @@ throttle(bbsink_throttle *sink, size_t increment)
>               if (sleep <= 0)
>                       break;
>
> -             ResetLatch(MyLatch);
> -
> -             /* We're eating a potentially set latch, so check for interrupts */
>               CHECK_FOR_INTERRUPTS();
>
>               /*
>                * (TAR_SEND_SIZE / throttling_sample * elapsed_min_unit) should be
>                * the maximum time to sleep. Thus the cast to long is safe.
>                */
> -             wait_result = WaitLatch(MyLatch,
> -                                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> -                                     (long) (sleep / 1000),
> -                                     WAIT_EVENT_BASE_BACKUP_THROTTLE);
> -
> -             if (wait_result & WL_LATCH_SET)
> -                     CHECK_FOR_INTERRUPTS();
> +             wait_result = WaitInterrupt(CheckForInterruptsMask,
> +                                         WL_INTERRUPT | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +                                         (long) (sleep / 1000),
> +                                         WAIT_EVENT_BASE_BACKUP_THROTTLE);
>
>
>               /* Done waiting? */
>               if (wait_result & WL_TIMEOUT)

Not entirely sure the removal of the second CFI() is quite right - if you got
both an interrupt and a timeout (not likely, I know), you'd not process the
interrupt anymore, because of the "Done waiting?" block.


> @@ -2571,12 +2510,13 @@ HandleNotifyInterrupt(void)
>  void
>  ProcessNotifyInterrupt(bool flush)
>  {
> -     if (IsTransactionOrTransactionBlock())
> -             return;                                 /* not really idle */
> +     Assert(!IsTransactionOrTransactionBlock());
>
> -     /* Loop in case another signal arrives while sending messages */
> -     while (notifyInterruptPending)
> +     /* Loop in case another interrupt arrives while sending messages */
> +     do
> +     {
>               ProcessIncomingNotify(flush);
> +     } while (ConsumeInterrupt(INTERRUPT_ASYNC_NOTIFY));
>  }

At first I thought this would do one ProcessIncomingNotify() too many. But I
guess the logic is that the caller would already have cleared the interrupt
bit?  A bit subtle without a comment.


> @@ -2430,8 +2430,7 @@ vacuum_delay_point(bool is_analyze)
>       /* Always check for interrupts */
>       CHECK_FOR_INTERRUPTS();
>
> -     if (InterruptPending ||
> -             (!VacuumCostActive && !ConfigReloadPending))
> +     if (!VacuumCostActive && !InterruptPending(INTERRUPT_CONFIG_RELOAD))
>               return;
>
>       /*
> @@ -2440,9 +2439,9 @@ vacuum_delay_point(bool is_analyze)
>        * [autovacuum_]vacuum_cost_delay to take effect while a table is being
>        * vacuumed or analyzed.
>        */
> -     if (ConfigReloadPending && AmAutoVacuumWorkerProcess())
> +     if (InterruptPending(INTERRUPT_CONFIG_RELOAD) && AmAutoVacuumWorkerProcess())
>       {
> -             ConfigReloadPending = false;
> +             ClearInterrupt(INTERRUPT_CONFIG_RELOAD);
>               ProcessConfigFile(PGC_SIGHUP);
>               VacuumUpdateCosts();
>       }

I was worried for a moment that this would end up leading to busy looping if
INTERRUPT_CONFIG_RELOAD is pending, while not in an autovac process. But I
guess it's ok, because the sleep is done via pg_usleep() (gah), which won't
cut its sleep short if there are pending interrupts.



> +HOLD/RESUME_INTERRUPTS
> +----------------------
> +
> +Processing an interrupt may throw an error with ereport() (for
> +example, query cancellation) or terminate the process gracefully. In
> +some cases, we invoke CHECK_FOR_INTERRUPTS() inside low-level
> +subroutines that might sometimes be called in contexts that do *not*
> +want to allow interrupt processing.  The HOLD_INTERRUPTS() and
> +RESUME_INTERRUPTS() macros allow code to ensure that no interrupt
> +handlers are called, even if CHECK_FOR_INTERRUPTS() gets called in a
> +subroutine.  The interrupt will be held off until
> +CHECK_FOR_INTERRUPTS() is done outside any HOLD_INTERRUPTS() ...
> +RESUME_INTERRUPTS() section.
> +
> +HOLD_INTERRUPTS() blocks can be nested. If HOLD_INTERRUPTS() is called
> +multiple times, interrupts are held until every HOLD_INTERRUPTS() call
> +has been balanced with a RESUME_INTERRUPTS() call.

I wonder if HOLD_INTERRUPTS() really makes sense anymore. ISTM that a good bit
of the time we really want to disable some interrupts from being
processed. This patch allows for that - but does that by basically
saving/restoring the interrupt mask in user code.

Perhaps we should make interrupt masking be more stack like? So that you can
do something like

PUSH_INTERRUPT_MASK(QUERY_CANCEL);
/* code that doesn't want query cancellation */
POP_INTERRUPT_MASK();

That would require a stack of masks, instead of just a counter, but I think
that may be ok? We'd only very rarely need to extend it.
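Something like this, as a rough sketch (depth and names made up): each PUSH saves the current enabled-mask and masks out the given bits, each POP restores the previous mask.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_STACK_DEPTH 32

static uint64_t enabled_interrupts = ~UINT64_C(0);
static uint64_t mask_stack[MASK_STACK_DEPTH];
static int	mask_stack_depth = 0;

static void
push_interrupt_mask(uint64_t disable_bits)
{
	assert(mask_stack_depth < MASK_STACK_DEPTH);
	mask_stack[mask_stack_depth++] = enabled_interrupts;
	enabled_interrupts &= ~disable_bits;
}

static void
pop_interrupt_mask(void)
{
	assert(mask_stack_depth > 0);
	enabled_interrupts = mask_stack[--mask_stack_depth];
}
```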


> +Sending interrupts
> +------------------
> +
> +Interrupt flags can be "raised" in different ways:
> +- synchronously by code that wants to defer an action, by calling
> +  RaiseInterrupt,
> +- asynchronously by timers or other signal handlers, also by calling
> +  RaiseInterrupt, or
> +- "sent" by other backends with SendInterrupt

It's not clear to me what synchronously and asynchronously mean here.


> +Unix Signals
> +============
> +
> +Postmaster and most child processes also respond to a few standard
> +Unix signals:
> +
> +SIGHUP -> Raises INTERRUPT_CONFIG_RELOAD
> +SIGINT -> Raises INTERRUPT_QUERY_CANCEL
> +SIGTERM -> Raises INTERRUPT_TERMINATE
> +SIGQUIT -> immediate shutdown, abort the process
> +
> +pg_ctl uses these Unix signals to tell postmaster to reload config,
> +stop, etc. These are also mentioned in the user documentation.
> +
> +Postmaster cannot send interrupts to processes without PGPROC entries
> +(just syslogger nowadays), and it doesn't know the PGPROC entries of
> +other child processes anyway. Hence, it still uses the above Unix
> +signals for postmaster -> child signaling. The only exception is when
> +postmaster notifies a backend that a bgworker it launched has
> +exited. Postmaster sends that interrupt directly. The backend
> +registers explicitly for that notification, and supplies the
> +ProcNumber to postmaster when registering.

Hm, seems a bit weird to have a bespoke mechanism for bgworkers. But I guess
they'd need special infrastructure anyway, as, even if postmaster knew the
pgprocs of child processes, the launcher's proc number can't directly be
inferred from the bgworker's proc.


> +TODO: Open questions
> +====================
> +
> +TODO: Put pendingInterrupts in PMChildSlot or PGPROC?
> +----------------------------------------------------
> +
> +Postmaster cannot send interrupts to processes without PGPROC
> +entries. Hence, it still uses plain Unix signals. Would be nice if
> +postmaster could send interrupts directly. If we moved the interrupt
> +mask to PMChildSlot, it could.  However, PGPROC seems like a more
> +natural location otherwise.
> +
> +Options:
> +
> +a) Move pendingInterrupts to PMChildSlot. That way, you can also send signals
> +   to processes that don't have a PGPROC entry.
> +
> +b) Add a pendingInterrupts mask to PMChildSlot, but also have it in PGPROC.
> +   When postmaster sends an interrupt, it can check if it has a PGPROC entry,
> +   and if not, use the pendingInterrupts field in PMChildSlot instead.

Hm. Do we actually ever process interrupts before we have a pgproc? I think we
have just about all signals masked before that?


> +c) Assign a PGPROC entry for every child process in postmaster already.
> +   (Except dead-end backends).
> +
> +d) Keep it as it is, continue to use signals for postmaster -> child
> +   signaling

e) update PMChildSlot in Init[Auxiliary]Process(), [Auxiliary]ProcKill(), so
   that postmaster knows how to get the PGPROC?


> +TODO: Unique session id
> +----------------------
> +
> +In some places, we read the pid of a process (from PGPROC or
> +elsewhere), and send signal to it, accepting that the process may have
> +already exited.  If we directly replace those uses of 'pid' with a
> +ProcNumber, the ProcNumber might get reused much faster than the pid
> +would. Solutions:
> +
> +a) Introduce the concept of a unique session ID that is never recycled
> +   (or not for a long time, anyway). When reading the ProcNumber to
> +   send an interrupt to, also read the session ID. When sending the
> +   interrupt, check that the session ID matches.

How about we make that session ID something like a combination of the proc
number and a generation? It seems useful to be able to efficiently get the
proc number from the session id.
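A possible encoding, purely illustrative: (generation << 32) | proc_number. The proc number falls out with a truncating cast, and a recycled PGPROC slot gets a bumped generation, so stale session IDs won't match.

```c
#include <stdint.h>

typedef uint64_t SessionId;

static inline SessionId
make_session_id(uint32_t proc_number, uint32_t generation)
{
	return ((uint64_t) generation << 32) | proc_number;
}

static inline uint32_t
session_id_proc_number(SessionId id)
{
	return (uint32_t) id;
}

static inline uint32_t
session_id_generation(SessionId id)
{
	return (uint32_t) (id >> 32);
}
```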


> +- Check performance of CHECK_FOR_INTERRUPTS(). It's very cheap, but
> +  it's nevertheless more instructions than it used to be.

Yea, it looks like it's a bit more expensive now. In an optimized build it
boils down to:

/* load address of CheckForInterruptsMask & MyPendingInterrupts */
mov    $0x1404c48,%r15
mov    $0x141cdc8,%rax
/* load value of CheckForInterruptsMask & MyPendingInterrupts */
mov    (%rax),%rdx
mov    (%r15),%rax
/* test if CheckForInterruptsMask == *MyPendingInterrupts */
test   %rdx,(%rax)
jne    160

The two address loads are annoying. I think you should put the various
interrupt related variables into one struct. That way both variables can be
loaded from one address (with an offset in the load), making the code denser
and reducing register pressure a bit.  I checked and the code is a bit nicer
that way.

The indirection via MyPendingInterrupts is new overhead, but I don't
immediately have a better idea.
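E.g. roughly this (hypothetical names), so the compiler can address both fields relative to a single base:

```c
#include <stdint.h>

typedef struct InterruptState
{
	uint64_t   *pending;		/* -> mask in PGPROC, or local fallback */
	uint64_t	check_for_interrupts_mask;
} InterruptState;

static uint64_t local_pending;
static InterruptState MyInterruptState = {&local_pending, ~UINT64_C(0)};

/* one base register, two offset loads */
#define INTERRUPT_PENDING_CONDITION() \
	((*MyInterruptState.pending & MyInterruptState.check_for_interrupts_mask) != 0)
```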


> +/*
> + * Currently installed interrupt handlers
> + */
> +static pg_interrupt_handler_t interrupt_handlers[64];

Probably worth defining that somewhere, instead of using a plain 64 here.


> +/*
> + * XXX: is 'volatile' still needed on all the variables below? Which ones are
> + * accessed from signal handlers?
> + */

There's different reasons volatile is needed for variables in signal handlers:

1) for modifications outside of signal handlers, the modification needs to
   actually be written to memory, instead of just being kept in a register

2) for modifications inside signal handlers, the code outside of signal
   handlers needs to always read the variable from memory, instead of
   potentially from a register.

I don't think we have to worry about 2) anymore, as the only variable that is
modified from within signal handlers is MyPendingInterrupts, and that already
ensures that the data is immediately written to memory by using an atomic.

For 1), I don't think we care either, because afaict we don't read any of the
relevant variables in signal handlers?


> +/*
> + * Install an interrupt handler callback function for the given interrupt.
> + *
> + * You need to also enable the interrupt with EnableInterrupt(), unless 
> you're
> + * replacing an existing handler function.
> + */
> +void
> +SetInterruptHandler(InterruptMask interruptMask, pg_interrupt_handler_t handler)
> +{

Hm, I think we should assert that interruptMask only has a single bit set?
Because I don't think this will work right if multiple bits are set. And then
perhaps rename the argument, so that it's clearer that we're just dealing with
a single interrupt?


> +     /*
> +      * XXX: It's somewhat inefficient to loop through all the bits, but this
> +      * isn't performance critical.
> +      */
> +     for (int i = 0; i < lengthof(interrupt_handlers); i++)
> +     {
> +             if ((interruptMask & UINT64_BIT(i)) != 0)
> +             {
> +                     /* Replace old handler */
> +                     interrupt_handlers[i] = handler;
> +             }
> +     }
> +}
> +

Maybe I'm daft, but isn't the relevant offset in interrupt_handlers[] the bit
that is being set? So pg_rightmost_one_pos64() should do the trick? And
pg_popcount64() for detecting if there is more than one bit set.
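I.e. something like this (GCC/Clang builtins for brevity; in the tree it would be pg_rightmost_one_pos64() and pg_popcount64()):

```c
#include <assert.h>
#include <stdint.h>

typedef void (*pg_interrupt_handler_t) (void);

static pg_interrupt_handler_t interrupt_handlers[64];

static void
set_interrupt_handler(uint64_t interrupt_bit, pg_interrupt_handler_t handler)
{
	/* exactly one bit set, and its position is the handler slot */
	assert(__builtin_popcountll(interrupt_bit) == 1);
	interrupt_handlers[__builtin_ctzll(interrupt_bit)] = handler;
}

static void
noop_handler(void)
{
}
```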



> +/* Enable an interrupt to be processed by CHECK_FOR_INTERRUPTS() */
> +void
> +EnableInterrupt(InterruptMask interruptMask)
> +{

I guess this actually does work fine if multiple bits are set...


> +/*
> + * ProcessInterrupts: out-of-line portion of CHECK_FOR_INTERRUPTS() macro
> + *
> + * If an interrupt condition is pending, and it's safe to service it,
> + * then clear the flag and call the interrupt handler.
> + *
> + * Note: if INTERRUPTS_CAN_BE_PROCESSED(interrupt) is true, then
> + * ProcessInterrupts is guaranteed to clear the given interrupt before
> + * returning, if it was set when entering.  (This is not the same as
> + * guaranteeing that it's still clear when we return; another interrupt could
> + * have arrived.  But we promise that any pre-existing one will have been
> + * serviced.)
> + */
> +void
> +ProcessInterrupts(void)
> +{
> +     InterruptMask interruptsToProcess;
> +
> +     Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0);

Perhaps also assert that CheckForInterruptsMask isn't 0?


> +
> +/*
> + * Switch to local interrupts.  Other backends can't send interrupts to this
> + * one.  Only RaiseInterrupt() can set them, from inside this process.
> + */

When do we actually need this? During BackendInitialize() the signal handlers
that are set up just do _exit(1) and thus don't need to signal
interrupts. After BackendInitialize(), but before InitProcess() all signals
are masked.

I guess it may not really matter, because we will still need the indirection
to PGPROC->interruptMask during CFI().


> +void
> +SwitchToLocalInterrupts(void)
> +{
> +     if (MyPendingInterrupts == &LocalPendingInterrupts)
> +             return;

ISTM that should be an assert, because something has seriously gone wrong if
that ever happens.


> +     MyPendingInterrupts = &LocalPendingInterrupts;
> +
> +     /*
> +      * Make sure that SIGALRM handlers that call RaiseInterrupt() are now
> +      * seeing the new MyPendingInterrupts destination.
> +      */
> +     pg_memory_barrier();

I don't think that can ever require a memory barrier - at most a compiler
barrier. Signal handlers and the main execution thread run with the same view
of the memory. The only possible inconsistency is when values are in
registers, rather than reading from memory.


> +     /*
> +      * Mix in the interrupts that we have received already in our shared
> +      * interrupt vector, while atomically clearing it.  Other backends may
> +      * continue to set bits in it after this point, but we've atomically
> +      * transferred the existing bits to our local vector so we won't get
> +      * duplicated interrupts later if we switch back.
> +      */

Switch back? We should never do that, right?


> +/*
> + * Set an interrupt flag in this backend.
> + *
> + * Note: This is called from signal handlers, so needs to be async-signal
> + * safe!
> + */
> +void
> +RaiseInterrupt(InterruptMask interruptMask)
> +{
> +     uint64          old_pending;
> +
> +     old_pending = pg_atomic_fetch_or_u64(MyPendingInterrupts, interruptMask);
> +
> +     /*
> +      * If the process is currently blocked waiting for an interrupt to arrive,
> +      * and the interrupt wasn't already pending, wake it up.
> +      */
> +     if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
> +             WakeupMyProc();
> +}

FWIW, on x86 fetch_or()/fetch_and() are more costly when the old value is
returned (because "lock or" doesn't return the old value, so it has to be
implemented as a CAS loop). And CAS loops are considerably more expensive than
a single atomic op.

I don't think SLEEPING_ON_INTERRUPTS could be set while RaiseInterrupt() is
running, so it should be ok to just read the variable to check for that.

Come to that, I think RaiseInterrupt() could use a non-modifying read to see
if the bit is already set and just not do an atomic op in that case? That
obviously won't work in SendInterrupt(), but here it may?
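Roughly what I have in mind, as a self-contained sketch using C11 atomics in place of pg_atomic_* (the SLEEPING_ON_INTERRUPTS bit position and the WakeupMyProc() stand-in are made up for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SLEEPING_ON_INTERRUPTS	(UINT64_C(1) << 63)	/* assumed bit position */

static _Atomic uint64_t pending;	/* stands in for *MyPendingInterrupts */
static int	wakeups;

static void
WakeupMyProc(void)
{
	wakeups++;					/* stand-in for the real wakeup */
}

static void
RaiseInterrupt(uint64_t interruptMask)
{
	/* Non-modifying read first: if the bit is already set, skip the RMW. */
	if (atomic_load(&pending) & interruptMask)
		return;

	/*
	 * Discarding the result lets the compiler emit a plain "lock or" on
	 * x86 instead of a CAS loop.
	 */
	atomic_fetch_or(&pending, interruptMask);

	/*
	 * SLEEPING_ON_INTERRUPTS cannot become set concurrently while this
	 * process itself is executing, so a separate plain read is enough.
	 */
	if (atomic_load(&pending) & SLEEPING_ON_INTERRUPTS)
		WakeupMyProc();
}
```

As noted, this only works for RaiseInterrupt(), where the target is the current process; SendInterrupt() would still need the old-value check.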


Could we have the mask of interrupts that WaitInterrupt() is waiting for in a
second variable? That way we could avoid interrupting WaitInterrupt() when
raising or sending an interrupt that WaitInterrupt() is not waiting for.  I
think that can be done race-freely with a bit of care?


> +/*
> + * Set an interrupt flag in another backend.
> + *
> + * Note: This can also be called from the postmaster, so be careful to not
> + * trust the contents of shared memory.
> + *
> + * FIXME: it's easy to accidentally swap the order of the args.  Could we have
> + * stricter type checking?
> + */

We could wrap InterruptMask or ProcNumber in a struct, but both seem somewhat
painful :(.
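For what the struct-wrapping variant could look like (all names here are made up for illustration, not proposals for the patch):

```c
#include <assert.h>
#include <stdint.h>

typedef int ProcNumber;

/* Single-member wrapper so the compiler rejects swapped arguments. */
typedef struct InterruptMaskArg
{
	uint64_t	mask;
} InterruptMaskArg;

#define INTERRUPT_MASK(bit)	((InterruptMaskArg) {UINT64_C(1) << (bit)})

static uint64_t last_sent_mask;
static ProcNumber last_sent_procno;

static void
SendInterrupt(InterruptMaskArg interruptMask, ProcNumber pgprocno)
{
	/* SendInterrupt(pgprocno, interruptMask) now fails to compile. */
	last_sent_mask = interruptMask.mask;
	last_sent_procno = pgprocno;
}
```

The annoyance is that every call site and every mask constant has to go through the wrapper, which is the "somewhat painful" part.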


> +void
> +SendInterrupt(InterruptMask interruptMask, ProcNumber pgprocno)
> +{

Should this actually be a mask, rather than a single value? Not sure I see
cases where it'd make sense to send multiple interrupts at once.


> +/*
> + * Wait for any of the interrupts in interruptMask to be set, or for
> + * postmaster death, or until timeout is exceeded. 'wakeEvents' is a bitmask
> + * that specifies which of those events to wait for. If the interrupt is
> + * already pending (and WL_INTERRUPT is given), the function returns
> + * immediately.

I think it's somewhat confusing that you can wait for interrupts, without
waiting for interrupts.  And having interrupts and wakeEvents in two
consecutive arguments also seems somewhat confusing :(.

When would you use WaitInterrupt() without WL_INTERRUPT?  Ugh, seems we have
some cases that use WaitInterruptOrSocket(0) during connection establishment.
We really ought to fix that, but this patch is probably not the time to do so.


> +/* Reserve an interrupt bit for use in an extension */
> +InterruptMask
> +RequestAddinInterrupt(void)
> +{
> +     InterruptMask result;
> +
> +     if (nextAddinInterruptBit == END_ADDIN_INTERRUPTS)
> +             elog(ERROR, "out of addin interrupt bits");
> +
> +     result = UINT64_BIT(nextAddinInterruptBit);
> +     nextAddinInterruptBit++;
> +     return result;
> +}

Hm. I think this either would need to ensure it's only run in the postmaster,
or have nextAddinInterruptBit in shared memory? Otherwise two extensions
loaded dynamically in different processes will get the same bit?


> diff --git a/src/backend/ipc/signal_handlers.c b/src/backend/ipc/signal_handlers.c
> new file mode 100644
> index 00000000000..589f755622a
> --- /dev/null
> +++ b/src/backend/ipc/signal_handlers.c
> @@ -0,0 +1,160 @@
> +/*-------------------------------------------------------------------------
> + *
> + * signal_handlers.c
> + *     Standard signal handlers.

Why not introduce this in a preparatory commit?


> +/*
> + * Set the standard signal handlers suitable for most postmaster child
> + * processes.
> + *
> + * Note: this doesn't unblock the signals yet. You can make additional
> + * pqsignal() calls to modify the default behavior before unblocking.
> + */
> +void
> +SetPostmasterChildSignalHandlers(void)
> +{

PostgresMain() seems to redo a bunch of this? And not just for single user
mode, but always?


> --- a/src/backend/meson.build
> +++ b/src/backend/meson.build
> @@ -15,6 +15,7 @@ subdir('catalog')
>  subdir('commands')
>  subdir('executor')
>  subdir('foreign')
> +subdir('ipc')
>  subdir('jit')
>  subdir('lib')
>  subdir('libpq')

I'm all on board with a top-level ipc dir, but it seems somewhat odd to have
two ipc directories...


> +     /* We can safely reload config at any CHECK_FOR_INTERRUPTS() */
> +     SetInterruptHandler(INTERRUPT_CONFIG_RELOAD, ProcessAutoVacLauncherConfigReloadInterrupt);

Hm. What if a second INTERRUPT_CONFIG_RELOAD arrives while autovac is in the
middle of ProcessAutoVacLauncherConfigReloadInterrupt? If there is any CFI()
buried within that, it could lead to
ProcessAutoVacLauncherConfigReloadInterrupt() being re-entered.

I think we may need to block interrupts while calling callbacks in
ProcessInterrupts().
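One way to picture the hazard and the guard, with a plain holdoff counter standing in for the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() machinery (the callback body and the single-bit mask are simplifications, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

#define INTERRUPT_CONFIG_RELOAD		UINT64_C(1)

static uint64_t pendingMask;
static int	InterruptHoldoffCount;
static int	reload_calls;

static void ProcessInterrupts(void);

/*
 * Stand-in for ProcessAutoVacLauncherConfigReloadInterrupt(): somewhere
 * inside it, a CFI() is reached while a second reload has already arrived.
 */
static void
ConfigReloadCallback(void)
{
	reload_calls++;
	if (reload_calls == 1)
		pendingMask |= INTERRUPT_CONFIG_RELOAD;	/* second reload arrives */
	ProcessInterrupts();		/* nested CFI(): no-op thanks to the holdoff */
}

static void
ProcessInterrupts(void)
{
	if (InterruptHoldoffCount > 0)
		return;

	InterruptHoldoffCount++;	/* block re-entry while in callbacks */
	while (pendingMask & INTERRUPT_CONFIG_RELOAD)
	{
		pendingMask &= ~INTERRUPT_CONFIG_RELOAD;
		ConfigReloadCallback();
	}
	InterruptHoldoffCount--;
}
```

With the holdoff in place, the second reload is handled iteratively by the outer loop instead of recursively re-entering the callback.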



Ran out of time & energy at this point...

But had already noticed this:

>
> -                     pgaio_debug_io(DEBUG4, staged_ios[i],
> +                     pgaio_debug_io(LOG, staged_ios[i],
>                                                  "choosing worker %d",
>                                                  worker);
>               }
>       }

Probably unintentional to have the s/DEBUG4/LOG/ here :)


Greetings,

Andres Freund

