On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible, > then this is really just one instance of a bug that's probably pretty > widespread in our signal handlers. Somebody needs to go through 'em > all and look for touches of shared memory.
I haven't made a comprehensive study of every signal handler we have, but looking at procsignal_sigusr1_handler, the list of functions that can get called from here is quite short: CheckProcSignal(), RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler(). Taking those in reverse order: - latch_sigusr1_handler() is fine. Nothing down this path touches shared memory; moreover, if we've already disowned our latch, the waiting flag won't be set and this will do nothing at all. - The call to SetLatch() is problematic as we already know. This is new code in 9.4. - RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is set. So it's fine. - CheckProcSignal() also appears problematic. If we've already detached shared memory, MyProcSignalSlot will be pointing to garbage, but we'll try to dereference it anyway. I think maybe the best fix is to *clear* MyProc in ProcKill/AuxiliaryProcKill and MyProcSignalSlot in CleanupProcSignalState, as shown in the attached patch. Most places that dereference those pointers already check that they aren't null, and we can easily add a NULL guard to the SetLatch() call in procsignal_sigusr1_handler, which the attached patch also does. This might not be a complete fix to every problem of this type that exists anywhere in our code, but I think it's enough to make the world safe for procsignal_sigusr1_handler. We also have a *large* number of signal handlers that do little more than this: if (MyProc) SetLatch(&MyProc->procLatch); ...and this change should make all of those safe as well. So I think this is a pretty good start. Assuming nobody objects too much to this basic approach, should I back-patch the parts of this that apply pre-9.4? The problem with CleanupProcSignalState, at least, goes all the way back to 9.0, when the signal-multiplexing infrastructure was introduced. But the probability of an actual crash must be pretty low, or I imagine we would have noticed this sooner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index 6ebabce..1372a7e 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -149,6 +149,8 @@ CleanupProcSignalState(int status, Datum arg) slot = &ProcSignalSlots[pss_idx - 1]; Assert(slot == MyProcSignalSlot); + MyProcSignalSlot = NULL; + /* sanity check */ if (slot->pss_pid != MyProcPid) { diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index ee6c24c..f64e1c4 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -772,6 +772,7 @@ ProcKill(int code, Datum arg) { /* use volatile pointer to prevent code rearrangement */ volatile PROC_HDR *procglobal = ProcGlobal; + PGPROC *proc; Assert(MyProc != NULL); @@ -796,31 +797,34 @@ ProcKill(int code, Datum arg) */ LWLockReleaseAll(); - /* Release ownership of the process's latch, too */ - DisownLatch(&MyProc->procLatch); + /* + * Clear MyProc first; then disown the process latch. This is so that + * signal handlers won't try to clear the process latch after it's no + * longer ours. + */ + proc = MyProc; + MyProc = NULL; + DisownLatch(&proc->procLatch); SpinLockAcquire(ProcStructLock); /* Return PGPROC structure (and semaphore) to appropriate freelist */ if (IsAnyAutoVacuumProcess()) { - MyProc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs; - procglobal->autovacFreeProcs = MyProc; + proc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs; + procglobal->autovacFreeProcs = proc; } else if (IsBackgroundWorker) { - MyProc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs; - procglobal->bgworkerFreeProcs = MyProc; + proc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs; + procglobal->bgworkerFreeProcs = proc; } else { - MyProc->links.next = (SHM_QUEUE *) procglobal->freeProcs; - procglobal->freeProcs = MyProc; + proc->links.next = (SHM_QUEUE *) procglobal->freeProcs; + procglobal->freeProcs = proc; } - /* PGPROC struct isn't mine anymore */ - MyProc = NULL; - /* Update shared estimate of spins_per_delay */ procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay); @@ -849,6 +853,7 @@ AuxiliaryProcKill(int code, Datum arg) { int proctype = DatumGetInt32(arg); PGPROC *auxproc PG_USED_FOR_ASSERTS_ONLY; + PGPROC *proc; Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS); @@ -859,16 +864,19 @@ AuxiliaryProcKill(int code, Datum arg) /* Release any LW locks I am holding (see notes above) */ LWLockReleaseAll(); - /* Release ownership of the process's latch, too */ - DisownLatch(&MyProc->procLatch); + /* + * Clear MyProc first; then disown the process latch. This is so that + * signal handlers won't try to clear the process latch after it's no + * longer ours. + */ + proc = MyProc; + MyProc = NULL; + DisownLatch(&proc->procLatch); SpinLockAcquire(ProcStructLock); /* Mark auxiliary proc no longer in use */ - MyProc->pid = 0; - - /* PGPROC struct isn't mine anymore */ - MyProc = NULL; + proc->pid = 0; /* Update shared estimate of spins_per_delay */ ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers