On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Yeah.  If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers.  Somebody needs to go through 'em
> all and look for touches of shared memory.

I haven't made a comprehensive study of every signal handler we have,
but looking at procsignal_sigusr1_handler, the list of functions that
can get called from here is quite short: CheckProcSignal(),
RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().
Taking those in reverse order:

- latch_sigusr1_handler() is fine.  Nothing down this path touches
shared memory; moreover, if we've already disowned our latch, the
waiting flag won't be set and this will do nothing at all.
- The call to SetLatch() is problematic as we already know.  This is
new code in 9.4.
- RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is
set.  So it's fine.
- CheckProcSignal() also appears problematic.  If we've already
detached shared memory, MyProcSignalSlot will be pointing to garbage,
but we'll try to dereference it anyway.

I think maybe the best fix is to *clear* MyProc in
ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
CleanupProcSignalState, as shown in the attached patch.  Most places
that dereference those pointers already check that they aren't null,
and we can easily add a NULL guard to the SetLatch() call in
procsignal_sigusr1_handler, which the attached patch also does.

This might not be a complete fix to every problem of this type that
exists anywhere in our code, but I think it's enough to make the world
safe for procsignal_sigusr1_handler.  We also have a *large* number of
signal handlers that do little more than this:

    if (MyProc)
        SetLatch(&MyProc->procLatch);

...and this change should make all of those safe as well.  So I think
this is a pretty good start.

Assuming nobody objects too much to this basic approach, should I
back-patch the parts of this that apply pre-9.4?  The problem with
CleanupProcSignalState, at least, goes all the way back to 9.0, when
the signal-multiplexing infrastructure was introduced.  But the
probability of an actual crash must be pretty low, or I imagine we
would have noticed this sooner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 6ebabce..1372a7e 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -149,6 +149,8 @@ CleanupProcSignalState(int status, Datum arg)
 	slot = &ProcSignalSlots[pss_idx - 1];
 	Assert(slot == MyProcSignalSlot);
 
+	MyProcSignalSlot = NULL;
+
 	/* sanity check */
 	if (slot->pss_pid != MyProcPid)
 	{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..f64e1c4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -772,6 +772,7 @@ ProcKill(int code, Datum arg)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile PROC_HDR *procglobal = ProcGlobal;
+	PGPROC	   *proc;
 
 	Assert(MyProc != NULL);
 
@@ -796,31 +797,34 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
-	/* Release ownership of the process's latch, too */
-	DisownLatch(&MyProc->procLatch);
+	/*
+	 * Clear MyProc first; then disown the process latch.  This is so that
+	 * signal handlers won't try to clear the process latch after it's no
+	 * longer ours.
+	 */
+	proc = MyProc;
+	MyProc = NULL;
+	DisownLatch(&proc->procLatch);
 
 	SpinLockAcquire(ProcStructLock);
 
 	/* Return PGPROC structure (and semaphore) to appropriate freelist */
 	if (IsAnyAutoVacuumProcess())
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
-		procglobal->autovacFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
+		procglobal->autovacFreeProcs = proc;
 	}
 	else if (IsBackgroundWorker)
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
-		procglobal->bgworkerFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
+		procglobal->bgworkerFreeProcs = proc;
 	}
 	else
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
-		procglobal->freeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
+		procglobal->freeProcs = proc;
 	}
 
-	/* PGPROC struct isn't mine anymore */
-	MyProc = NULL;
-
 	/* Update shared estimate of spins_per_delay */
 	procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
 
@@ -849,6 +853,7 @@ AuxiliaryProcKill(int code, Datum arg)
 {
 	int			proctype = DatumGetInt32(arg);
 	PGPROC	   *auxproc PG_USED_FOR_ASSERTS_ONLY;
+	PGPROC	   *proc;
 
 	Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
 
@@ -859,16 +864,19 @@ AuxiliaryProcKill(int code, Datum arg)
 	/* Release any LW locks I am holding (see notes above) */
 	LWLockReleaseAll();
 
-	/* Release ownership of the process's latch, too */
-	DisownLatch(&MyProc->procLatch);
+	/*
+	 * Clear MyProc first; then disown the process latch.  This is so that
+	 * signal handlers won't try to clear the process latch after it's no
+	 * longer ours.
+	 */
+	proc = MyProc;
+	MyProc = NULL;
+	DisownLatch(&proc->procLatch);
 
 	SpinLockAcquire(ProcStructLock);
 
 	/* Mark auxiliary proc no longer in use */
-	MyProc->pid = 0;
-
-	/* PGPROC struct isn't mine anymore */
-	MyProc = NULL;
+	proc->pid = 0;
 
 	/* Update shared estimate of spins_per_delay */
 	ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to