On Mon, Oct 19, 2015 at 12:02 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Oct 17, 2015 at 9:16 PM, Andrew Dunstan <and...@dunslane.net> wrote:
>> If all that is required is a #define, like CLOBBER_CACHE_ALWAYS, then no
>> special buildfarm support is required - you would just add that to the
>> animal's config file, more or less like this:
>>
>>      config_env =>
>>      {
>>          CPPFLAGS => '-DGRATUITOUSLY_PARALLEL',
>>      },
>>
>> I try to make things easy :-)
>
> Wow, that's great.  So, I'll try to rework the test code I posted
> previously into something less hacky, and eventually add a #define
> like this so we can run it on the buildfarm.  There's a few other
> things that need to get done before that really makes sense - like
> getting the rest of the bug fix patches committed - otherwise any
> buildfarm critters we add will just be permanently red.

OK, so after a bit more delay than I would have liked, I now have a
working set of patches that we can use to ensure automated testing of
the parallel mode infrastructure.  I ended up doing something that
does not require a #define, so I'll need some guidance on what to do
on the BF side given that context.  Please find attached three
patches, two of them for commit.

group-locking-v1.patch is a vastly improved version of the group
locking patch that we discussed, uh, extensively last year.  I realize
that there was a lot of doubt about this approach, but I still believe
it's the right approach: I have put a lot of work into making it work
correctly; I don't think anyone has come up with a really plausible
alternative (except one other approach I tried, which turned out to
work but with significantly more restrictions); and I'm committed to
fixing it in whatever way is necessary if it turns out to be broken,
even if that amounts to a full rewrite.  Review is welcome,
but I honestly believe it's a good idea to get this into the tree
sooner rather than later at this point, because automated regression
testing falls to pieces without these changes, and I believe that
automated regression testing is a really good idea to shake out
whatever bugs we may have in the parallel query stuff.  The code in
this patch is all mine, but Amit Kapila deserves credit as co-author
for doing a lot of prototyping (that ended up getting tossed) and
testing.  This patch includes comments and an addition to
src/backend/storage/lmgr/README which explain in more detail what this
patch does, how it does it, and why that's OK.
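
To make the problem concrete, here's an illustrative sketch (the table
name is arbitrary, and force_parallel_mode comes from the second patch
below): without group locking, the worker's lock request would queue
behind the leader's own AccessExclusiveLock while the leader waits
forever for the worker to finish:

BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;
SET force_parallel_mode = on;
SELECT * FROM t;   -- without group locking, the worker would block on
                   -- the leader's lock: an undetected self-deadlock
COMMIT;

With group locking, the worker's AccessShareLock on t doesn't conflict
with the leader's lock, so the query completes normally.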

force-parallel-mode-v1.patch is what adds the actual infrastructure
for automated testing.  You can set force_parallel_mode=on to force
queries to be run in a worker whenever possible; this can help test
whether your user-defined functions have been erroneously labeled as
PARALLEL SAFE.  If they error out or misbehave with this setting
enabled, you should label them PARALLEL RESTRICTED or PARALLEL UNSAFE.
If you set force_parallel_mode=regress, then some additional changes
intended specifically for regression testing kick in; those changes
are intended to ensure that you get exactly the same output from
running the regression tests with the parallelism infrastructure
forcibly enabled as you would have gotten anyway.  Most of this code
is mine, but there are also contributions from Amit Kapila and Rushabh
Lathia.
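
As a (hypothetical) illustration of the sort of mislabeling this is
meant to catch: nextval() isn't allowed in parallel mode, so a function
like the one below, wrongly declared PARALLEL SAFE, should error out
once force_parallel_mode pushes it into a worker:

CREATE SEQUENCE counter_seq;
CREATE FUNCTION bump_counter() RETURNS bigint
    LANGUAGE sql PARALLEL SAFE              -- mislabeled!
    AS 'SELECT nextval(''counter_seq'')';

SET force_parallel_mode = on;
SELECT bump_counter();   -- fails under the parallel-mode restrictions,
                         -- exposing the bogus PARALLEL SAFE label

Relabeling the function PARALLEL UNSAFE makes the error go away, at the
cost of disabling parallelism for queries that call it.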

With both of these patches, you can create a file that says:

force_parallel_mode=regress
max_parallel_degree=2

Then you can run: make check-world TEMP_CONFIG=/path/to/aforementioned/file

If you do, you'll find that while the core regression tests pass
(whee!) the pg_upgrade regression tests fail (oops) because of a
pre-existing bug in the parallelism code introduced by neither of
these two patches.  I'm not exactly sure how to fix that bug yet - I
have a couple of ideas - but I think the fact that this test code
promptly found a bug is a good sign that it provides enough test
coverage to be useful.  Sticking a Gather node on top of every query
where it looks safe just turns out to exercise a lot of things: the
code that decides whether it's safe to add that Gather node, the code
to launch and manage parallel workers, the code those workers
themselves run, etc.  The point is just to force as much of the
parallel code to be used as possible even when it's not expected to
make anything faster.
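
For instance (a sketch; the plan details will vary), the difference
between the two settings is visible with EXPLAIN:

SET force_parallel_mode = on;
EXPLAIN (COSTS OFF) SELECT * FROM t;
-- shows a single-copy Gather node stuck on top of the scan

SET force_parallel_mode = regress;
EXPLAIN (COSTS OFF) SELECT * FROM t;
-- same plan, but the added Gather is marked invisible and suppressed
-- from the output, so it matches what force_parallel_mode=off prints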

test-group-locking-v1.patch is useful for testing possible deadlock
scenarios with the group locking patch.  It's not otherwise safe to
use this, like, at all, and the patch is not proposed for commit.
This patch is entirely by Amit Kapila.
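
In case anyone wants to poke at it, here's a sketch of driving the
module from two psql sessions (the PID and table name are
illustrative):

-- session 1
CREATE EXTENSION test_group_deadlocks;
SELECT pg_backend_pid();           -- suppose this returns 12345
SELECT become_lock_group_leader();
BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;

-- session 2
SELECT become_lock_group_member(12345);
BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;  -- granted at once: group
                                        -- members' locks are treated
                                        -- as non-conflicting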

In addition to what's in these patches, I'd like to add a new chapter
to the documentation explaining which queries can be parallelized and
in what ways, what the restrictions are that keep parallel query from
getting used, and some high-level details of how parallelism "works"
in PostgreSQL from a user perspective.  Things will obviously change
here as we get more capabilities, but I think we're at a point where
it makes sense to start putting this together.  What I'm less clear
about is where exactly in the current SGML documentation such a new
chapter might fit; suggestions very welcome.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From fec950b2d1e1686defb950ce95763b107bd2f656 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Sat, 3 Oct 2015 13:34:35 -0400
Subject: [PATCH 1/3] Introduce group locking to prevent parallel processes
 from deadlocking.

For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes.  There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach.  Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.

Robert Haas, with help from Amit Kapila.
---
 src/backend/access/transam/parallel.c |  16 ++
 src/backend/storage/lmgr/README       |  63 ++++++++
 src/backend/storage/lmgr/deadlock.c   | 279 +++++++++++++++++++++++++++-------
 src/backend/storage/lmgr/lock.c       | 122 ++++++++++++---
 src/backend/storage/lmgr/proc.c       | 158 ++++++++++++++++++-
 src/include/storage/lock.h            |  13 +-
 src/include/storage/proc.h            |  12 ++
 7 files changed, 587 insertions(+), 76 deletions(-)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8eea092..bf2e691 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -432,6 +432,9 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 	if (pcxt->nworkers == 0)
 		return;
 
+	/* We need to be a lock group leader. */
+	BecomeLockGroupLeader();
+
 	/* If we do have workers, we'd better have a DSM segment. */
 	Assert(pcxt->seg != NULL);
 
@@ -952,6 +955,19 @@ ParallelWorkerMain(Datum main_arg)
 	 */
 
 	/*
+	 * Join locking group.  We must do this before anything that could try
+	 * to acquire a heavyweight lock, because any heavyweight locks acquired
+	 * to this point could block either directly against the parallel group
+	 * leader or against some process which in turn waits for a lock that
+	 * conflicts with the parallel group leader, causing an undetected
+	 * deadlock.  (If we can't join the lock group, the leader has gone away,
+	 * so just exit quietly.)
+	 */
+	if (!BecomeLockGroupMember(fps->parallel_master_pgproc,
+							   fps->parallel_master_pid))
+		return;
+
+	/*
 	 * Load libraries that were loaded by original backend.  We want to do
 	 * this before restoring GUCs, because the libraries might define custom
 	 * variables.
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 8898e25..cb9c7d6 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -586,6 +586,69 @@ The caller can then send a cancellation signal.  This implements the
 principle that autovacuum has a low locking priority (eg it must not block
 DDL on the table).
 
+Group Locking
+-------------
+
+As if all of that weren't already complicated enough, PostgreSQL now supports
+parallelism (see src/backend/access/transam/README.parallel), which means that
+we might need to resolve deadlocks that occur between gangs of related processes
+rather than individual processes.  This doesn't change the basic deadlock
+detection algorithm very much, but it makes the bookkeeping more complicated.
+
+We choose to regard locks held by processes in the same parallel group as
+non-conflicting.  This means that two processes in a parallel group can hold
+a self-exclusive lock on the same relation at the same time, or one process
+can acquire an AccessShareLock while the other already holds AccessExclusiveLock.
+This might seem dangerous and could be in some cases (more on that below), but
+if we didn't do this then parallel query would be extremely prone to
+self-deadlock.  For example, a parallel query against a relation on which the
+leader had already AccessExclusiveLock would hang, because the workers would
+try to lock the same relation and be blocked by the leader; yet the leader can't
+finish until it receives completion indications from all workers.  An undetected
+deadlock results.  This is far from the only scenario where such a problem
+happens.  The same thing will occur if the leader holds only AccessShareLock,
+the worker seeks AccessShareLock, but between the time the leader attempts to
+acquire the lock and the time the worker attempts to acquire it, some other
+process queues up waiting for an AccessExclusiveLock.  In this case, too, an
+indefinite hang results.
+
+It might seem that we could predict which locks the workers will attempt to
+acquire and ensure before going parallel that those locks would be acquired
+successfully.  But this is very difficult to make work in a general way.  For
+example, a parallel worker's portion of the query plan could involve an
+SQL-callable function which generates a query dynamically, and that query
+might happen to hit a table on which the leader happens to hold
+AccessExclusiveLock.  By imposing enough restrictions on what workers can do,
+we could eventually create a situation where their behavior can be adequately
+restricted, but these restrictions would be fairly onerous, and even then, the
+system required to decide whether the workers will succeed at acquiring the
+necessary locks would be complex and possibly buggy.
+
+So, instead, we take the approach of deciding that locks within a lock group
+do not conflict.  This eliminates the possibility of an undetected deadlock,
+but also opens up some problem cases: if the leader and worker try to do some
+operation at the same time which would ordinarily be prevented by the heavyweight
+lock mechanism, undefined behavior might result.  In practice, the dangers are
+modest.  The leader and worker share the same transaction, snapshot, and combo
+CID hash, and neither can perform any DDL or, indeed, write any data at all.
+Thus, for either to read a table locked exclusively by the other is safe enough.
+Problems would occur if the leader initiated parallelism from a point in the
+code at which it had some backend-private state that made table access from
+another process unsafe; for example, after calling SetReindexProcessing and
+before calling ResetReindexProcessing, catastrophe could ensue, because the
+worker won't have that state.  Similarly, problems could occur with certain
+kinds of non-relation locks, such as relation extension locks.  It's no safer
+for two related processes to extend the same relation at the same time than
+for unrelated processes to do the same.  However, since parallel mode is
+strictly read-only at present, neither this nor most of the similar cases can
+arise.  To allow parallel writes, we'll either need to (1) further enhance
+the deadlock detector to handle those types of locks in a different way than
+other types; or (2) have parallel workers use some other mutual exclusion
+method for such cases; or (3) revise those cases so that they no longer use
+heavyweight locking in the first place (which is not a crazy idea, given that
+such lock acquisitions are not expected to deadlock and that heavyweight lock
+acquisition is fairly slow anyway).
+
 User Locks (Advisory Locks)
 ---------------------------
 
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index a68aaf6..69f678b 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -38,6 +38,7 @@ typedef struct
 {
 	PGPROC	   *waiter;			/* the waiting process */
 	PGPROC	   *blocker;		/* the process it is waiting for */
+	LOCK	   *lock;			/* the lock it is waiting for */
 	int			pred;			/* workspace for TopoSort */
 	int			link;			/* workspace for TopoSort */
 } EDGE;
@@ -72,6 +73,9 @@ static bool FindLockCycle(PGPROC *checkProc,
 			  EDGE *softEdges, int *nSoftEdges);
 static bool FindLockCycleRecurse(PGPROC *checkProc, int depth,
 					 EDGE *softEdges, int *nSoftEdges);
+static bool FindLockCycleRecurseMember(PGPROC *checkProc,
+						   PGPROC *checkProcLeader,
+						   int depth, EDGE *softEdges, int *nSoftEdges);
 static bool ExpandConstraints(EDGE *constraints, int nConstraints);
 static bool TopoSort(LOCK *lock, EDGE *constraints, int nConstraints,
 		 PGPROC **ordering);
@@ -449,18 +453,15 @@ FindLockCycleRecurse(PGPROC *checkProc,
 					 EDGE *softEdges,	/* output argument */
 					 int *nSoftEdges)	/* output argument */
 {
-	PGPROC	   *proc;
-	PGXACT	   *pgxact;
-	LOCK	   *lock;
-	PROCLOCK   *proclock;
-	SHM_QUEUE  *procLocks;
-	LockMethod	lockMethodTable;
-	PROC_QUEUE *waitQueue;
-	int			queue_size;
-	int			conflictMask;
 	int			i;
-	int			numLockModes,
-				lm;
+	dlist_iter	iter;
+
+	/*
+	 * If this process is a lock group member, check the leader instead. (Note
+	 * that we might be the leader, in which case this is a no-op.)
+	 */
+	if (checkProc->lockGroupLeader != NULL)
+		checkProc = checkProc->lockGroupLeader;
 
 	/*
 	 * Have we already seen this proc?
@@ -494,13 +495,57 @@ FindLockCycleRecurse(PGPROC *checkProc,
 	visitedProcs[nVisitedProcs++] = checkProc;
 
 	/*
-	 * If the proc is not waiting, we have no outgoing waits-for edges.
+	 * If the process is waiting, there is an outgoing waits-for edge to each
+	 * process that blocks it.
+	 */
+	if (checkProc->links.next != NULL && checkProc->waitLock != NULL &&
+		FindLockCycleRecurseMember(checkProc, checkProc, depth, softEdges,
+								   nSoftEdges))
+		return true;
+
+	/*
+	 * If the process is not waiting, there could still be outgoing waits-for
+	 * edges if it is part of a lock group, because other members of the lock
+	 * group might be waiting even though this process is not.  (Given lock
+	 * groups {A1, A2} and {B1, B2}, if A1 waits for B1 and B2 waits for A2,
+	 * that is a deadlock even though neither B1 nor A2 is waiting for anything.)
 	 */
-	if (checkProc->links.next == NULL)
-		return false;
-	lock = checkProc->waitLock;
-	if (lock == NULL)
-		return false;
+	dlist_foreach(iter, &checkProc->lockGroupMembers)
+	{
+		PGPROC	   *memberProc;
+
+		memberProc = dlist_container(PGPROC, lockGroupLink, iter.cur);
+
+		if (memberProc->links.next != NULL && memberProc->waitLock != NULL &&
+			memberProc != checkProc &&
+		  FindLockCycleRecurseMember(memberProc, checkProc, depth, softEdges,
+									 nSoftEdges))
+			return true;
+	}
+
+	return false;
+}
+
+static bool
+FindLockCycleRecurseMember(PGPROC *checkProc,
+						   PGPROC *checkProcLeader,
+						   int depth,
+						   EDGE *softEdges,		/* output argument */
+						   int *nSoftEdges)		/* output argument */
+{
+	PGPROC	   *proc;
+	LOCK	   *lock = checkProc->waitLock;
+	PGXACT	   *pgxact;
+	PROCLOCK   *proclock;
+	SHM_QUEUE  *procLocks;
+	LockMethod	lockMethodTable;
+	PROC_QUEUE *waitQueue;
+	int			queue_size;
+	int			conflictMask;
+	int			i;
+	int			numLockModes,
+				lm;
+
 	lockMethodTable = GetLocksMethodTable(lock);
 	numLockModes = lockMethodTable->numLockModes;
 	conflictMask = lockMethodTable->conflictTab[checkProc->waitLockMode];
@@ -516,11 +561,14 @@ FindLockCycleRecurse(PGPROC *checkProc,
 
 	while (proclock)
 	{
+		PGPROC	   *leader;
+
 		proc = proclock->tag.myProc;
 		pgxact = &ProcGlobal->allPgXact[proc->pgprocno];
+		leader = proc->lockGroupLeader == NULL ? proc : proc->lockGroupLeader;
 
-		/* A proc never blocks itself */
-		if (proc != checkProc)
+		/* A proc never blocks itself or any other lock group member */
+		if (leader != checkProcLeader)
 		{
 			for (lm = 1; lm <= numLockModes; lm++)
 			{
@@ -601,10 +649,20 @@ FindLockCycleRecurse(PGPROC *checkProc,
 
 		for (i = 0; i < queue_size; i++)
 		{
+			PGPROC	   *leader;
+
 			proc = procs[i];
+			leader = proc->lockGroupLeader == NULL ? proc :
+				proc->lockGroupLeader;
 
-			/* Done when we reach the target proc */
-			if (proc == checkProc)
+			/*
+			 * TopoSort will always return an ordering with group members
+			 * adjacent to each other in the wait queue (see comments
+			 * therein). So, as soon as we reach a process in the same lock
+			 * group as checkProc, we know we've found all the conflicts that
+			 * precede any member of the lock group led by checkProcLeader.
+			 */
+			if (leader == checkProcLeader)
 				break;
 
 			/* Is there a conflict with this guy's request? */
@@ -625,8 +683,9 @@ FindLockCycleRecurse(PGPROC *checkProc,
 					 * Add this edge to the list of soft edges in the cycle
 					 */
 					Assert(*nSoftEdges < MaxBackends);
-					softEdges[*nSoftEdges].waiter = checkProc;
-					softEdges[*nSoftEdges].blocker = proc;
+					softEdges[*nSoftEdges].waiter = checkProcLeader;
+					softEdges[*nSoftEdges].blocker = leader;
+					softEdges[*nSoftEdges].lock = lock;
 					(*nSoftEdges)++;
 					return true;
 				}
@@ -635,20 +694,52 @@ FindLockCycleRecurse(PGPROC *checkProc,
 	}
 	else
 	{
+		PGPROC	   *lastGroupMember = NULL;
+
 		/* Use the true lock wait queue order */
 		waitQueue = &(lock->waitProcs);
-		queue_size = waitQueue->size;
 
-		proc = (PGPROC *) waitQueue->links.next;
+		/*
+		 * Find the last member of the lock group that is present in the wait
+		 * queue.  Anything after this is not a soft lock conflict. If group
+		 * locking is not in use, then we know immediately which process we're
+		 * looking for, but otherwise we've got to search the wait queue to
+		 * find the last process actually present.
+		 */
+		if (checkProc->lockGroupLeader == NULL)
+			lastGroupMember = checkProc;
+		else
+		{
+			proc = (PGPROC *) waitQueue->links.next;
+			queue_size = waitQueue->size;
+			while (queue_size-- > 0)
+			{
+				if (proc->lockGroupLeader == checkProcLeader)
+					lastGroupMember = proc;
+				proc = (PGPROC *) proc->links.next;
+			}
+			Assert(lastGroupMember != NULL);
+		}
 
+		/*
+		 * OK, now rescan (or scan) the queue to identify the soft conflicts.
+		 */
+		queue_size = waitQueue->size;
+		proc = (PGPROC *) waitQueue->links.next;
 		while (queue_size-- > 0)
 		{
+			PGPROC	   *leader;
+
+			leader = proc->lockGroupLeader == NULL ? proc :
+				proc->lockGroupLeader;
+
 			/* Done when we reach the target proc */
-			if (proc == checkProc)
+			if (proc == lastGroupMember)
 				break;
 
 			/* Is there a conflict with this guy's request? */
-			if ((LOCKBIT_ON(proc->waitLockMode) & conflictMask) != 0)
+			if ((LOCKBIT_ON(proc->waitLockMode) & conflictMask) != 0 &&
+				leader != checkProcLeader)
 			{
 				/* This proc soft-blocks checkProc */
 				if (FindLockCycleRecurse(proc, depth + 1,
@@ -665,8 +756,9 @@ FindLockCycleRecurse(PGPROC *checkProc,
 					 * Add this edge to the list of soft edges in the cycle
 					 */
 					Assert(*nSoftEdges < MaxBackends);
-					softEdges[*nSoftEdges].waiter = checkProc;
-					softEdges[*nSoftEdges].blocker = proc;
+					softEdges[*nSoftEdges].waiter = checkProcLeader;
+					softEdges[*nSoftEdges].blocker = leader;
+					softEdges[*nSoftEdges].lock = lock;
 					(*nSoftEdges)++;
 					return true;
 				}
@@ -711,8 +803,7 @@ ExpandConstraints(EDGE *constraints,
 	 */
 	for (i = nConstraints; --i >= 0;)
 	{
-		PGPROC	   *proc = constraints[i].waiter;
-		LOCK	   *lock = proc->waitLock;
+		LOCK	   *lock = constraints[i].lock;
 
 		/* Did we already make a list for this lock? */
 		for (j = nWaitOrders; --j >= 0;)
@@ -778,7 +869,9 @@ TopoSort(LOCK *lock,
 	PGPROC	   *proc;
 	int			i,
 				j,
+				jj,
 				k,
+				kk,
 				last;
 
 	/* First, fill topoProcs[] array with the procs in their current order */
@@ -798,41 +891,95 @@ TopoSort(LOCK *lock,
 	 * stores its list link in constraints[i].link (note any constraint will
 	 * be in just one list). The array index for the before-proc of the i'th
 	 * constraint is remembered in constraints[i].pred.
+	 *
+	 * Note that it's not necessarily the case that every constraint affects
+	 * this particular wait queue.  Prior to group locking, a process could be
+	 * waiting for at most one lock.  But a lock group can be waiting for
+	 * zero, one, or multiple locks.  Since topoProcs[] is an array of the
+	 * processes actually waiting, while constraints[] is an array of group
+	 * leaders, we've got to scan through topoProcs[] for each constraint,
+	 * checking whether both a waiter and a blocker for that group are
+	 * present.  If so, the constraint is relevant to this wait queue; if not,
+	 * it isn't.
 	 */
 	MemSet(beforeConstraints, 0, queue_size * sizeof(int));
 	MemSet(afterConstraints, 0, queue_size * sizeof(int));
 	for (i = 0; i < nConstraints; i++)
 	{
+		/*
+		 * Find a representative process that is on the lock queue and part of
+		 * the waiting lock group.  This may or may not be the leader, which
+		 * may or may not be waiting at all.  If there are any other processes
+		 * in the same lock group on the queue, set their number of
+		 * beforeConstraints to -1 to indicate that they should be emitted
+		 * with their groupmates rather than considered separately.
+		 */
 		proc = constraints[i].waiter;
-		/* Ignore constraint if not for this lock */
-		if (proc->waitLock != lock)
-			continue;
-		/* Find the waiter proc in the array */
+		Assert(proc != NULL);
+		jj = -1;
 		for (j = queue_size; --j >= 0;)
 		{
-			if (topoProcs[j] == proc)
+			PGPROC	   *waiter = topoProcs[j];
+
+			if (waiter == proc || waiter->lockGroupLeader == proc)
+			{
+				Assert(waiter->waitLock == lock);
+				if (jj == -1)
+					jj = j;
+				else
+				{
+					Assert(beforeConstraints[j] <= 0);
+					beforeConstraints[j] = -1;
+				}
 				break;
+			}
 		}
-		Assert(j >= 0);			/* should have found a match */
-		/* Find the blocker proc in the array */
+
+		/* If no matching waiter, constraint is not relevant to this lock. */
+		if (jj < 0)
+			continue;
+
+		/*
+		 * Similarly, find a representative process that is on the lock queue
+		 * and waiting for the blocking lock group.  Again, this could be the
+		 * leader but does not need to be.
+		 */
 		proc = constraints[i].blocker;
+		Assert(proc != NULL);
+		kk = -1;
 		for (k = queue_size; --k >= 0;)
 		{
-			if (topoProcs[k] == proc)
-				break;
+			PGPROC	   *blocker = topoProcs[k];
+
+			if (blocker == proc || blocker->lockGroupLeader == proc)
+			{
+				Assert(blocker->waitLock == lock);
+				if (kk == -1)
+					kk = k;
+				else
+				{
+					Assert(beforeConstraints[k] <= 0);
+					beforeConstraints[k] = -1;
+				}
+			}
 		}
-		Assert(k >= 0);			/* should have found a match */
-		beforeConstraints[j]++; /* waiter must come before */
+
+		/* If no matching blocker, constraint is not relevant to this lock. */
+		if (kk < 0)
+			continue;
+
+		beforeConstraints[jj]++;	/* waiter must come before */
 		/* add this constraint to list of after-constraints for blocker */
-		constraints[i].pred = j;
-		constraints[i].link = afterConstraints[k];
-		afterConstraints[k] = i + 1;
+		constraints[i].pred = jj;
+		constraints[i].link = afterConstraints[kk];
+		afterConstraints[kk] = i + 1;
 	}
+
 	/*--------------------
 	 * Now scan the topoProcs array backwards.  At each step, output the
-	 * last proc that has no remaining before-constraints, and decrease
-	 * the beforeConstraints count of each of the procs it was constrained
-	 * against.
+	 * last proc that has no remaining before-constraints plus any other
+	 * members of the same lock group; then decrease the beforeConstraints
+	 * count of each of the procs it was constrained against.
 	 * i = index of ordering[] entry we want to output this time
 	 * j = search index for topoProcs[]
 	 * k = temp for scanning constraint list for proc j
@@ -840,8 +987,11 @@ TopoSort(LOCK *lock,
 	 *--------------------
 	 */
 	last = queue_size - 1;
-	for (i = queue_size; --i >= 0;)
+	for (i = queue_size - 1; i >= 0;)
 	{
+		int			c;
+		int			nmatches = 0;
+
 		/* Find next candidate to output */
 		while (topoProcs[last] == NULL)
 			last--;
@@ -850,12 +1000,37 @@ TopoSort(LOCK *lock,
 			if (topoProcs[j] != NULL && beforeConstraints[j] == 0)
 				break;
 		}
+
 		/* If no available candidate, topological sort fails */
 		if (j < 0)
 			return false;
-		/* Output candidate, and mark it done by zeroing topoProcs[] entry */
-		ordering[i] = topoProcs[j];
-		topoProcs[j] = NULL;
+
+		/*
+		 * Output everything in the lock group.  There's no point in outputting
+		 * an ordering where members of the same lock group are not
+		 * consecutive on the wait queue: if some other waiter is between two
+		 * requests that belong to the same group, then either it conflicts
+		 * with both of them and is certainly not a solution; or it conflicts
+		 * with at most one of them and is thus isomorphic to an ordering
+		 * where the group members are consecutive.
+		 */
+		proc = topoProcs[j];
+		if (proc->lockGroupLeader != NULL)
+			proc = proc->lockGroupLeader;
+		Assert(proc != NULL);
+		for (c = 0; c <= last; ++c)
+		{
+			if (topoProcs[c] == proc || (topoProcs[c] != NULL &&
+									  topoProcs[c]->lockGroupLeader == proc))
+			{
+				ordering[i - nmatches] = topoProcs[c];
+				topoProcs[c] = NULL;
+				++nmatches;
+			}
+		}
+		Assert(nmatches > 0);
+		i -= nmatches;
+
 		/* Update beforeConstraints counts of its predecessors */
 		for (k = afterConstraints[j]; k > 0; k = constraints[k - 1].link)
 			beforeConstraints[constraints[k - 1].pred]--;
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 269fe14..e3e9599 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -35,6 +35,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
+#include "access/xact.h"
 #include "access/xlog.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -1136,6 +1137,18 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 	{
 		uint32		partition = LockHashPartition(hashcode);
 
+		/*
+		 * It might seem unsafe to access proclock->groupLeader without a lock,
+		 * but it's not really.  Either we are initializing a proclock on our
+		 * own behalf, in which case our group leader isn't changing because
+		 * the group leader for a process can only ever be changed by the
+		 * process itself; or else we are transferring a fast-path lock to the
+		 * main lock table, in which case that process can't change its lock
+		 * group leader without first releasing all of its locks (and in
+		 * particular the one we are currently transferring).
+		 */
+		proclock->groupLeader = proc->lockGroupLeader != NULL ?
+			proc->lockGroupLeader : proc;
 		proclock->holdMask = 0;
 		proclock->releaseMask = 0;
 		/* Add proclock to appropriate lists */
@@ -1255,9 +1268,10 @@ RemoveLocalLock(LOCALLOCK *locallock)
  * NOTES:
  *		Here's what makes this complicated: one process's locks don't
  * conflict with one another, no matter what purpose they are held for
- * (eg, session and transaction locks do not conflict).
- * So, we must subtract off our own locks when determining whether the
- * requested new lock conflicts with those already held.
+ * (eg, session and transaction locks do not conflict).  Nor do the locks
+ * of one process in a lock group conflict with those of another process in
+ * the same group.  So, we must subtract off these locks when determining
+ * whether the requested new lock conflicts with those already held.
  */
 int
 LockCheckConflicts(LockMethod lockMethodTable,
@@ -1267,8 +1281,12 @@ LockCheckConflicts(LockMethod lockMethodTable,
 {
 	int			numLockModes = lockMethodTable->numLockModes;
 	LOCKMASK	myLocks;
-	LOCKMASK	otherLocks;
+	int			conflictMask = lockMethodTable->conflictTab[lockmode];
+	int			conflictsRemaining[MAX_LOCKMODES];
+	int			totalConflictsRemaining = 0;
 	int			i;
+	SHM_QUEUE  *procLocks;
+	PROCLOCK   *otherproclock;
 
 	/*
 	 * first check for global conflicts: If no locks conflict with my request,
@@ -1279,40 +1297,91 @@ LockCheckConflicts(LockMethod lockMethodTable,
 	 * type of lock that conflicts with request.   Bitwise compare tells if
 	 * there is a conflict.
 	 */
-	if (!(lockMethodTable->conflictTab[lockmode] & lock->grantMask))
+	if (!(conflictMask & lock->grantMask))
 	{
 		PROCLOCK_PRINT("LockCheckConflicts: no conflict", proclock);
 		return STATUS_OK;
 	}
 
 	/*
-	 * Rats.  Something conflicts.  But it could still be my own lock. We have
-	 * to construct a conflict mask that does not reflect our own locks, but
-	 * only lock types held by other processes.
+	 * Rats.  Something conflicts.  But it could still be my own lock, or
+	 * a lock held by another member of my locking group.  First, figure out
+	 * how many conflicts remain after subtracting out any locks I hold
+	 * myself.
 	 */
 	myLocks = proclock->holdMask;
-	otherLocks = 0;
 	for (i = 1; i <= numLockModes; i++)
 	{
-		int			myHolding = (myLocks & LOCKBIT_ON(i)) ? 1 : 0;
+		if ((conflictMask & LOCKBIT_ON(i)) == 0)
+		{
+			conflictsRemaining[i] = 0;
+			continue;
+		}
+		conflictsRemaining[i] = lock->granted[i];
+		if (myLocks & LOCKBIT_ON(i))
+			--conflictsRemaining[i];
+		totalConflictsRemaining += conflictsRemaining[i];
+	}
 
-		if (lock->granted[i] > myHolding)
-			otherLocks |= LOCKBIT_ON(i);
+	/* If no conflicts remain, we get the lock. */
+	if (totalConflictsRemaining == 0)
+	{
+		PROCLOCK_PRINT("LockCheckConflicts: resolved (simple)", proclock);
+		return STATUS_OK;
+	}
+
+	/* If no group locking, it's definitely a conflict. */
+	if (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)
+	{
+		Assert(proclock->tag.myProc == MyProc);
+		PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
+					   proclock);
+		return STATUS_FOUND;
 	}
 
 	/*
-	 * now check again for conflicts.  'otherLocks' describes the types of
-	 * locks held by other processes.  If one of these conflicts with the kind
-	 * of lock that I want, there is a conflict and I have to sleep.
+	 * Locks held in conflicting modes by members of our own lock group are
+	 * not real conflicts; we can subtract those out and see if we still have
+	 * a conflict.  This is O(N) in the number of processes holding or awaiting
+	 * locks on this object.  We could improve that by making the shared memory
+	 * state more complex (and larger) but it doesn't seem worth it.
 	 */
-	if (!(lockMethodTable->conflictTab[lockmode] & otherLocks))
+	procLocks = &(lock->procLocks);
+	otherproclock = (PROCLOCK *)
+		SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+	while (otherproclock != NULL)
 	{
-		/* no conflict. OK to get the lock */
-		PROCLOCK_PRINT("LockCheckConflicts: resolved", proclock);
-		return STATUS_OK;
+		if (proclock != otherproclock &&
+			proclock->groupLeader == otherproclock->groupLeader &&
+			(otherproclock->holdMask & conflictMask) != 0)
+		{
+			int	intersectMask = otherproclock->holdMask & conflictMask;
+
+			for (i = 1; i <= numLockModes; i++)
+			{
+				if ((intersectMask & LOCKBIT_ON(i)) != 0)
+				{
+					if (conflictsRemaining[i] <= 0)
+						elog(PANIC, "proclocks held do not match lock");
+					conflictsRemaining[i]--;
+					totalConflictsRemaining--;
+				}
+			}
+
+			if (totalConflictsRemaining == 0)
+			{
+				PROCLOCK_PRINT("LockCheckConflicts: resolved (group)",
+							   proclock);
+				return STATUS_OK;
+			}
+		}
+		otherproclock = (PROCLOCK *)
+			SHMQueueNext(procLocks, &otherproclock->lockLink,
+						 offsetof(PROCLOCK, lockLink));
 	}
 
-	PROCLOCK_PRINT("LockCheckConflicts: conflicting", proclock);
+	/* Nope, it's a real conflict. */
+	PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)", proclock);
 	return STATUS_FOUND;
 }
 
@@ -3095,6 +3164,10 @@ PostPrepare_Locks(TransactionId xid)
 	PROCLOCKTAG proclocktag;
 	int			partition;
 
+	/* Can't prepare a lock group follower. */
+	Assert(MyProc->lockGroupLeader == NULL ||
+		   MyProc->lockGroupLeader == MyProc);
+
 	/* This is a critical section: any error means big trouble */
 	START_CRIT_SECTION();
 
@@ -3239,6 +3312,13 @@ PostPrepare_Locks(TransactionId xid)
 			proclocktag.myProc = newproc;
 
 			/*
+			 * Update groupLeader pointer to point to the new proc.  (We'd
+			 * better not be a member of somebody else's lock group!)
+			 */
+			Assert(proclock->groupLeader == proclock->tag.myProc);
+			proclock->groupLeader = newproc;
+
+			/*
 			 * Update the proclock.  We should not find any existing entry for
 			 * the same hash key, since there can be only one entry for any
 			 * given lock with my own proc.
@@ -3785,6 +3865,8 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 	 */
 	if (!found)
 	{
+		Assert(proc->lockGroupLeader == NULL);
+		proclock->groupLeader = proc;
 		proclock->holdMask = 0;
 		proclock->releaseMask = 0;
 		/* Add proclock to appropriate lists */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 3690753..084be5a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -263,6 +263,9 @@ InitProcGlobal(void)
 		/* Initialize myProcLocks[] shared memory queues. */
 		for (j = 0; j < NUM_LOCK_PARTITIONS; j++)
 			SHMQueueInit(&(procs[i].myProcLocks[j]));
+
+		/* Initialize lockGroupMembers list. */
+		dlist_init(&procs[i].lockGroupMembers);
 	}
 
 	/*
@@ -397,6 +400,11 @@ InitProcess(void)
 	MyProc->backendLatestXid = InvalidTransactionId;
 	pg_atomic_init_u32(&MyProc->nextClearXidElem, INVALID_PGPROCNO);
 
+	/* Check that group locking fields are in a proper initial state. */
+	Assert(MyProc->lockGroupLeaderIdentifier == 0);
+	Assert(MyProc->lockGroupLeader == NULL);
+	Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+
 	/*
 	 * Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
 	 * on it.  That allows us to repoint the process latch, which so far
@@ -556,6 +564,11 @@ InitAuxiliaryProcess(void)
 	OwnLatch(&MyProc->procLatch);
 	SwitchToSharedLatch();
 
+	/* Check that group locking fields are in a proper initial state. */
+	Assert(MyProc->lockGroupLeaderIdentifier == 0);
+	Assert(MyProc->lockGroupLeader == NULL);
+	Assert(dlist_is_empty(&MyProc->lockGroupMembers));
+
 	/*
 	 * We might be reusing a semaphore that belonged to a failed process. So
 	 * be careful and reinitialize its value here.  (This is not strictly
@@ -794,6 +807,40 @@ ProcKill(int code, Datum arg)
 		ReplicationSlotRelease();
 
 	/*
+	 * Detach from any lock group of which we are a member.  If the leader
+	 * exits before all other group members, its PGPROC will remain allocated
+	 * until the last group process exits; that process must return the
+	 * leader's PGPROC to the appropriate list.
+	 */
+	if (MyProc->lockGroupLeader != NULL)
+	{
+		PGPROC	   *leader = MyProc->lockGroupLeader;
+		LWLock	   *leader_lwlock = LockHashPartitionLockByProc(leader);
+
+		LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+		Assert(!dlist_is_empty(&leader->lockGroupMembers));
+		dlist_delete(&MyProc->lockGroupLink);
+		if (dlist_is_empty(&leader->lockGroupMembers))
+		{
+			leader->lockGroupLeaderIdentifier = 0;
+			leader->lockGroupLeader = NULL;
+			if (leader != MyProc)
+			{
+				procgloballist = leader->procgloballist;
+
+				/* Leader exited first; return its PGPROC. */
+				SpinLockAcquire(ProcStructLock);
+				leader->links.next = (SHM_QUEUE *) *procgloballist;
+				*procgloballist = leader;
+				SpinLockRelease(ProcStructLock);
+			}
+		}
+		else if (leader != MyProc)
+			MyProc->lockGroupLeader = NULL;
+		LWLockRelease(leader_lwlock);
+	}
+
+	/*
 	 * Reset MyLatch to the process local one.  This is so that signal
 	 * handlers et al can continue using the latch after the shared latch
 	 * isn't ours anymore. After that clear MyProc and disown the shared
@@ -807,9 +854,20 @@ ProcKill(int code, Datum arg)
 	procgloballist = proc->procgloballist;
 	SpinLockAcquire(ProcStructLock);
 
-	/* Return PGPROC structure (and semaphore) to appropriate freelist */
-	proc->links.next = (SHM_QUEUE *) *procgloballist;
-	*procgloballist = proc;
+	/*
+	 * If we're still a member of a locking group, that means we're a leader
+	 * which has somehow exited before its children.  The last remaining child
+	 * will release our PGPROC.  Otherwise, release it now.
+	 */
+	if (proc->lockGroupLeader == NULL)
+	{
+		/* Since lockGroupLeader is NULL, lockGroupMembers should be empty. */
+		Assert(dlist_is_empty(&proc->lockGroupMembers));
+
+		/* Return PGPROC structure (and semaphore) to appropriate freelist */
+		proc->links.next = (SHM_QUEUE *) *procgloballist;
+		*procgloballist = proc;
+	}
 
 	/* Update shared estimate of spins_per_delay */
 	ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
@@ -942,9 +1000,31 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 	bool		allow_autovacuum_cancel = true;
 	int			myWaitStatus;
 	PGPROC	   *proc;
+	PGPROC	   *leader = MyProc->lockGroupLeader;
 	int			i;
 
 	/*
+	 * If group locking is in use, locks held by members of my locking group
+	 * need to be included in myHeldLocks.
+	 */
+	if (leader != NULL)
+	{
+		SHM_QUEUE  *procLocks = &(lock->procLocks);
+		PROCLOCK   *otherproclock;
+
+		otherproclock = (PROCLOCK *)
+			SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
+		while (otherproclock != NULL)
+		{
+			if (otherproclock->groupLeader == leader)
+				myHeldLocks |= otherproclock->holdMask;
+			otherproclock = (PROCLOCK *)
+				SHMQueueNext(procLocks, &otherproclock->lockLink,
+							 offsetof(PROCLOCK, lockLink));
+		}
+	}
+
+	/*
 	 * Determine where to add myself in the wait queue.
 	 *
 	 * Normally I should go at the end of the queue.  However, if I already
@@ -968,6 +1048,15 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		proc = (PGPROC *) waitQueue->links.next;
 		for (i = 0; i < waitQueue->size; i++)
 		{
+			/*
+			 * If we're part of the same locking group as this waiter, its
+			 * locks neither conflict with ours nor contribute to aheadRequests.
+			 */
+			if (leader != NULL && leader == proc->lockGroupLeader)
+			{
+				proc = (PGPROC *) proc->links.next;
+				continue;
+			}
 			/* Must he wait for me? */
 			if (lockMethodTable->conflictTab[proc->waitLockMode] & myHeldLocks)
 			{
@@ -1658,3 +1747,66 @@ ProcSendSignal(int pid)
 		SetLatch(&proc->procLatch);
 	}
 }
+
+/*
+ * BecomeLockGroupLeader - designate process as lock group leader
+ *
+ * Once this function has returned, other processes can join the lock group
+ * by calling BecomeLockGroupMember.
+ */
+void
+BecomeLockGroupLeader(void)
+{
+	LWLock	   *leader_lwlock;
+
+	/* If we already did it, we don't need to do it again. */
+	if (MyProc->lockGroupLeader == MyProc)
+		return;
+
+	/* We had better not be a follower. */
+	Assert(MyProc->lockGroupLeader == NULL);
+
+	/* Create single-member group, containing only ourselves. */
+	leader_lwlock = LockHashPartitionLockByProc(MyProc);
+	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+	MyProc->lockGroupLeader = MyProc;
+	MyProc->lockGroupLeaderIdentifier = MyProcPid;
+	dlist_push_head(&MyProc->lockGroupMembers, &MyProc->lockGroupLink);
+	LWLockRelease(leader_lwlock);
+}
+
+/*
+ * BecomeLockGroupMember - designate process as lock group member
+ *
+ * This is pretty straightforward except for the possibility that the leader
+ * whose group we're trying to join might exit before we manage to do so;
+ * and the PGPROC might get recycled for an unrelated process.  To avoid
+ * that, we require the caller to pass the PID of the intended PGPROC as
+ * an interlock.  Returns true if we successfully join the intended lock
+ * group, and false if not.
+ */
+bool
+BecomeLockGroupMember(PGPROC *leader, int pid)
+{
+	LWLock	   *leader_lwlock;
+	bool		ok = false;
+
+	/* Group leader can't become member of group */
+	Assert(MyProc != leader);
+
+	/* PID must be valid. */
+	Assert(pid != 0);
+
+	/* Try to join the group; the group state is guarded by the leader's lock. */
+	leader_lwlock = LockHashPartitionLockByProc(leader);
+	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
+	if (leader->lockGroupLeaderIdentifier == pid)
+	{
+		ok = true;
+		MyProc->lockGroupLeader = leader;
+		dlist_push_tail(&leader->lockGroupMembers, &MyProc->lockGroupLink);
+	}
+	LWLockRelease(leader_lwlock);
+
+	return ok;
+}
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 43eca86..6b4e365 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -346,6 +346,7 @@ typedef struct PROCLOCK
 	PROCLOCKTAG tag;			/* unique identifier of proclock object */
 
 	/* data */
+	PGPROC	   *groupLeader;	/* group leader, or NULL if no lock group */
 	LOCKMASK	holdMask;		/* bitmask for lock types currently held */
 	LOCKMASK	releaseMask;	/* bitmask for lock types to be released */
 	SHM_QUEUE	lockLink;		/* list link in LOCK's list of proclocks */
@@ -457,7 +458,6 @@ typedef enum
 								 * worker */
 } DeadLockState;
 
-
 /*
  * The lockmgr's shared hash tables are partitioned to reduce contention.
  * To determine which partition a given locktag belongs to, compute the tag's
@@ -473,6 +473,17 @@ typedef enum
 	(&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + (i)].lock)
 
 /*
+ * The deadlock detector needs to be able to access lockGroupLeader and
+ * related fields in the PGPROC, so we arrange for those fields to be protected
+ * by one of the lock hash partition locks.  Since the deadlock detector
+ * acquires all such locks anyway, this makes it safe for it to access these
+ * fields without doing anything extra.  To avoid contention as much as
+ * possible, we map different PGPROCs to different partition locks.
+ */
+#define LockHashPartitionLockByProc(p) \
+	LockHashPartitionLock((p)->pgprocno)
+
+/*
  * function prototypes
  */
 extern void InitLocks(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3441288..66ab255 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -155,6 +155,15 @@ struct PGPROC
 	bool		fpVXIDLock;		/* are we holding a fast-path VXID lock? */
 	LocalTransactionId fpLocalTransactionId;	/* lxid for fast-path VXID
 												 * lock */
+
+	/*
+	 * Support for lock groups.  Use LockHashPartitionLockByProc to get the
+	 * LWLock protecting these fields.
+	 */
+	int			lockGroupLeaderIdentifier;	/* MyProcPid, if I'm a leader */
+	PGPROC	   *lockGroupLeader;	/* lock group leader, if I'm a follower */
+	dlist_head	lockGroupMembers;	/* list of members, if I'm a leader */
+	dlist_node  lockGroupLink;		/* my member link, if I'm a member */
 };
 
 /* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
@@ -272,4 +281,7 @@ extern void LockErrorCleanup(void);
 extern void ProcWaitForSignal(void);
 extern void ProcSendSignal(int pid);
 
+extern void BecomeLockGroupLeader(void);
+extern bool BecomeLockGroupMember(PGPROC *leader, int pid);
+
 #endif   /* PROC_H */
-- 
2.5.4 (Apple Git-61)

From b101d27611dd42109f11b09ab3ba65dba91e6341 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 21 Jan 2016 14:33:07 -0500
Subject: [PATCH 2/3] contrib module test_group_deadlocks, not for commit.

Amit Kapila
---
 contrib/Makefile                                   |  1 +
 contrib/test_group_deadlocks/Makefile              | 19 ++++++++
 .../test_group_deadlocks--1.0.sql                  | 15 ++++++
 .../test_group_deadlocks/test_group_deadlocks.c    | 57 ++++++++++++++++++++++
 .../test_group_deadlocks.control                   |  5 ++
 5 files changed, 97 insertions(+)
 create mode 100644 contrib/test_group_deadlocks/Makefile
 create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
 create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.c
 create mode 100644 contrib/test_group_deadlocks/test_group_deadlocks.control

diff --git a/contrib/Makefile b/contrib/Makefile
index bd251f6..ff3c54d 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -43,6 +43,7 @@ SUBDIRS = \
 		tablefunc	\
 		tcn		\
 		test_decoding	\
+		test_group_deadlocks	\
 		tsm_system_rows \
 		tsm_system_time \
 		tsearch2	\
diff --git a/contrib/test_group_deadlocks/Makefile b/contrib/test_group_deadlocks/Makefile
new file mode 100644
index 0000000..057448c
--- /dev/null
+++ b/contrib/test_group_deadlocks/Makefile
@@ -0,0 +1,19 @@
+# contrib/test_group_deadlocks/Makefile
+
+MODULE_big = test_group_deadlocks
+OBJS = test_group_deadlocks.o $(WIN32RES)
+
+EXTENSION = test_group_deadlocks
+DATA = test_group_deadlocks--1.0.sql
+PGFILEDESC = "test_group_deadlocks - participate in group locking"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_group_deadlocks
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
new file mode 100644
index 0000000..377c363
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql
@@ -0,0 +1,15 @@
+/* contrib/test_group_deadlocks/test_group_deadlocks--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_group_deadlocks" to load this file. \quit
+
+-- Register the function.
+CREATE FUNCTION become_lock_group_leader()
+RETURNS pg_catalog.void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION become_lock_group_member(pid pg_catalog.int4)
+RETURNS pg_catalog.bool
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.c b/contrib/test_group_deadlocks/test_group_deadlocks.c
new file mode 100644
index 0000000..f3d980a
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.c
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_group_deadlocks.c
+ *		  group locking utilities
+ *
+ * Copyright (c) 2010-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  contrib/test_group_deadlocks/test_group_deadlocks.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(become_lock_group_leader);
+PG_FUNCTION_INFO_V1(become_lock_group_member);
+
+
+/*
+ * become_lock_group_leader
+ *
+ * This function makes the current backend process a lock group
+ * leader.
+ */
+Datum
+become_lock_group_leader(PG_FUNCTION_ARGS)
+{
+	BecomeLockGroupLeader();
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * become_lock_group_member
+ *
+ * This function makes the current backend process a member of the
+ * lock group owned by the process whose PID is passed as the first
+ * argument.
+ */
+Datum
+become_lock_group_member(PG_FUNCTION_ARGS)
+{
+	bool		member;
+	PGPROC		*procleader;
+	int32		pid = PG_GETARG_INT32(0);
+
+	procleader = BackendPidGetProc(pid);
+	member = BecomeLockGroupMember(procleader, pid);
+
+	PG_RETURN_BOOL(member);
+}
diff --git a/contrib/test_group_deadlocks/test_group_deadlocks.control b/contrib/test_group_deadlocks/test_group_deadlocks.control
new file mode 100644
index 0000000..e2dcc71
--- /dev/null
+++ b/contrib/test_group_deadlocks/test_group_deadlocks.control
@@ -0,0 +1,5 @@
+# test_group_deadlocks extension
+comment = 'become part of group'
+default_version = '1.0'
+module_pathname = '$libdir/test_group_deadlocks'
+relocatable = true
-- 
2.5.4 (Apple Git-61)

From c6b2249ce16f278287dcee0710ca469c271c5cab Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 30 Sep 2015 18:35:40 -0400
Subject: [PATCH 3/3] Introduce a new GUC force_parallel_mode for testing
 purposes.

When force_parallel_mode = on, we enable the parallel mode restrictions
for all queries for which this is believed to be safe.  For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.

Robert Haas, with help from Amit Kapila and Rushabh Lathia.
---
 doc/src/sgml/config.sgml                      | 45 +++++++++++++++++
 src/backend/access/transam/parallel.c         |  4 +-
 src/backend/commands/explain.c                | 14 ++++-
 src/backend/nodes/copyfuncs.c                 |  1 +
 src/backend/nodes/outfuncs.c                  |  2 +
 src/backend/nodes/readfuncs.c                 |  1 +
 src/backend/optimizer/plan/createplan.c       |  5 ++
 src/backend/optimizer/plan/planner.c          | 73 ++++++++++++++++++++-------
 src/backend/utils/misc/guc.c                  | 24 +++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/nodes/plannodes.h                 |  1 +
 src/include/nodes/relation.h                  |  3 ++
 src/include/optimizer/planmain.h              |  9 ++++
 13 files changed, 163 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..de84b77 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3802,6 +3802,51 @@ SELECT * FROM parent WHERE key = 2400;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-force-parallel-mode" xreflabel="force_parallel_mode">
+      <term><varname>force_parallel_mode</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>force_parallel_mode</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Allows the use of parallel queries for testing purposes even in cases
+        where no performance benefit is expected.
+        The allowed values of <varname>force_parallel_mode</> are
+        <literal>off</> (use parallel mode only when it is expected to improve
+        performance), <literal>on</> (force parallel query for all queries
+        for which it is thought to be safe), and <literal>regress</> (like
+        <literal>on</>, but with additional behavior changes to facilitate
+        automated regression testing).
+       </para>
+
+       <para>
+        More specifically, setting this value to <literal>on</> will add
+        a <literal>Gather</> node to the top of any query plan for which this
+        appears to be safe, so that the query runs inside a parallel worker.
+        Even when a parallel worker is not available or cannot be used,
+        operations such as starting a subtransaction that would be prohibited
+        in a parallel query context will be prohibited unless the planner
+        believes that this will cause the query to fail.  If failures or
+        unexpected results occur when this option is set, some functions used
+        by the query may need to be marked <literal>PARALLEL UNSAFE</literal>
+        (or, possibly, <literal>PARALLEL RESTRICTED</literal>).
+       </para>
+
+       <para>
+        Setting this value to <literal>regress</> has all of the same effects
+        as setting it to <literal>on</> plus some additional effects that are
+        intended to facilitate automated regression testing.  Normally,
+        messages from a parallel worker are prefixed with a context line,
+        but a setting of <literal>regress</> suppresses this to guarantee
+        reproducible results.  Also, the <literal>Gather</> nodes added to
+        plans by this setting are hidden from the <literal>EXPLAIN</> output
+        so that the output matches what would be obtained if this setting
+        were turned <literal>off</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
    </sect1>
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index bf2e691..4f91cd0 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -22,6 +22,7 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
+#include "optimizer/planmain.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
@@ -1079,7 +1080,8 @@ ParallelExtensionTrampoline(dsm_segment *seg, shm_toc *toc)
 static void
 ParallelErrorContext(void *arg)
 {
-	errcontext("parallel worker, PID %d", *(int32 *) arg);
+	if (force_parallel_mode != FORCE_PARALLEL_REGRESS)
+		errcontext("parallel worker, PID %d", *(int32 *) arg);
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 25d8ca0..ee13136 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -23,6 +23,7 @@
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
+#include "optimizer/planmain.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteHandler.h"
 #include "tcop/tcopprot.h"
@@ -572,6 +573,7 @@ void
 ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
 {
 	Bitmapset  *rels_used = NULL;
+	PlanState *ps;
 
 	Assert(queryDesc->plannedstmt != NULL);
 	es->pstmt = queryDesc->plannedstmt;
@@ -580,7 +582,17 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
 	es->rtable_names = select_rtable_names_for_explain(es->rtable, rels_used);
 	es->deparse_cxt = deparse_context_for_plan_rtable(es->rtable,
 													  es->rtable_names);
-	ExplainNode(queryDesc->planstate, NIL, NULL, NULL, es);
+
+	/*
+	 * Sometimes we mark a Gather node as "invisible", which means that it's
+	 * not displayed in EXPLAIN output.  The purpose of this is to allow
+	 * running regression tests with force_parallel_mode=regress to get the
+	 * same results as running the same tests with force_parallel_mode=off.
+	 */
+	ps = queryDesc->planstate;
+	if (IsA(ps, GatherState) &&((Gather *) ps->plan)->invisible)
+		ps = outerPlanState(ps);
+	ExplainNode(ps, NIL, NULL, NULL, es);
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8b79fa..e54d174 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -334,6 +334,7 @@ _copyGather(const Gather *from)
 	 */
 	COPY_SCALAR_FIELD(num_workers);
 	COPY_SCALAR_FIELD(single_copy);
+	COPY_SCALAR_FIELD(invisible);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b487c00..97b7fef 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -443,6 +443,7 @@ _outGather(StringInfo str, const Gather *node)
 
 	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
+	WRITE_BOOL_FIELD(invisible);
 }
 
 static void
@@ -1826,6 +1827,7 @@ _outPlannerGlobal(StringInfo str, const PlannerGlobal *node)
 	WRITE_BOOL_FIELD(hasRowSecurity);
 	WRITE_BOOL_FIELD(parallelModeOK);
 	WRITE_BOOL_FIELD(parallelModeNeeded);
+	WRITE_BOOL_FIELD(wholePlanParallelSafe);
 	WRITE_BOOL_FIELD(hasForeignJoin);
 }
 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 6c46151..e4d41ee 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2053,6 +2053,7 @@ _readGather(void)
 
 	READ_INT_FIELD(num_workers);
 	READ_BOOL_FIELD(single_copy);
+	READ_BOOL_FIELD(invisible);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 54ff7f6..6e0db08 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -212,6 +212,10 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree */
 	plan = create_plan_recurse(root, best_path);
 
+	/* Update parallel safety information if needed. */
+	if (!best_path->parallel_safe)
+		root->glob->wholePlanParallelSafe = false;
+
 	/* Check we successfully assigned all NestLoopParams to plan nodes */
 	if (root->curOuterParams != NIL)
 		elog(ERROR, "failed to assign all NestLoopParams to plan nodes");
@@ -4829,6 +4833,7 @@ make_gather(List *qptlist,
 	plan->righttree = NULL;
 	node->num_workers = nworkers;
 	node->single_copy = single_copy;
+	node->invisible = false;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a09b4b5..a3cc274 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -48,10 +48,12 @@
 #include "storage/dsm_impl.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/syscache.h"
 
 
-/* GUC parameter */
+/* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
+int			force_parallel_mode = FORCE_PARALLEL_OFF;
 
 /* Hook for plugins to get control in planner() */
 planner_hook_type planner_hook = NULL;
@@ -230,25 +232,31 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		!has_parallel_hazard((Node *) parse, true);
 
 	/*
-	 * glob->parallelModeOK should tell us whether it's necessary to impose
-	 * the parallel mode restrictions, but we don't actually want to impose
-	 * them unless we choose a parallel plan, so that people who mislabel
-	 * their functions but don't use parallelism anyway aren't harmed.
-	 * However, it's useful for testing purposes to be able to force the
-	 * restrictions to be imposed whenever a parallel plan is actually chosen
-	 * or not.
+	 * glob->parallelModeNeeded should tell us whether it's necessary to
+	 * impose the parallel mode restrictions, but we don't actually want to
+	 * impose them unless we choose a parallel plan, so that people who
+	 * mislabel their functions but don't use parallelism anyway aren't
+	 * harmed.  But when force_parallel_mode is on or regress, we impose the
+	 * restrictions whenever possible, for testing purposes.
 	 *
-	 * (It's been suggested that we should always impose these restrictions
-	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing to
-	 * do, but for now I've taken this approach.  We could also control this
-	 * with a GUC.)
+	 * glob->wholePlanParallelSafe should tell us whether it's OK to stick a
+	 * Gather node on top of the entire plan.  However, it only needs to be
+	 * accurate when force_parallel_mode is 'on' or 'regress', so we don't
+	 * bother doing the work otherwise.  The value we set here is just a
+	 * preliminary guess; it may get changed from true to false later, but
+	 * not vice versa.
 	 */
-#ifdef FORCE_PARALLEL_MODE
-	glob->parallelModeNeeded = glob->parallelModeOK;
-#else
-	glob->parallelModeNeeded = false;
-#endif
+	if (force_parallel_mode == FORCE_PARALLEL_OFF || !glob->parallelModeOK)
+	{
+		glob->parallelModeNeeded = false;
+		glob->wholePlanParallelSafe = false;	/* either false or don't care */
+	}
+	else
+	{
+		glob->parallelModeNeeded = true;
+		glob->wholePlanParallelSafe =
+			!has_parallel_hazard((Node *) parse, false);
+	}
 
 	/* Determine what fraction of the plan is likely to be scanned */
 	if (cursorOptions & CURSOR_OPT_FAST_PLAN)
@@ -293,6 +301,35 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	}
 
 	/*
+	 * At present, we don't copy subplans to workers.  The presence of a
+	 * subplan in one part of the plan doesn't preclude the use of parallelism
+	 * in some other part of the plan, but it does preclude the possibility of
+	 * regarding the entire plan as parallel-safe.
+	 */
+	if (glob->subplans != NIL)
+		glob->wholePlanParallelSafe = false;
+
+	/*
+	 * Optionally add a Gather node for testing purposes, provided this is
+	 * actually a safe thing to do.
+	 */
+	if (glob->wholePlanParallelSafe &&
+		force_parallel_mode != FORCE_PARALLEL_OFF)
+	{
+		Gather	   *gather = makeNode(Gather);
+
+		gather->plan.targetlist = top_plan->targetlist;
+		gather->plan.qual = NIL;
+		gather->plan.lefttree = top_plan;
+		gather->plan.righttree = NULL;
+		gather->num_workers = 1;
+		gather->single_copy = true;
+		gather->invisible = (force_parallel_mode == FORCE_PARALLEL_REGRESS);
+		root->glob->parallelModeNeeded = true;
+		top_plan = &gather->plan;
+	}
+
+	/*
 	 * If any Params were generated, run through the plan tree and compute
 	 * each plan node's extParam/allParam sets.  Ideally we'd merge this into
 	 * set_plan_references' tree traversal, but for now it has to be separate
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 38ba82f..14212ee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -379,6 +379,19 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry force_parallel_mode_options[] = {
+	{"off", FORCE_PARALLEL_OFF, false},
+	{"on", FORCE_PARALLEL_ON, false},
+	{"regress", FORCE_PARALLEL_REGRESS, false},
+	{"true", FORCE_PARALLEL_ON, true},
+	{"false", FORCE_PARALLEL_OFF, true},
+	{"yes", FORCE_PARALLEL_ON, true},
+	{"no", FORCE_PARALLEL_OFF, true},
+	{"1", FORCE_PARALLEL_ON, true},
+	{"0", FORCE_PARALLEL_OFF, true},
+	{NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -863,6 +876,7 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+
 	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
@@ -3672,6 +3686,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"force_parallel_mode", PGC_USERSET, QUERY_TUNING_OTHER,
+			gettext_noop("Forces use of parallel query facilities."),
+			gettext_noop("If possible, run query using a parallel worker and with parallel restrictions.")
+		},
+		&force_parallel_mode,
+		FORCE_PARALLEL_OFF, force_parallel_mode_options,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 029114f..09b2003 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -313,6 +313,7 @@
 #from_collapse_limit = 8
 #join_collapse_limit = 8		# 1 disables collapsing of explicit
 					# JOIN clauses
+#force_parallel_mode = off
 
 
 #------------------------------------------------------------------------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 55d6bbe..ae224cf 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -775,6 +775,7 @@ typedef struct Gather
 	Plan		plan;
 	int			num_workers;
 	bool		single_copy;
+	bool		invisible;		/* suppress EXPLAIN display (for testing)? */
 } Gather;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 9492598..5c22679 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -108,6 +108,9 @@ typedef struct PlannerGlobal
 	bool		parallelModeOK; /* parallel mode potentially OK? */
 
 	bool		parallelModeNeeded;		/* parallel mode actually required? */
+
+	bool		wholePlanParallelSafe;	/* is the entire plan parallel safe? */
+
 	bool		hasForeignJoin;	/* does have a pushed down foreign join */
 } PlannerGlobal;
 
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 7ae7367..eaa642b 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -17,9 +17,18 @@
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
+/* possible values for force_parallel_mode */
+typedef enum
+{
+	FORCE_PARALLEL_OFF,
+	FORCE_PARALLEL_ON,
+	FORCE_PARALLEL_REGRESS
+} ForceParallelMode;
+
 /* GUC parameters */
 #define DEFAULT_CURSOR_TUPLE_FRACTION 0.1
 extern double cursor_tuple_fraction;
+extern int force_parallel_mode;
 
 /* query_planner callback to compute query_pathkeys */
 typedef void (*query_pathkeys_callback) (PlannerInfo *root, void *extra);
-- 
2.5.4 (Apple Git-61)
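
To illustrate what the invisible-Gather trick buys us, here's roughly how
a psql session ought to behave with these patches applied (hypothetical
table, and the plan text is a sketch of the intent rather than captured
output):

    CREATE TABLE foo (a int);

    SET force_parallel_mode = on;
    EXPLAIN (COSTS OFF) SELECT * FROM foo;
    -- The whole plan is parallel-safe, so the planner puts a single-copy
    -- Gather with one worker on top, and EXPLAIN shows it:
    --     Gather
    --       ->  Seq Scan on foo

    SET force_parallel_mode = regress;
    EXPLAIN (COSTS OFF) SELECT * FROM foo;
    -- Same plan, but the Gather is flagged invisible, so ExplainPrintPlan
    -- starts at its child and the output matches force_parallel_mode = off:
    --     Seq Scan on foo

Note also that, per the enum table in guc.c, the usual boolean spellings
(true/false, yes/no, 1/0) are accepted as hidden aliases for on and off,
but "regress" has to be spelled out.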

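One more sketch, this time of the wholePlanParallelSafe guard: a query that
generates a subplan should come out of the planner without the top-level
Gather even under force_parallel_mode = on, since we don't copy subplans to
workers (again hypothetical, not captured output):

    SET force_parallel_mode = on;
    EXPLAIN (COSTS OFF)
    SELECT * FROM foo WHERE a = (SELECT max(a) FROM foo);
    -- glob->subplans is non-NIL here, so wholePlanParallelSafe is cleared
    -- and no Gather is added on top:
    --     Seq Scan on foo
    --       Filter: (a = $0)
    --       InitPlan 1 (returns $0)
    --         ->  Aggregate
    --               ->  Seq Scan on foo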