On 10/13/2014 10:47 AM, Heikki Linnakangas wrote:
On 10/10/2014 05:08 PM, MauMau wrote:
From: "Craig Ringer" <cr...@2ndquadrant.com>
It sounds like they've produced a test case, so they should be able to
with a bit of luck.

Or even better, send you the test case.

I asked the user about this.  It sounds like the relevant test case consists
of many scripts.  He explained to me that the simplified test steps are:

1. initdb
2. pg_ctl start
3. Create 16 tables.  Each of those tables consists of around 10 columns.
4. Insert 1000 rows into each of those 16 tables.
5. Launch 16 psql sessions concurrently.  Each session updates all 1000 rows
of one table, e.g., session 1 updates table 1, session 2 updates table 2,
and so on.
6. Repeat step 5 50 times.

This sounds a bit complicated, but I understood that the core part is 16
concurrent updates, which should lead to contention on xlog insert slots
and/or spinlocks.

I was able to reproduce this. I reduced wal_buffers to 64kB, and
NUM_XLOGINSERT_LOCKS to 4 to increase the probability of the deadlock,
and ran a test case as above on my laptop for several hours, and it
finally hung. Will investigate...

Ok, I tracked the bug down to the way LWLockAcquireWithVar, LWLockRelease, and LWLockWaitForVar work. Here's a simplified model of how this happens:

Three backends are needed to cause the deadlock. Let's call them A, B and C. There are two locks, and one of them protects a variable, i.e. is used with LWLockAcquireWithVar et al. The backends run these operations:


A: (checkpointer does this in xlog.c)
LWLockAcquireWithVar(lock1, value1ptr, 0xFF)
LWLockAcquire(lock2);
LWLockRelease(lock1);
LWLockRelease(lock2);

B:
LWLockAcquire(lock2, LW_EXCLUSIVE)
LWLockWaitForVar(lock1, value1ptr, 0, &newval);
LWLockRelease(lock2);

C:
LWLockAcquireWithVar(lock1, value1ptr, 0);
LWLockRelease(lock1)


So, A acquires both locks, in the order lock1, lock2. B acquires lock2, and then waits for lock1 to become free or for its variable to take a non-zero value. A and B thus operate on the locks in opposite order, but this is not supposed to deadlock, because A sets the variable to a non-zero value, and B waits for it to become non-zero. Then there is a third actor, C, that just acquires lock1, with a zero value.

This is the sequence that leads to the deadlock:

(both locks are free in the beginning)
C: LWLockAcquireWithVar(lock1, 0). Gets the lock.
A: LWLockAcquireWithVar(lock1, 0xFF). Blocks.
B: LWLockAcquire(lock2). Gets the lock.
B: LWLockWaitForVar(lock1, 0). Blocks.

C: LWLockRelease(lock1). Wakes up A and B. Sets releaseOK=false because A is waiting for the lock in exclusive mode.
C: LWLockAcquireWithVar(lock1, 0). Steals the lock back before A or B have had a chance to run.
B: Wakes up. Observes the lock is still taken, with val 0. Adds itself back to the wait queue and goes back to sleep.
C: Releases lock1. releaseOK is false because A has not run yet, so no one is woken up.
A: Wakes up. Acquires lock1 with val 0xFF...
A: Blocks waiting on lock2.

So the gist of the problem is that LWLockRelease doesn't wake up LW_WAIT_UNTIL_FREE waiters when releaseOK == false. It should, because an LW_WAIT_UNTIL_FREE waiter is free to run once the variable has changed value, and it won't steal the lock from the other backend that's waiting to get the lock in exclusive mode anyway.
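For illustration only, here's a toy, standalone model (not the PostgreSQL code; the names are made up) of the wakeup rule the fix is after: waiters that only wait for the lock to become free are always woken, while ordinary waiters are still gated on releaseOK.

#include <stdbool.h>
#include <stdio.h>

typedef enum { WAIT_EXCLUSIVE, WAIT_SHARED, WAIT_UNTIL_FREE } WaitMode;

typedef struct Waiter
{
	const char *name;
	WaitMode	mode;
	struct Waiter *next;
} Waiter;

/* Print which waiters in the queue starting at 'head' would be woken. */
static void
wake_waiters(Waiter *head, bool releaseOK)
{
	Waiter	   *p = head;

	/* "wait until free" waiters never take the lock, so wake them always */
	for (; p && p->mode == WAIT_UNTIL_FREE; p = p->next)
		printf("wake %s (until-free)\n", p->name);

	/* ordinary waiters only if no earlier wakeup is still pending */
	if (p && releaseOK)
	{
		if (p->mode == WAIT_EXCLUSIVE)
			printf("wake %s (exclusive)\n", p->name);
		else
			for (; p && p->mode == WAIT_SHARED; p = p->next)
				printf("wake %s (shared)\n", p->name);
	}
}

int
main(void)
{
	/* the queue as in the deadlock: only B, re-queued as an until-free waiter */
	Waiter		b = {"B", WAIT_UNTIL_FREE, NULL};

	/* releaseOK is false: the old code woke nobody here, leaving B stuck */
	wake_waiters(&b, false);
	return 0;
}

With the old behaviour the whole wakeup was skipped when releaseOK was false, which is exactly the step where B gets stuck in the sequence above.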


I noticed another potential bug: LWLockAcquireCommon doesn't use a volatile pointer when it sets the value of the protected variable:

        /* If there's a variable associated with this lock, initialize it */
        if (valptr)
                *valptr = val;

        /* We are done updating shared state of the lock itself. */
        SpinLockRelease(&lock->mutex);

If the compiler or CPU decides to reorder those two, so that the variable is set after releasing the spinlock, things will break.
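To make the intended fix concrete, here is a minimal sketch of the pattern the patch uses, pulled out into a hypothetical helper for illustration (the real code stays in LWLockAcquireCommon): the store to the protected variable goes through a volatile-qualified pointer, so the compiler cannot sink it past the SpinLockRelease().

#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/spin.h"

/* Hypothetical helper, for illustration only. */
static void
set_lock_variable(LWLock *l, uint64 *valptr, uint64 val)
{
	volatile LWLock *lock = l;
	volatile uint64 *valp = valptr;	/* volatile: compiler can't sink this store past the release */

	SpinLockAcquire(&lock->mutex);

	/* If there's a variable associated with this lock, set it */
	if (valp)
		*valp = val;

	/* We are done updating shared state of the lock itself. */
	SpinLockRelease(&lock->mutex);
}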


The attached patch should fix these two bugs. It is for REL9_4_STABLE; it needs to be forward-patched to master too. This fixes the deadlock for me. Anyone see any issues with this?


Thanks, MauMau, for the testing!

- Heikki

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 5453549..270af0a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -482,6 +482,7 @@ static inline bool
 LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 {
 	volatile LWLock *lock = l;
+	volatile uint64 *valp = valptr;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -637,8 +638,8 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	}
 
 	/* If there's a variable associated with this lock, initialize it */
-	if (valptr)
-		*valptr = val;
+	if (valp)
+		*valp = val;
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
@@ -1111,6 +1112,7 @@ LWLockRelease(LWLock *l)
 {
 	volatile LWLock *lock = l;
 	PGPROC	   *head;
+	PGPROC	   *tail = NULL;
 	PGPROC	   *proc;
 	int			i;
 
@@ -1145,63 +1147,70 @@ LWLockRelease(LWLock *l)
 
 	/*
 	 * See if I need to awaken any waiters.  If I released a non-last shared
-	 * hold, there cannot be anything to do.  Also, do not awaken any waiters
-	 * if someone has already awakened waiters that haven't yet acquired the
-	 * lock.
+	 * hold, there cannot be anything to do.
 	 */
 	head = lock->head;
-	if (head != NULL)
+	if (head != NULL && lock->exclusive == 0 && lock->shared == 0)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
-		{
-			/*
-			 * Remove the to-be-awakened PGPROCs from the queue.
-			 */
-			bool		releaseOK = true;
-
-			proc = head;
+		proc = head;
 
-			/*
-			 * First wake up any backends that want to be woken up without
-			 * acquiring the lock.
-			 */
-			while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
-				proc = proc->lwWaitLink;
+		/*
+		 * First wake up any backends that want to be woken up without
+		 * acquiring the lock. They are woken up regardless of releaseOK.
+		 */
+		while (proc && proc->lwWaitMode == LW_WAIT_UNTIL_FREE)
+		{
+			tail = proc;
+			proc = proc->lwWaitLink;
+		}
 
-			/*
-			 * If the front waiter wants exclusive lock, awaken him only.
-			 * Otherwise awaken as many waiters as want shared access.
-			 */
-			if (proc->lwWaitMode != LW_EXCLUSIVE)
+		/*
+		 * If the front waiter (after any LW_WAIT_UNTIL_FREE waiters) wants
+		 * exclusive lock, awaken him only. Otherwise awaken as many waiters
+		 * as want shared access. But do not awaken any waiters if someone has
+		 * already awakened waiters that haven't yet acquired the lock.
+		 */
+		if (proc && lock->releaseOK)
+		{
+			if (proc->lwWaitMode == LW_EXCLUSIVE)
+				tail = proc;
+			else
 			{
-				while (proc->lwWaitLink != NULL &&
-					   proc->lwWaitLink->lwWaitMode != LW_EXCLUSIVE)
+				while (proc && proc->lwWaitMode != LW_EXCLUSIVE)
 				{
-					if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
-						releaseOK = false;
+					tail = proc;
 					proc = proc->lwWaitLink;
 				}
 			}
-			/* proc is now the last PGPROC to be released */
-			lock->head = proc->lwWaitLink;
-			proc->lwWaitLink = NULL;
+		}
+		/*
+		 * 'tail' is now the last PGPROC to be released. Unlink the
+		 * to-be-awakened PGPROCs from the queue.
+		 */
+		if (tail)
+		{
+			lock->head = tail->lwWaitLink;
+			tail->lwWaitLink = NULL;
 
 			/*
 			 * Prevent additional wakeups until retryer gets to run. Backends
 			 * that are just waiting for the lock to become free don't retry
 			 * automatically.
 			 */
-			if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
-				releaseOK = false;
-
-			lock->releaseOK = releaseOK;
+			if (tail->lwWaitMode != LW_WAIT_UNTIL_FREE)
+				lock->releaseOK = false;
 		}
 		else
 		{
-			/* lock is still held, can't awaken anything */
+			/* didn't wake up anyone after all */
 			head = NULL;
 		}
 	}
+	else
+	{
+		/* lock is still held, can't awaken anything */
+		head = NULL;
+	}
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);