On Wed, Aug 3, 2011 at 11:21 AM, Robert Haas <robertmh...@gmail.com> wrote:
> About nine months ago, we had a discussion of some benchmarking that
> was done by the mosbench folks at MIT:
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php
>
> Although the authors used PostgreSQL as a test harness for driving
> load, it's pretty clear from reading the paper that their primary goal
> was to stress the Linux kernel, so the applicability of the paper to
> real-world PostgreSQL performance improvement is less than it might
> be.  Still, having now actually investigated in some detail many of
> the same performance issues that they were struggling with, I have a
> much clearer understanding of what's really going on here.  In
> PostgreSQL terms, here are the bottlenecks they ran into:
>
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
> because PostgreSQL protects its free-list with a single lock and thus
> scales poorly with smaller caches."  This is a complaint about
> BufFreeList lock which, in fact, I've seen as a huge point of
> contention on some workloads.  In fact, on read-only workloads, with
> my lazy vxid lock patch applied, this is, I believe, the only
> remaining unpartitioned LWLock that is ever taken in exclusive mode;
> or at least the only one that's taken anywhere near often enough to
> matter.  I think we're going to do something about this, although I
> don't have a specific idea in mind at the moment.
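
For anyone who wants to see it concretely, the lock being complained
about covers the entire victim-selection path.  In outline (paraphrased
from src/backend/storage/buffer/freelist.c, with details elided),
pre-patch StrategyGetBuffer() is:

	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
	StrategyControl->numBufferAllocs++;	/* allocation statistics */
	while (StrategyControl->firstFreeBuffer >= 0)
	{
		/* pop the freelist head; return it if unpinned and unused */
	}
	for (;;)
	{
		/* otherwise advance nextVictimBuffer and run the clock sweep */
	}
	/* the lock is still held when a victim is returned */

so every backend that needs a buffer serializes on that one LWLock.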

I was going to ask if you had done any benchmarks at a scale such
that the tables fit in RAM but not in shared_buffers.  I guess you
have.

The attached experimental patch fixed freelist contention on 8 cores.
It would be nice to see what happens above that.

It has been cherry-picked up to HEAD, but not tested against it.
(Last tested in Dec 2010; my, how time flies.)

The approach is to move the important operations from an LWLock to a
spinlock, and to do no locking at all for the clock-hand increment or
for numBufferAllocs.  That means that some buffers might occasionally
get inspected twice and some might not get inspected at all during any
given clock cycle, but this should not lead to any correctness
problems.  (Disclosure: Tom didn't like this approach when it was last
discussed.)
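
To make the diff below easier to follow, here is the clock-sweep half
of the idea condensed into one place (a sketch, not the patch itself;
names as in freelist.c):

	/*
	 * Read and advance the clock hand with no lock; two backends may
	 * fetch the same victim, or one buffer may be skipped.  Only the
	 * wraparound (and completePasses) goes through the new spinlock.
	 */
	int localVictim = StrategyControl->nextVictimBuffer;	/* racy read */

	if (localVictim >= NBuffers)
		localVictim = 0;	/* another backend raced past the end */

	buf = &BufferDescriptors[localVictim];

	if (++StrategyControl->nextVictimBuffer >= NBuffers)
	{
		SpinLockAcquire(&StrategyControl->mutex);
		if (StrategyControl->nextVictimBuffer >= NBuffers)
		{
			StrategyControl->nextVictimBuffer = 0;
			StrategyControl->completePasses++;
		}
		SpinLockRelease(&StrategyControl->mutex);
	}

The freelist itself, and StrategyFreeBuffer(), stay under the spinlock,
since a corrupted free list would be an actual correctness problem.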

I just offer this for whatever it is worth to you--I'm not proposing
it as an actual patch to be applied.

When data fits in RAM but not in shared_buffers, maybe the easiest fix
is to increase shared_buffers, which brings up the other question I had
for you about your work with Nate's celebrated loaner machine.  Have
you tried to reproduce the performance problems that have been
reported (but without public disclosure of how to reproduce them) with
shared_buffers > 8GB on machines with RAM >> 8GB?

Cheers,

Jeff
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bf9903b..304135e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -28,7 +28,8 @@ typedef struct
 	int			nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
-	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			lastFreeBuffer;		 /* Tail of list of unused buffers */
+	slock_t			mutex;			/* Protects these fields (nextVictimBuffer is sometimes changed without it) */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -44,7 +45,7 @@ typedef struct
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -123,15 +124,13 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.	Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
	 * This is now done without a lock, so an occasional update can be lost.
 	 */
+
 	StrategyControl->numBufferAllocs++;
 
 	/*
@@ -142,6 +141,12 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 	 */
 	while (StrategyControl->firstFreeBuffer >= 0)
 	{
+		SpinLockAcquire(&StrategyControl->mutex);
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			SpinLockRelease(&StrategyControl->mutex);
+			break;
+		}
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
@@ -149,6 +154,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
 
+		SpinLockRelease(&StrategyControl->mutex);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; discard it and retry.  (This can only happen if VACUUM put a
@@ -161,21 +168,32 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		{
 			if (strategy != NULL)
 				AddBufferToRing(strategy, buf);
+			*lock_held = false;
 			return buf;
 		}
 		UnlockBufHdr(buf);
 	}
+	*lock_held = false;
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+
+		int localVictim = StrategyControl->nextVictimBuffer;	/* unlocked read */
+		if (localVictim >= NBuffers) localVictim = 0;	/* may have raced past NBuffers */
+
+		buf = &BufferDescriptors[localVictim];
 
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
 		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			SpinLockAcquire(&StrategyControl->mutex);
+			if (StrategyControl->nextVictimBuffer >= NBuffers)
+			{
+				StrategyControl->nextVictimBuffer = 0;
+				StrategyControl->completePasses++;
+			}
+			SpinLockRelease(&StrategyControl->mutex);
 		}
 
 		/*
@@ -223,7 +241,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->mutex);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -237,7 +255,7 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		StrategyControl->firstFreeBuffer = buf->buf_id;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->mutex);
 }
 
 /*
@@ -256,8 +274,9 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->mutex);
 	result = StrategyControl->nextVictimBuffer;
+	if (result >= NBuffers) result = 0;	/* unlocked increment may overshoot */
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
 	if (num_buf_alloc)
@@ -265,7 +284,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
 	}
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->mutex);
 	return result;
 }
 
@@ -337,6 +356,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		SpinLockInit(&StrategyControl->mutex);
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;