Ron Mayer wrote:
> Might smoother checkpoints be better solved by talking
> to the OS vendors & virtual-memory-tuning-knob-authors
> to work with them on exposing the ideal knobs; rather than
> saying that our only tool is a hammer (fsync) so the problem
> must be handled as a nail.

Maybe, but it's hard to argue that the current implementation--just doing all of the sync calls as fast as possible, one after the other--isn't going to produce worst-case behavior in a lot of situations. Given that it doesn't take a huge amount of code to do better, I'd rather do some work in that direction instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel change I've tested since then, and I'm not really optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done.

I've attached an updated version of the initial sync spreading patch, one that applies cleanly both on top of HEAD and on top of the sync instrumentation patch. The conflict that made that combination difficult before is gone now.

Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time has made me reconsider how important one potential bit of refactoring here would be. On my test system I managed to catch one of the situations where really popular relations were being heavily updated in a way that competed with the checkpoint (I can happily share the full logs), with the instrumentation patch applied but not the spread sync one:

LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung trying to get 16448_fsm written out, while the backends were issuing a constant stream of competing fsync calls against these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of until now. It looks essentially the same, except with the contention going on for minutes instead of seconds.

One of the ideas Simon and I had been considering at one point was adding better de-duplication logic to the fsync absorb code; the pattern here reminds me that it might be helpful independently of the other improvements.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct
 
 static BgWriterShmemStruct *BgWriterShmem;
 
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
+/* Divisor: act once 1/ABSORB_ACTION_DIVISOR of the fsync absorb queue has filled */
+#define ABSORB_ACTION_DIVISOR	10
 
 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Process any requests or signals received recently.
 		 */
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(false);
 
 		if (got_SIGHUP)
 		{
@@ -636,7 +636,7 @@ BgWriterNap(void)
 		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
+		AbsorbFsyncRequests(true);
 		udelay -= 1000000L;
 	}
 
@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
 	if (!am_bg_writer)
 		return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
 			ProcessConfigFile(PGC_SIGHUP);
 		}
 
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
 
 		BgBufferSync();
 		CheckArchiveTimeout();
 		BgWriterNap();
 	}
-	else if (--absorb_counter <= 0)
+	else
 	{
 		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
+		 * Check for overflow of the fsync request queue.
 		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
+		AbsorbFsyncRequests(false);
+	}
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+	pg_time_t	sync_start_time;
+	int			sync_delay_secs;
+
+	/*
+	 * Delay after each sync, in seconds.  This could be a parameter.  But
+	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+	for (;;)
+	{
+		AbsorbFsyncRequests(false);
+		BgBufferSync();
+		CheckArchiveTimeout();
+		BgWriterNap();
+
+		/*
+		 * Are we there yet?
+		 */
+		now = (pg_time_t) time(NULL);
+		sync_delay_secs = now - sync_start_time;
+		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
 	}
 }
 
@@ -1116,16 +1157,40 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
 	BgWriterRequest *requests = NULL;
 	BgWriterRequest *request;
 	int			n;
 
+	/*
+	 * ABSORB_ACTION_DIVISOR (defined above) sets when absorption action
+	 * needs to be taken.  The default aims to empty the queue whenever
+	 * 1 / 10 = 10% of it is full.  If this isn't good enough, you probably
+	 * need to lower bgwriter_delay, rather than presume this needs to be
+	 * a tunable you can decrease.
+	 */
+
 	if (!am_bg_writer)
 		return;
 
 	/*
+	 * If the queue isn't very large, don't worry about absorbing yet.
+	 * Access integer counter without lock, to avoid queuing.
+	 */
+	if (!force && BgWriterShmem->num_requests <
+			(BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+	{
+		if (BgWriterShmem->num_requests > 0)
+			elog(DEBUG1, "absorb queue: %d fsync requests, not processing",
+				BgWriterShmem->num_requests);
+		return;
+	}
+
+	elog(DEBUG1, "absorb queue: %d fsync requests, processing",
+		BgWriterShmem->num_requests);
+
+	/*
 	 * We have to PANIC if we fail to absorb all the pending requests (eg,
 	 * because our hashtable runs out of memory).  This is because the system
 	 * cannot run safely if we are unable to fsync what we have been told to
@@ -1164,4 +1230,9 @@ AbsorbFsyncRequests(void)
 		pfree(requests);
 
 	END_CRIT_SECTION();
+
+	/*
+	 * Send off activity statistics to the stats collector
+	 */
+	pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index cadd938..c89486e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,9 +31,6 @@
 #include "pg_trace.h"
 
 
-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB		10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC	(InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC	(InvalidBlockNumber-1)
@@ -926,7 +923,6 @@ mdsync(void)
 
 	HASH_SEQ_STATUS hstat;
 	PendingOperationEntry *entry;
-	int			absorb_counter;
 
 	/* Statistics on sync times */
 	int processed = 0;
@@ -951,7 +947,7 @@ mdsync(void)
 	 * queued an fsync request before clearing the buffer's dirtybit, so we
 	 * are safe as long as we do an Absorb after completing BufferSync().
 	 */
-	AbsorbFsyncRequests();
+	AbsorbFsyncRequests(true);
 
 	/*
 	 * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -994,7 +990,6 @@ mdsync(void)
 	mdsync_in_progress = true;
 
 	/* Now scan the hashtable for fsync requests to process */
-	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
@@ -1019,17 +1014,9 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
-			 * often to prevent overflow of the fsync request queue.  It is
-			 * unspecified whether newly-added entries will be visited by
-			 * hash_seq_search, but we don't care since we don't need to
-			 * process them anyway.
+			 * If in bgwriter, perform normal duties.
 			 */
-			if (--absorb_counter <= 0)
-			{
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;
-			}
+			CheckpointSyncDelay();
 
 			/*
 			 * The fsync table could contain requests to fsync segments that
@@ -1121,10 +1108,9 @@ mdsync(void)
 				pfree(path);
 
 				/*
-				 * Absorb incoming requests and check to see if canceled.
+				 * If in bgwriter, perform normal duties.
 				 */
-				AbsorbFsyncRequests();
-				absorb_counter = FSYNCS_PER_ABSORB;		/* might as well... */
+				CheckpointSyncDelay();
 
 				if (entry->canceled)
 					break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);
 
 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
 					BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);
 
 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)