Hi,

I have created a patch that improves the checkpoint IO scheduler to give more stable transaction response times.

* Problems with the checkpoint IO schedule under heavy transaction load
When the database is under heavy transaction load, I think the PostgreSQL checkpoint scheduler has two problems, one at the start and one at the end of a checkpoint.

The first problem is heavy IO at the beginning of each checkpoint. It is caused by full-page writes: the first modification of each page after a checkpoint starts writes the whole page to WAL, so WAL IO surges right after the checkpoint begins. Because of this surge, the WAL-based checkpoint scheduler wrongly judges that the checkpoint is behind schedule, even though it is not. This causes bad transaction response. I think the WAL-based checkpoint schedule is not appropriate at the start of a checkpoint.

The second problem is an fsync freeze at the end of a checkpoint. Normally the pages written by the checkpoint are flushed in the background by the OS's IO scheduler. But when that does not work well, the fsync at the end of the checkpoint freezes IO and slows down transactions.

Unexpectedly slow transactions can cause monitoring errors in an HA cluster and degrade the user experience of application services. This is an especially serious problem for database systems on cloud and virtual servers, which do not have much IO performance. However, postgresql.conf does not offer many parameters to deal with it. We prefer fast transaction response to a short checkpoint time. In fact the checkpoint time is short, and making it a little longer is not a problem. You may think that checkpoint_segments and checkpoint_timeout could simply be set to larger values. However, a large checkpoint_segments wastes file cache on WAL files that are never read, and a large checkpoint_timeout leads to a long crash recovery.
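
To make the first problem concrete, here is a minimal C sketch of a WAL-based schedule check of the kind described above. It is not the PostgreSQL source: the function name wal_schedule_is_behind, its arguments and the sample numbers are assumptions for illustration. Only the 16MB segment size and the roles of checkpoint_segments and checkpoint_completion_target are taken from the settings discussed in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 16MB per WAL segment, as noted in postgresql.conf.sample. */
#define WAL_SEGMENT_BYTES (16.0 * 1024 * 1024)

/*
 * Hypothetical helper: returns true when the WAL consumed since the
 * checkpoint started is ahead of the scaled write progress, i.e. the
 * checkpoint looks "behind schedule" from the WAL point of view.
 */
static bool
wal_schedule_is_behind(double progress, double completion_target,
                       uint64_t wal_bytes_since_start, int checkpoint_segments)
{
    double elapsed_xlogs;

    assert(checkpoint_segments > 0);

    /* Fraction of the checkpoint_segments budget already consumed. */
    elapsed_xlogs = ((double) wal_bytes_since_start / WAL_SEGMENT_BYTES)
        / checkpoint_segments;

    return progress * completion_target < elapsed_xlogs;
}
```

With checkpoint_segments = 64, checkpoint_completion_target = 0.5 and 5% of the buffers written, one segment of WAL looks on schedule, but a burst of four segments (as full-page writes can produce) makes the same checkpoint look late.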


* Improvements to the checkpoint IO scheduler
1. Fix for the heavy full-page-write IO at the start of a checkpoint
My idea is very simple: at the start of a checkpoint, checkpoint_completion_target is applied more loosely. I added three parameters for this: 'checkpoint_smooth_target', 'checkpoint_smooth_margin' and 'checkpointer_write_delay'. 'checkpoint_smooth_target' is the point in checkpoint progress up to which the smooth IO schedule is applied. 'checkpoint_smooth_margin' makes the schedule smoother still. It is a heuristic parameter, but it solves this problem effectively. 'checkpointer_write_delay' is the sleep time used by the checkpoint schedule. It is nearly the same as 'bgwriter_delay' in PG 9.1 and older.
 For more details, please see the attached patch.
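
The scaling applied to the checkpoint progress can be sketched as a standalone C function. The arithmetic follows the change to IsCheckpointOnSchedule() in the attached patch, but the function name smooth_scaled_progress and its argument list are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/*
 * Hypothetical standalone helper mirroring the progress scaling in the patch.
 * At or beyond smooth_target the normal checkpoint_completion_target scaling
 * is used. Before it, the factor starts at (smooth_margin + 1) and falls
 * linearly to completion_target as progress approaches smooth_target.
 */
static double
smooth_scaled_progress(double progress, double completion_target,
                       double smooth_target, double smooth_margin)
{
    double remaining;

    assert(progress >= 0.0 && progress <= 1.0);

    /* Normal schedule; also taken when smooth_target is 0.0 (disabled). */
    if (progress >= smooth_target)
        return progress * completion_target;

    /* Share of the smooth phase still ahead: 1.0 at start, 0.0 at the target. */
    remaining = (smooth_target - progress) / smooth_target;

    return progress * (remaining * (smooth_margin + 1.0 - completion_target)
                       + completion_target);
}
```

For example, with checkpoint_completion_target = 0.5, checkpoint_smooth_target = 0.4 and checkpoint_smooth_margin = 0.2, a progress of 0.2 is scaled to 0.17 instead of 0.10, so the early part of the checkpoint is judged to be on schedule more readily.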

2. Fix for the fsync freeze at the end of a checkpoint
When an fsync freeze happens, issuing more fsyncs back to back is pointless and stalls transactions. So I think that when an fsync takes a long time, the IO queue is flooded and IO priority should be given to transactions so that they respond quickly. The patch does this by sleeping after an fsync that took a long time. This may seem to make the checkpoint take much longer, but it does not. When an fsync is slow, the IO queue is already full of other IO, including checkpoint writes, so the sleep only gives IO priority to the transactions that are running. I tested my patch with the DBT-2 benchmark; please see the results below. My patch achieves higher throughput and faster response than plain PG. The checkpoint takes a little longer than with plain PG, but not seriously so.
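
The sleep decision after each fsync can be sketched as follows. The logic follows the change to mdsync() in the attached patch, with checkpointer_fsync_delay_threshold and checkpointer_fsync_delay_ratio as inputs. The function name fsync_sleep_usec and the 'urgent' flag (standing for a shutdown or immediate checkpoint request) are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns how many microseconds to sleep after an fsync
 * that took elapsed_usec microseconds. A negative threshold disables the
 * feature, and no sleep is taken when the checkpoint must finish quickly.
 */
static long
fsync_sleep_usec(uint64_t elapsed_usec, int threshold_ms,
                 double delay_ratio, bool urgent)
{
    uint64_t elapsed_ms = elapsed_usec / 1000;

    assert(delay_ratio >= 0.0 && delay_ratio <= 1.0);

    if (threshold_ms < 0 || urgent)
        return 0;

    /* Only fsyncs slower than the threshold trigger a sleep. */
    if (elapsed_ms <= (uint64_t) threshold_ms)
        return 0;

    /* Sleep for a fraction of the fsync time, converted back to usec. */
    return (long) ((double) elapsed_ms * delay_ratio * 1000.0);
}
```

With a threshold of 100 ms and a ratio of 0.5, an fsync that took 500 ms is followed by a 250 ms sleep, while an fsync that took 50 ms is followed by none.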


* DBT-2 results with this patch (compared with original PG 9.2.4)
I used the DBT-2 benchmark software from OSDL. I also used pg_statsinfo and pg_stats_reporter in this benchmark.

  - Patched PG (patched 9.2.4)
    DBT-2 result:     http://goo.gl/1PD3l
    statsinfo report: http://goo.gl/UlGAO
    settings:         http://goo.gl/X4Whu

  - Original PG (9.2.4)
    DBT-2 result:     http://goo.gl/XVxtj
    statsinfo report: http://goo.gl/UT1Li
    settings:         http://goo.gl/eofmb

'Measurement Value' improved by 4%, 'new-order 90%tile' by 20%, 'new-order average' by 18%, 'new-order deviation' by 24%, and 'new-order maximum' by 27%. In the pg_stats_reporter report I confirmed high throughput and WAL IO while a checkpoint was running. My patch gives fast transaction response and keeps running transactions from blocking.

The downside of my patch is a longer checkpoint. Checkpoint time increased by about 10% - 20%. But the checkpoint still completes correctly on schedule, within checkpoint_timeout. Please see the checkpoint results (http://goo.gl/NsbC6).

* Test server
  Server: HP Proliant DL360 G7
  CPU:    Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB(PC3-10600R-9)
  Disk:   146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB


This is not an advertisement for pg_statsinfo and pg_stats_reporter :-) They are free software. If you have comments or other ideas about my patch, please send them to me.

Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..a66ce36 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,16 +141,21 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+volatile sig_atomic_t checkpoint_requested = false;
+volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -169,7 +174,6 @@ static pg_time_t last_xlog_switch_time;
 
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +647,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -715,7 +719,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter <= 0)
 	{
@@ -742,14 +746,35 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+			elapsed_xlogs,
 				elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
+	/* Progress under the normal schedule; used by the smooth checkpoint schedule. */
+	original_progress = progress * CheckPointCompletionTarget;
+	
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if(progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/* Smooth checkpoint schedule.
+		 *
+		 * At the start of a checkpoint, IO load tends to be high and
+		 * transactions slow. This schedule reduces both and improves IO
+		 * response. As 'progress' approaches CheckPointSmoothTarget, the
+		 * schedule converges to the normal one. For a smoother checkpoint
+		 * schedule, set a higher CheckPointSmoothTarget.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) * 
+				(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget)
+				 + CheckPointCompletionTarget;
+	}
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 			ckpt_cached_elapsed = elapsed_xlogs;
 			return false;
 		}
+		else if (original_progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+
+			/* smooth checkpoint write */
+			pg_usleep(CheckPointerWriteDelay * 1000L);
+			return false;
+		}
 	}
 
 	/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..e558eb7 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -162,6 +163,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum					/* behavior for mdopen & _mdfd_getseg */
 {
@@ -1171,6 +1174,18 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync took a long time, sleep for 'fsync time *
+						 * checkpointer_fsync_delay_ratio' to give priority to running transactions.
+						 */
+						if( CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold)){
+							pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+							elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+                                                                 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+							}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..f3fa5ab 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,30 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+                {"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("Checkpointer sleep time between dirty buffer writes during a checkpoint."),
+                        NULL,
+			GUC_UNIT_MS
+                },
+                &CheckPointerWriteDelay,
+                200, 10, 10000,
+                NULL, NULL, NULL
+        },
+
+        {
+                {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("If a file fsync takes longer than this threshold, the checkpointer sleeps for fsync_time * checkpointer_fsync_delay_ratio."),
+                        NULL,
+			GUC_UNIT_MS
+                },
+                &CheckPointerFsyncDelayThreshold,
+                -1, -1, 1000000,
+                NULL, NULL, NULL
+        },
+
+
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2575,36 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+        {
+                {"checkpoint_smooth_target", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("Smooths IO load from the start of a checkpoint until checkpoint progress reaches this target."),
+                        NULL
+                },
+                &CheckPointSmoothTarget,
+                0.0, 0.0, 1.0,
+                NULL, NULL, NULL
+        },
+
+	{
+		{"checkpoint_smooth_margin", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("Extra margin for smoothing IO load between the start of a checkpoint and checkpoint_smooth_target."),
+		NULL
+		},
+		&CheckPointSmoothMargin,
+		0.0, 0.0, 1.0,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("Fraction of a slow file fsync's duration that the checkpointer sleeps afterwards."),
+		NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 1.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..9c07bd8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -185,7 +185,12 @@
 #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
+#checkpoint_smooth_target = 0.0		# smooth checkpoint target, 0.0 - 1.0
+#checkpoint_smooth_margin = 0.0		# smooth checkpoint margin, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_write_delay = 200ms	# 10-10000 milliseconds
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1	# 0-1000000 milliseconds, -1 disables
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..5964b99 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,9 +21,14 @@
 
 /* GUC options */
 extern int	BgWriterDelay;
+extern int	CheckPointerWriteDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
+extern int	CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointSmoothTarget;
+extern double CheckPointSmoothMargin;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +36,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,