Hi,

I created fsync patches v3, v4, and v5 and tested them.

* Changes
 - Take the overall checkpoint schedule into account in the fsync phase (v3, v4, v5)
 - Take the overall checkpoint schedule into account in the write phase (v4 only)
 - Modify some implementation details from v3 (v5 only)


I use a linear combination to account for the overall checkpoint schedule,
which consists of the write phase and the fsync phase. The v3 patch considers
only the fsync phase, the v4 patch considers both the write phase and the fsync
phase, and the v5 patch considers only the fsync phase.
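
To make the linear combination concrete, here is a minimal sketch of the
progress value that the fsync loop in the attached patches hands to
IsCheckpointOnSchedule(). The function name fsync_phase_progress and the
guard for an empty request table are only for illustration and are not in
the patches; write_share corresponds to the 0.9 passed to smgrsync().

```c
/*
 * Hypothetical helper (not part of the patches): overall checkpoint
 * progress while the checkpointer is in the fsync phase.
 *
 * write_share    - fraction of the schedule assigned to the write phase
 * processed      - number of files already fsync'ed
 * num_to_process - number of files queued for fsync
 */
double
fsync_phase_progress(double write_share, int processed, int num_to_process)
{
	/* Assumption: with nothing to sync, treat the phase as finished. */
	if (num_to_process <= 0)
		return 1.0;

	/* The write phase is done; the fsync phase fills the remaining share. */
	return write_share +
		(1.0 - write_share) * (double) processed / num_to_process;
}
```

With write_share = 0.9, the value starts at 0.9 before the first fsync and
reaches about 0.95 once half of the files are synced.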

The test results are below. The benchmark settings and server are the same as
in the previous test. The '-*' suffix shows the checkpoint_completion_target
used in each test. All tests except 'fsync v3-0.7_disabled' set
'checkpointer_fsync_delay_ratio=1' and 'checkpointer_fsync_delay_threshold=1000'.
'fsync v3-0.7_disabled' sets 'checkpointer_fsync_delay_ratio=0' and
'checkpointer_fsync_delay_threshold=-1'.
The v5 patch is still being tested :-), but I expect it to score the same as
the v3 patch.
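
As a rough sketch of how these two settings interact, the per-file sleep can
be written as below. The function name fsync_sleep_ms is hypothetical, and
the sketch deliberately leaves out the other conditions checked in the
attached patches (shutdown request, immediate/forced checkpoint flags, and
the IsCheckpointOnSchedule() test). The 10000 ms cap mirrors MAX_FSYNC_SLEEP.

```c
/* Cap on a single sleep, in milliseconds (MAX_FSYNC_SLEEP in the patches). */
#define MAX_FSYNC_SLEEP_MS 10000.0

/*
 * Hypothetical helper (not part of the patches): sleep time after one
 * file's fsync, given how long that fsync took.
 *
 * A negative threshold disables the feature. Otherwise we sleep only when
 * the fsync took longer than the threshold, for elapsed * delay_ratio.
 */
double
fsync_sleep_ms(double elapsed_ms, int threshold_ms, double delay_ratio)
{
	double		sleep_ms;

	if (threshold_ms < 0 || elapsed_ms <= threshold_ms)
		return 0.0;

	sleep_ms = elapsed_ms * delay_ratio;

	/* Too long a sleep is not good for the checkpoint scheduler. */
	if (sleep_ms > MAX_FSYNC_SLEEP_MS)
		sleep_ms = MAX_FSYNC_SLEEP_MS;

	return sleep_ms;
}
```

With the settings used in the tests (ratio 1, threshold 1000), a 1500 ms
fsync is followed by a 1500 ms sleep, and a 500 ms fsync by no sleep at all.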

* Result
** DBT-2 result
                     | NOTPM     | 90%tile | Average | S.Deviation | Maximum
---------------------+-----------+---------+---------+-------------+--------
fsync v3-0.7         | 3649.02   | 9.703   | 4.226   | 3.853       | 21.754
fsync v3-0.9         | 3694.41   | 9.897   | 3.874   | 4.016       | 20.774
fsync v3-0.7_disabled| 3583.28   | 10.966  | 4.684   | 4.866       | 31.545
fsync v4-0.7         | 3546.38   | 12.734  | 5.062   | 4.798       | 24.468
fsync v4-0.9         | 3670.81   | 9.864   | 4.130   | 3.665       | 19.236

** Average checkpoint duration (sec) (not including loading time)
                     | write_duration | sync_duration | total  | punctual to checkpoint schedule
---------------------+----------------+---------------+--------+--------------------------------
fsync v3-0.7         | 296.6          | 251.8898      | 548.48 | OK
fsync v3-0.9         | 292.086        | 276.4525      | 568.53 | OK
fsync v3-0.7_disabled| 303.5706       | 155.6116      | 459.18 | OK
fsync v4-0.7         | 273.8338       | 355.6224      | 629.45 | OK
fsync v4-0.9         | 329.0522       | 231.77        | 560.82 | OK

** Checkpoint duration relative to 'fsync v3-0.7_disabled' (%) (100% = reference)
                     | write_duration | sync_duration | total
---------------------+----------------+---------------+-------
fsync v3-0.7         | 97.7%          | 161.9%        | 119.4%
fsync v3-0.9         | 96.2%          | 177.7%        | 123.8%
fsync v3-0.7_disabled| 100.0%         | 100.0%        | 100.0%
fsync v4-0.7         | 90.2%          | 228.5%        | 137.1%
fsync v4-0.9         | 108.4%         | 148.9%        | 122.1%


* Examination
** DBT-2 result
The v3 patch gives a good result: compared with no sleep ('fsync
v3-0.7_disabled'), response time is about 10%-30% faster and NOTPM increases by
about 2%-3%. The v4 patch does not give a good result at 0.7. However, with a
larger checkpoint_completion_target, 'fsync v4-0.9' scores about the same as the
v3 patch. I think that taking the checkpoint schedule into account in both the
write phase and the fsync phase makes the I/O schedule harsher, because the
write phase then runs on a stricter I/O schedule than a normal write phase. That
is also bad for the fsync phase that follows.
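
As a cross-check, the percentages can be recomputed from the DBT-2 table
above with a small helper. The name percent_change is only for illustration.

```c
/*
 * Hypothetical helper: change of a measurement relative to a baseline,
 * in percent. Positive means larger than the baseline.
 */
double
percent_change(double value, double baseline)
{
	return 100.0 * (value - baseline) / baseline;
}
```

Against 'fsync v3-0.7_disabled', percent_change(3649.02, 3583.28) gives about
+1.8 for the NOTPM of 'fsync v3-0.7' and percent_change(3694.41, 3583.28)
about +3.1 for 'fsync v3-0.9'. For 'fsync v3-0.7' response times, the 90%tile
(9.703 vs 10.966) changes by about -11.5 and the maximum (21.754 vs 31.545)
by about -31.0.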

** Average checkpoint duration
All methods are punctual to the checkpoint schedule. With fsync sleep enabled,
the fsync time gets longer, but the total time is much the same as with no
sleep. 'fsync v4-0.7' has a very bad sync duration and total time, which
indicates that changing checkpoint_completion_target is very delicate. I think
we had better not change the write phase scheduling and should leave it as it
is. With normal settings, the write phase has sufficient time to stay punctual
to the checkpoint schedule, and I think many users want to stay compatible with
older versions.
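
To illustrate why v4 tightens the write phase, here is a minimal sketch of
the progress value that BufferSync() reports to CheckpointWriteDelay(). The
function name write_phase_progress is hypothetical. A write_share of 1.0
models the existing behaviour (kept by v3 and v5), and 0.9 models v4.

```c
/*
 * Hypothetical helper (not part of the patches): progress reported to the
 * write-phase throttle after num_written of num_to_write buffers.
 */
double
write_phase_progress(double write_share, int num_written, int num_to_write)
{
	/* Assumption: with nothing to write, the write phase's share is used up. */
	if (num_to_write <= 0)
		return write_share;

	return write_share * (double) num_written / num_to_write;
}
```

After half of the buffers are written, the existing code reports 0.5 but v4
reports about 0.45. Because the throttle only sleeps while the reported
progress is ahead of the elapsed time, v4 has to write faster to stay on
schedule.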

What do you think about these patches?

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a09adad 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1828,7 +1828,7 @@ CheckPointBuffers(int flags)
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, 0.9);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..ee67edf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB		10
 #define UNLINKS_PER_ABSORB		10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP		10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum					/* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int			processed = 0;
+	int			num_to_process;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
@@ -1171,6 +1179,28 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(progress_at_begin + (1.0 - progress_at_begin) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..6a5cc0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
-	void		(*smgr_sync) (void);	/* may be NULL */
+	void		(*smgr_sync) (int ckpt_flags, double progress_at_begin);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
 {
 	int			i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+		NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 	# range 0 - 1000000 milliseconds. -1 is disable.
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int	BgWriterDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
+extern int	CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..9f4177a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
 
 #define DROP_RELS_BSEARCH_THRESHOLD		20
 
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CHECKPOINT_SCHEDULE_RATIO		0.9
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -94,7 +97,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double ckpt_schedule_ratio);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1210,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
  * remaining flags currently have no effect here.
  */
 static void
-BufferSync(int flags)
+BufferSync(int flags, double ckpt_schedule_ratio)
 {
 	int			buf_id;
 	int			num_to_scan;
@@ -1319,7 +1322,7 @@ BufferSync(int flags)
 				/*
 				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, ckpt_schedule_ratio * (double) num_written / num_to_write);
 			}
 		}
 
@@ -1825,10 +1828,10 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
-	BufferSync(flags);
+	BufferSync(flags, CHECKPOINT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, CHECKPOINT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB		10
 #define UNLINKS_PER_ABSORB		10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP		10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum					/* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int			processed = 0;
+	int			num_to_process;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..e704b52 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
-	void		(*smgr_sync) (void);	/* may be NULL */
+	void		(*smgr_sync) (int ckpt_flags, double progress_at_begin);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	int			i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+		NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 	# range 0 - 1000000 milliseconds. -1 is disable.
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int	BgWriterDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
+extern int	CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..93a879a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
 
 #define DROP_RELS_BSEARCH_THRESHOLD		20
 
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CKPT_SCHEDULE_RATIO		0.9
+
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -1828,7 +1831,7 @@ CheckPointBuffers(int flags)
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, CKPT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB		10
 #define UNLINKS_PER_ABSORB		10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP		10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum					/* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  *	mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int			processed = 0;
+	int			num_to_process;
 	instr_time	sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber	forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..da68900 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
-	void		(*smgr_sync) (void);	/* may be NULL */
+	void		(*smgr_sync) (int ckpt_flags, double ckpt_schedule_ratio);	/* may be NULL */
 	void		(*smgr_post_ckpt) (void);		/* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  *	smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	int			i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file's fsync time exceeds this threshold, the checkpointer sleeps for file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("Sets the checkpointer's sleep time after a slow file fsync, as a multiple of the fsync time."),
+		NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..b4b3a9d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 2.0
+#checkpointer_fsync_delay_threshold = -1 	# range 0 - 1000000 milliseconds; -1 disables
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int	BgWriterDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
+extern int	CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
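
For reviewers who want to check the arithmetic without building the server, here is a minimal standalone sketch of the two calculations the mdsync() hunk performs: the linear combination passed to IsCheckpointOnSchedule(), and the capped sleep time. The function names and SKETCH_MAX_FSYNC_SLEEP_MS are hypothetical and exist only for this illustration. The value of the patch's MAX_FSYNC_SLEEP is not visible in this part of the diff, so the 2000 ms cap is an assumption. The sketch also clamps the processed fraction to 1.0, which the patch does not do, and it assumes that ckpt_schedule_ratio is the share of the schedule given to the write phase.

```c
#include <assert.h>

/*
 * Hypothetical cap in milliseconds. The patch uses MAX_FSYNC_SLEEP, whose
 * value is not visible in this part of the diff.
 */
#define SKETCH_MAX_FSYNC_SLEEP_MS 2000.0

/*
 * Overall checkpoint progress during the sync phase. Assumption: the write
 * phase owns [0, schedule_ratio] and the sync phase owns the remainder.
 */
static double
sketch_sync_progress(double schedule_ratio, int processed, int num_to_process)
{
	double		frac;

	assert(schedule_ratio >= 0.0 && schedule_ratio <= 1.0);

	if (num_to_process <= 0)
		return 1.0;				/* nothing to sync: treat as finished */

	frac = (double) processed / num_to_process;
	if (frac > 1.0)
		frac = 1.0;				/* clamp; the patch itself does not */

	return schedule_ratio + (1.0 - schedule_ratio) * frac;
}

/*
 * Sleep time in milliseconds after one fsync that took elapsed_usec.
 * Returns 0 when the feature is disabled (negative threshold) or the
 * fsync did not exceed the threshold.
 */
static double
sketch_fsync_sleep_ms(long elapsed_usec, int threshold_ms, double delay_ratio)
{
	long		elapsed_ms = elapsed_usec / 1000;
	double		sleep_ms;

	if (threshold_ms < 0 || elapsed_ms <= threshold_ms)
		return 0.0;

	sleep_ms = elapsed_ms * delay_ratio;
	if (sleep_ms > SKETCH_MAX_FSYNC_SLEEP_MS)
		sleep_ms = SKETCH_MAX_FSYNC_SLEEP_MS;

	return sleep_ms;
}
```

With the settings used in the tests above (checkpointer_fsync_delay_ratio=1, checkpointer_fsync_delay_threshold=1000), sketch_fsync_sleep_ms(1500000, 1000, 1.0) gives a 1500 ms sleep after a 1.5 second fsync, while a 0.5 second fsync gives no sleep at all.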
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
