Hi,

I have created the fsync v3, v4, and v5 patches and tested them.

* Changes
 - Take the total checkpoint schedule into account in the fsync phase (v3, v4, v5)
 - Take the total checkpoint schedule into account in the write phase (v4 only)
 - Modify some implementation details from v3 (v5 only)

I use a linear combination to map the progress of the write phase and the
fsync phase onto the total checkpoint schedule. The v3 patch applies the
schedule only to the fsync phase, the v4 patch applies it to both the write
phase and the fsync phase, and the v5 patch applies it only to the fsync phase.

The test results are below. The benchmark settings and the server are the same
as in the previous test. The '-*' suffix shows the checkpoint_completion_target
used in each test. All tests except 'fsync v3-0.7_disabled' set
'checkpointer_fsync_delay_ratio=1' and 'checkpointer_fsync_delay_threshold=1000'.
'fsync v3-0.7_disabled' sets 'checkpointer_fsync_delay_ratio=0' and
'checkpointer_fsync_delay_threshold=-1'. I am still testing the v5 patch :-),
but I expect it to score the same as the v3 patch.

* Result
** DBT-2 result
                      |   NOTPM   | 90%tile | Average | S.Deviation | Maximum
----------------------+-----------+---------+---------+-------------+--------
fsync v3-0.7          |   3649.02 |   9.703 |   4.226 |       3.853 |  21.754
fsync v3-0.9          |   3694.41 |   9.897 |   3.874 |       4.016 |  20.774
fsync v3-0.7_disabled |   3583.28 |  10.966 |   4.684 |       4.866 |  31.545
fsync v4-0.7          |   3546.38 |  12.734 |   5.062 |       4.798 |  24.468
fsync v4-0.9          |   3670.81 |   9.864 |   4.130 |       3.665 |  19.236

** Average checkpoint duration (sec) (loading time is not included)
                      | write_duration | sync_duration | total  | punctual to checkpoint schedule
----------------------+----------------+---------------+--------+--------------------------------
fsync v3-0.7          |          296.6 |      251.8898 | 548.48 | OK
fsync v3-0.9          |        292.086 |      276.4525 | 568.53 | OK
fsync v3-0.7_disabled |       303.5706 |      155.6116 | 459.18 | OK
fsync v4-0.7          |       273.8338 |      355.6224 | 629.45 | OK
fsync v4-0.9          |       329.0522 |        231.77 | 560.82 | OK

** Checkpoint duration relative to 'fsync v3-0.7_disabled' (%)
                      | write_duration | sync_duration | total
----------------------+----------------+---------------+-------
fsync v3-0.7          |          97.7% |        161.9% | 119.4%
fsync v3-0.9          |          96.2% |        177.7% | 123.8%
fsync v3-0.7_disabled |         100.0% |        100.0% | 100.0%
fsync v4-0.7          |          90.2% |        228.5% | 137.1%
fsync v4-0.9          |         108.4% |        148.9% | 122.1%

* Examination
** DBT-2 result
The v3 patch gives good results: response time is about 10%-30% faster and
NOTPM is about 2-3% higher than without sleep ('fsync v3-0.7_disabled'). The v4
patch does not give good results at checkpoint_completion_target = 0.7.
However, with a larger checkpoint_completion_target, 'fsync v4-0.9' scores
about the same as the v3 patch. I think that applying the checkpoint schedule
to both the write phase and the fsync phase makes the I/O schedule harsher,
because the write phase then has to follow a stricter I/O schedule than the
normal write phase. That is also bad for the fsync phase that follows it.

** Average checkpoint duration
All methods are punctual to the checkpoint schedule. With fsync sleep enabled,
the fsync phase takes longer, but the total time is not very different from the
no-sleep case (119.4% and 123.8% for v3). 'fsync v4-0.7' has a very bad sync
duration and total time. This indicates that v4 is very sensitive to the
checkpoint_completion_target setting. It would be better not to change the
write phase scheduling and to leave it as it has always been. With normal
settings, the write phase has enough time to stay on the checkpoint schedule.
I also think that many users want behavior that is compatible with older
versions.

What do you think about these patches?

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int CheckPointTimeout = 300;
 int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
 double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..a09adad 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1828,7 +1828,7 @@ CheckPointBuffers(int flags)
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, 0.9);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..ee67edf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB 10
 #define UNLINKS_PER_ABSORB 10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum /* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double progress_at_begin)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int processed = 0;
+	int num_to_process;
 	instr_time sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber forknum;
@@ -1171,6 +1179,28 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(progress_at_begin + (1.0 - progress_at_begin) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..6a5cc0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void (*smgr_pre_ckpt) (void); /* may be NULL */
-	void (*smgr_sync) (void); /* may be NULL */
+	void (*smgr_sync) (int ckpt_flags, double progress_at_begin); /* may be NULL */
 	void (*smgr_post_ckpt) (void); /* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  * smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double progress_at_begin)
 {
 	int i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, progress_at_begin);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+			NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1	# range 0 - 1000000 milliseconds. -1 is disable.
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int BgWriterDelay;
 extern int CheckPointTimeout;
 extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..e8efcbe 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double progress_at_begin);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double progress_at_begin);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
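
The sleep rule in the md.c hunk of the patch above can be read in isolation as follows. This is only a minimal standalone sketch, assuming the schedule check and the checkpoint-flag checks have already passed; the function name fsync_sleep_ms and the macro MAX_FSYNC_SLEEP_MS are made up for illustration, while the threshold test, the ratio multiplication, and the 10000 ms cap are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on one sleep, in milliseconds (MAX_FSYNC_SLEEP in the patch). */
#define MAX_FSYNC_SLEEP_MS 10000.0

/*
 * Hypothetical helper, not part of the patch: returns how long the
 * checkpointer would sleep (in milliseconds) after one file fsync that
 * took elapsed_usec microseconds, or 0.0 if no sleep applies.
 * The schedule and checkpoint-flag checks from the patch are omitted.
 */
static double
fsync_sleep_ms(uint64_t elapsed_usec, int threshold_ms, double delay_ratio)
{
	uint64_t	elapsed_ms = elapsed_usec / 1000;	/* integer ms, as in the patch */
	double		sleep_ms;

	assert(delay_ratio >= 0.0);

	/* A negative threshold disables the feature. */
	if (threshold_ms < 0)
		return 0.0;

	/* Only fsyncs slower than the threshold trigger a sleep. */
	if (elapsed_ms <= (uint64_t) threshold_ms)
		return 0.0;

	/* Sleep in proportion to the fsync time, capped at the maximum. */
	sleep_ms = (double) elapsed_ms * delay_ratio;
	if (sleep_ms > MAX_FSYNC_SLEEP_MS)
		sleep_ms = MAX_FSYNC_SLEEP_MS;
	return sleep_ms;
}
```

With the benchmark settings (ratio 1, threshold 1000), a file fsync of 2.5 seconds would be followed by a 2.5 second sleep, and a 30 second fsync would be capped at a 10 second sleep.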
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int CheckPointTimeout = 300;
 int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
 double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..9f4177a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
 
 #define DROP_RELS_BSEARCH_THRESHOLD 20
 
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CHECKPOINT_SCHEDULE_RATIO 0.9
+
 /* GUC variables */
 bool zero_damaged_pages = false;
 int bgwriter_lru_maxpages = 100;
@@ -94,7 +97,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
-static void BufferSync(int flags);
+static void BufferSync(int flags, double ckpt_schedule_ratio);
 static int SyncOneBuffer(int buf_id, bool skip_recently_used);
 static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
@@ -1207,7 +1210,7 @@ UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
  * remaining flags currently have no effect here.
  */
 static void
-BufferSync(int flags)
+BufferSync(int flags, double ckpt_schedule_ratio)
 {
 	int buf_id;
 	int num_to_scan;
@@ -1319,7 +1322,7 @@ BufferSync(int flags)
 				/*
 				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, ckpt_schedule_ratio * (double) num_written / num_to_write);
 			}
 		}
 
@@ -1825,10 +1828,10 @@ CheckPointBuffers(int flags)
 {
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
 	CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
-	BufferSync(flags);
+	BufferSync(flags, CHECKPOINT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, CHECKPOINT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB 10
 #define UNLINKS_PER_ABSORB 10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum /* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int processed = 0;
+	int num_to_process;
 	instr_time sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..e704b52 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void (*smgr_pre_ckpt) (void); /* may be NULL */
-	void (*smgr_sync) (void); /* may be NULL */
+	void (*smgr_sync) (int ckpt_flags, double progress_at_begin); /* may be NULL */
 	void (*smgr_post_ckpt) (void); /* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  * smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	int i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+			NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..707b433 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1	# range 0 - 1000000 milliseconds. -1 is disable.
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int BgWriterDelay;
 extern int CheckPointTimeout;
 extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
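
The patch above splits the checkpoint schedule between the write phase and the fsync phase with a fixed ratio of 0.9. A minimal sketch of that linear mapping, assuming nothing beyond the two expressions passed to CheckpointWriteDelay() and IsCheckpointOnSchedule() in the diff; the two function names are hypothetical and do not exist in PostgreSQL:

```c
#include <assert.h>

/*
 * Hypothetical helper: overall checkpoint progress while buffers are
 * being written. The write phase covers the range [0, ratio].
 */
static double
write_phase_progress(double ratio, int num_written, int num_to_write)
{
	assert(num_to_write > 0);
	return ratio * (double) num_written / num_to_write;
}

/*
 * Hypothetical helper: overall checkpoint progress while files are
 * being fsynced. The fsync phase covers the range [ratio, 1].
 */
static double
fsync_phase_progress(double ratio, int processed, int num_to_process)
{
	assert(num_to_process > 0);
	return ratio + (1.0 - ratio) * (double) processed / num_to_process;
}
```

With a ratio of 0.9, a write phase that is half done reports about 0.45, and an fsync phase that is half done reports about 0.95. Scaling the write phase this way means it has to finish within 90% of the time it had before, which matches the stricter write schedule described in the mail.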
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..d09fe4f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -143,14 +143,16 @@ static CheckpointerShmemStruct *CheckpointerShmem;
  */
 int CheckPointTimeout = 300;
 int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
 double CheckPointCompletionTarget = 0.5;
+double CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+extern volatile sig_atomic_t checkpoint_requested = false;
+extern volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -168,8 +170,6 @@ static pg_time_t last_xlog_switch_time;
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +643,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -737,7 +737,7 @@ CheckpointWriteDelay(int flags, double progress)
  * checkpoint, and returns true if the progress we've made this far is greater
  * than the elapsed time/segments.
  */
-static bool
+extern bool
 IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..93a879a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,6 +66,9 @@
 
 #define DROP_RELS_BSEARCH_THRESHOLD 20
 
+/* Checkpoint schedule ratio of write phase to fsync phase */
+#define CKPT_SCHEDULE_RATIO 0.9
+
 /* GUC variables */
 bool zero_damaged_pages = false;
 int bgwriter_lru_maxpages = 100;
@@ -1828,7 +1831,7 @@ CheckPointBuffers(int flags)
 	BufferSync(flags);
 	CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
-	smgrsync();
+	smgrsync(flags, CKPT_SCHEDULE_RATIO);
 	CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
 	TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..9809fb1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -44,6 +45,9 @@
 #define FSYNCS_PER_ABSORB 10
 #define UNLINKS_PER_ABSORB 10
 
+/* Protect too long sleep in each file fsync. */
+#define MAX_FSYNC_SLEEP 10000
+
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
@@ -162,6 +166,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum /* behavior for mdopen & _mdfd_getseg */
 {
@@ -235,7 +241,7 @@ SetForwardFsyncRequests(void)
 	/* Perform any pending fsyncs we may have queued up, then drop table */
 	if (pendingOpsTable)
 	{
-		mdsync();
+		mdsync(CHECKPOINT_IMMEDIATE, 0.0);
 		hash_destroy(pendingOpsTable);
 	}
 	pendingOpsTable = NULL;
@@ -974,7 +980,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
  * mdsync() -- Sync previous writes to stable storage.
  */
 void
-mdsync(void)
+mdsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	static bool mdsync_in_progress = false;
 
@@ -984,6 +990,7 @@ mdsync(void)
 
 	/* Statistics on sync times */
 	int processed = 0;
+	int num_to_process;
 	instr_time sync_start,
 				sync_end,
 				sync_diff;
@@ -1052,6 +1059,7 @@ mdsync(void)
 	/* Now scan the hashtable for fsync requests to process */
 	absorb_counter = FSYNCS_PER_ABSORB;
 	hash_seq_init(&hstat, pendingOpsTable);
+	num_to_process = hash_get_num_entries(pendingOpsTable);
 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 	{
 		ForkNumber forknum;
@@ -1171,6 +1179,29 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync has long time, we sleep 'fsync-time * checkpoint_fsync_delay_ratio'
+						 * for giving priority to executing transaction.
+						 */
+						if(CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							!(ckpt_flags & CHECKPOINT_IMMEDIATE) &&
+							!(ckpt_flags & CHECKPOINT_FORCE) &&
+							!(ckpt_flags & CHECKPOINT_END_OF_RECOVERY) &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold) &&
+							IsCheckpointOnSchedule(ckpt_schedule_ratio + (1.0 - ckpt_schedule_ratio) * (double) processed / num_to_process))
+						{
+							double fsync_sleep = (elapsed / 1000) * CheckPointerFsyncDelayRatio;
+
+							/* Too long sleep is not good for checkpoint scheduler */
+							if(fsync_sleep > MAX_FSYNC_SLEEP)
+								fsync_sleep = MAX_FSYNC_SLEEP;
+							pg_usleep(fsync_sleep * 1000L);
+							if(log_checkpoints)
+								elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+									fsync_sleep);
+						}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..da68900 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,7 +58,7 @@ typedef struct f_smgr
 											  BlockNumber nblocks);
 	void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void (*smgr_pre_ckpt) (void); /* may be NULL */
-	void (*smgr_sync) (void); /* may be NULL */
+	void (*smgr_sync) (int ckpt_flags, double ckpt_schedule_ratio); /* may be NULL */
 	void (*smgr_post_ckpt) (void); /* may be NULL */
 } f_smgr;
 
@@ -708,14 +708,18 @@ smgrpreckpt(void)
  * smgrsync() -- Sync files to disk during checkpoint.
  */
 void
-smgrsync(void)
+smgrsync(int ckpt_flags, double ckpt_schedule_ratio)
 {
 	int i;
 
+	/*
+	 * XXX: If we ever have more than one smgr, the remaining progress
+	 * should somehow be divided among all smgrs.
+	 */
 	for (i = 0; i < NSmgr; i++)
 	{
 		if (smgrsw[i].smgr_sync)
-			(*(smgrsw[i].smgr_sync)) ();
+			(*(smgrsw[i].smgr_sync)) (ckpt_flags, ckpt_schedule_ratio);
 	}
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..a240c43 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("If a file fsync time over this threshold, checkpointer sleep file_fsync_time * checkpointer_fsync_delay_ratio."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&CheckPointerFsyncDelayThreshold,
+		-1, -1, 1000000,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2562,16 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+			gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+			NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 2.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..b4b3a9d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -186,6 +186,8 @@
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 2.0
+#checkpointer_fsync_delay_threshold = -1	# range 0 - 1000000 milliseconds. -1 is disable.
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..ab266d6 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,7 +23,9 @@
 extern int BgWriterDelay;
 extern int CheckPointTimeout;
 extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +33,8 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
+extern bool IsCheckpointOnSchedule(double progress);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..d68b950 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -100,7 +100,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
-extern void smgrsync(void);
+extern void smgrsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void smgrpostckpt(void);
 extern void AtEOXact_SMgr(void);
 
@@ -126,7 +126,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
-extern void mdsync(void);
+extern void mdsync(int ckpt_flags, double ckpt_schedule_ratio);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
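
The sleep condition in the patch above is also gated on the checkpoint flags: no sleep happens for an immediate, forced, or end-of-recovery checkpoint. A minimal sketch of just that flag test, assuming placeholder bit values; the SKETCH_ macros and the function fsync_delay_allowed are hypothetical stand-ins, and the real CHECKPOINT_* bits come from PostgreSQL's own headers:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Placeholder flag bits for illustration only. They stand in for
 * CHECKPOINT_END_OF_RECOVERY, CHECKPOINT_IMMEDIATE and CHECKPOINT_FORCE;
 * the real values are defined by PostgreSQL, not here.
 */
#define SKETCH_CHECKPOINT_END_OF_RECOVERY	0x0002
#define SKETCH_CHECKPOINT_IMMEDIATE		0x0004
#define SKETCH_CHECKPOINT_FORCE			0x0008

/*
 * Hypothetical helper: true if the flags of the running checkpoint
 * permit sleeping between file fsyncs.
 */
static bool
fsync_delay_allowed(int ckpt_flags)
{
	const int	no_delay_mask = SKETCH_CHECKPOINT_IMMEDIATE |
								SKETCH_CHECKPOINT_FORCE |
								SKETCH_CHECKPOINT_END_OF_RECOVERY;

	return (ckpt_flags & no_delay_mask) == 0;
}
```

Under this rule the call mdsync(CHECKPOINT_IMMEDIATE, 0.0) in SetForwardFsyncRequests() never sleeps, whatever the GUC settings are.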
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers