On 05/27/2015 12:26 AM, Jeff Janes wrote:
> On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
>> On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>> One of the points of max_wal_size and its predecessor is to limit how big
>>> pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
>>> during archive recovery than it is during normal operations. So why
>>> shouldn't max_wal_size be active during recovery?
>>
>> The following message of commit 7181530 explains why.
>>
>>     In standby mode, respect checkpoint_segments in addition to
>>     checkpoint_timeout to trigger restartpoints. We used to deliberately only
>>     do time-based restartpoints, because if checkpoint_segments is small we
>>     would spend time doing restartpoints more often than really necessary.
>>     But now that restartpoints are done in bgwriter, they're not as
>>     disruptive as they used to be. Secondly, because streaming replication
>>     stores the streamed WAL files in pg_xlog, we want to clean it up more
>>     often to avoid running out of disk space when checkpoint_timeout is large
>>     and checkpoint_segments small.
>>
>> Previously users were more likely to fall into this trouble (i.e., too frequent
>> occurrence of restartpoints) because the default value of checkpoint_segments
>> was very small, I guess. But we increased the default of max_wal_size, so now
>> the risk of that trouble seems to be smaller than before, and maybe we can
>> allow max_wal_size to trigger restartpoints.
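
(For some context on the defaults, assuming I have the numbers right:

    old: checkpoint_segments = 3   ->  only ~3 * 16 MB = 48 MB of WAL per cycle
    new: max_wal_size = 1GB        ->  64 segments of 16 MB

so the segment-based trigger is much less likely to fire annoyingly often
with the new defaults.)
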
>
> I see. The old behavior was present for the same reason we decided to split
> checkpoint_segments into max_wal_size and min_wal_size.
>
> That is, the default checkpoint_segments was small, and it had to be small
> because increasing it would cause more space to be used even when that
> extra space was not helpful.
>
> So perhaps we can consider this change a completion of the max_wal_size
> work, rather than a new feature?

Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments
made sense when we initially did that, but it has gradually become less
and less sensible since then, as we got streaming replication, and as we
started to keep all restored segments in pg_xlog even in archive recovery.

> It seems to be a trivial change to implement that, although I might be
> overlooking something subtle (pasted below, also attached)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
>           * Request a restartpoint if we've replayed too much xlog since the
>           * last one.
>           */
> -        if (StandbyModeRequested && bgwriterLaunched)
> +        if (bgwriterLaunched)
>          {
>              if (XLogCheckpointNeeded(readSegNo))
>              {
>
> This keeps pg_xlog at about 67% of max_wal_size during archive recovery
> (because checkpoint_completion_target is accounted for but goes unused)

Hmm. checkpoint_completion_target is used when determining progress
against checkpoint_timeout just fine, but the problem is that if you do
just the above, IsCheckpointOnSchedule() still won't consider consumed
WAL when it determines whether the restartpoint is "on time". So the
error is in the other direction: if you set max_wal_size to a small
value, and checkpoint_timeout to a large value, the restartpoint would
think that it has plenty of time to complete, and exceed max_wal_size.
We need to fix IsCheckpointOnSchedule() to also track progress against
max_wal_size during recovery.
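
(Back-of-the-envelope for the ~67% figure above, if I recall the 9.5 sizing
rule correctly:

    CheckPointSegments ~= max_wal_size / (2 + checkpoint_completion_target)

If restartpoints are triggered every CheckPointSegments of replayed WAL but
the completion-target spread goes unused, pg_xlog holds roughly two cycles'
worth of WAL, i.e. about 2 / (2 + checkpoint_completion_target) of
max_wal_size: about 80% at the default 0.5 and about 69% at 0.9, which is
the same ballpark as the figure quoted above.)
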
I came up with the attached patch as a first attempt. It makes the same
logic that decides whether a checkpoint is on schedule apply to
restartpoints during recovery as well. But there's a little problem (also
explained in a comment in the patch):

There is a large gap between a checkpoint's redo-pointer and the
checkpoint record itself (determined by checkpoint_completion_target).
When we're not in recovery, we set the redo-pointer for the current
checkpoint first, then start flushing data, and finally write the
checkpoint record. The logic in IsCheckpointOnSchedule() calculates a)
how much WAL has been generated since the beginning of the checkpoint,
i.e. its redo-pointer, and b) what fraction of shared_buffers has been
flushed to disk. But in recovery, we only start the restartpoint after
replaying the checkpoint record, so at the beginning of a restartpoint,
we're actually already behind schedule by the amount of WAL between the
redo-pointer and the record itself.
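
Schematically (not to scale):

    WAL:  ----R--------------------------------C------------------->
              ^ redo pointer                   ^ checkpoint record

On the primary the checkpoint starts at R and the flushing is spread out
until C is written; in recovery we only learn about the checkpoint when we
replay C, so the restartpoint starts already the R..C distance behind.
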
I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints,
but you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.
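
In other words, roughly (just a sketch of the idea, not actual code):

    primary, at checkpoint start:
        write a small "checkpoint starting" record; its LSN becomes the redo pointer
        flush dirty buffers, spread out over checkpoint_completion_target
        write the checkpoint record, with its redo pointer set to the marker's LSN

    standby, during replay:
        when the marker record is replayed, start the restartpoint right away,
        instead of waiting for the checkpoint record itself
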
I think the attached is better than nothing, but I'll take a look at
that beginning-of-checkpoint idea. It might be too big a change to do at
this point, but I'd really like to fix this properly for 9.5, since
we've changed the way checkpoints are scheduled anyway.

- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..cc90ae6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10961,7 +10961,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
          * Request a restartpoint if we've replayed too much xlog since the
          * last one.
          */
-        if (StandbyModeRequested && bgwriterLaunched)
+        if (bgwriterLaunched)
         {
             if (XLogCheckpointNeeded(readSegNo))
             {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..bc49ee4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -475,10 +475,12 @@ CheckpointerMain(void)
 
             /*
              * Initialize checkpointer-private variables used during
-             * checkpoint
+             * checkpoint.
              */
             ckpt_active = true;
-            if (!do_restartpoint)
+            if (do_restartpoint)
+                ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
+            else
                 ckpt_start_recptr = GetInsertRecPtr();
             ckpt_start_time = now;
             ckpt_cached_elapsed = 0;
@@ -720,7 +722,7 @@ CheckpointWriteDelay(int flags, double progress)
 
 /*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- * in time?
+ * (or restartpoint) in time?
  *
  * Compares the current progress against the time/segments elapsed since last
  * checkpoint, and returns true if the progress we've made this far is greater
@@ -757,17 +759,28 @@ IsCheckpointOnSchedule(double progress)
      * compares against RedoRecptr, so this is not completely accurate.
      * However, it's good enough for our purposes, we're only calculating an
      * estimate anyway.
+     *
+     * During recovery, we compare last replayed WAL record's location with
+     * the location computed before calling CreateRestartPoint. That maintains
+     * the same pacing as we have during checkpoints in normal operation, but
+     * we might exceed max_wal_size by a fair amount. That's because there can
+     * be a large gap between a checkpoint's redo-pointer and the checkpoint
+     * record itself, and we only start the restartpoint after we've seen the
+     * checkpoint record. (The gap is typically up to CheckPointSegments *
+     * checkpoint_completion_target where checkpoint_completion_target is the
+     * value that was in effect when the WAL was generated).
      */
-    if (!RecoveryInProgress())
-    {
+    if (RecoveryInProgress())
+        recptr = GetXLogReplayRecPtr(NULL);
+    else
         recptr = GetInsertRecPtr();
-        elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
+    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
-        if (progress < elapsed_xlogs)
-        {
-            ckpt_cached_elapsed = elapsed_xlogs;
-            return false;
-        }
+    if (progress < elapsed_xlogs)
+    {
+        elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f", progress, elapsed_xlogs);
+        ckpt_cached_elapsed = elapsed_xlogs;
+        return false;
     }
 
     /*
@@ -779,11 +792,13 @@ IsCheckpointOnSchedule(double progress)
 
     if (progress < elapsed_time)
     {
+        elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
         ckpt_cached_elapsed = elapsed_time;
         return false;
     }
 
     /* It looks like we're on schedule. */
+    elog(LOG, "on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
     return true;
 }
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers