On 05/27/2015 12:26 AM, Jeff Janes wrote:
> On Thu, May 21, 2015 at 8:40 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
>> On Thu, May 21, 2015 at 3:53 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>> One of the points of max_wal_size and its predecessor is to limit how big
>>> pg_xlog can grow. But running out of disk space on pg_xlog is no more fun
>>> during archive recovery than it is during normal operations. So why
>>> shouldn't max_wal_size be active during recovery?
>>
>> The following message of commit 7181530 explains why.
>>
>>     In standby mode, respect checkpoint_segments in addition to
>>     checkpoint_timeout to trigger restartpoints. We used to deliberately only
>>     do time-based restartpoints, because if checkpoint_segments is small we
>>     would spend time doing restartpoints more often than really necessary.
>>     But now that restartpoints are done in bgwriter, they're not as
>>     disruptive as they used to be. Secondly, because streaming replication
>>     stores the streamed WAL files in pg_xlog, we want to clean it up more
>>     often to avoid running out of disk space when checkpoint_timeout is large
>>     and checkpoint_segments small.
>>
>> Previously users were more likely to fall into this trouble (i.e., too frequent
>> occurrence of restartpoints) because the default value of checkpoint_segments
>> was very small, I guess. But we increased the default of max_wal_size, so now
>> the risk of that trouble seems to be smaller than before, and maybe we can
>> allow max_wal_size to trigger restartpoints.
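
(For some context on the defaults, assuming I have the numbers right:

    old: checkpoint_segments = 3   ->  only ~3 * 16 MB = 48 MB of WAL per cycle
    new: max_wal_size = 1GB        ->  64 segments of 16 MB

so the segment-based trigger is much less likely to fire annoyingly often
with the new defaults.)
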
>
> I see. The old behavior was present for the same reason we decided to split
> checkpoint_segments into max_wal_size and min_wal_size.
>
> That is, the default checkpoint_segments was small, and it had to be small
> because increasing it would cause more space to be used even when that
> extra space was not helpful.
>
> So perhaps we can consider this change a completion of the max_wal_size
> work, rather than a new feature?

Yeah, I'm inclined to change the behaviour. Ignoring checkpoint_segments
made sense when we initially did that, but it has gradually become less
and less sensible since then, as we got streaming replication, and as we
started to keep all restored segments in pg_xlog even in archive recovery.

> It seems to be a trivial change to implement that, although I might be
> overlooking something subtle (pasted below, also attached)
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -10946,7 +10946,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
>           * Request a restartpoint if we've replayed too much xlog since the
>           * last one.
>           */
> -        if (StandbyModeRequested && bgwriterLaunched)
> +        if (bgwriterLaunched)
>          {
>              if (XLogCheckpointNeeded(readSegNo))
>              {
>
> This keeps pg_xlog at about 67% of max_wal_size during archive recovery
> (because checkpoint_completion_target is accounted for but goes unused)

Hmm. checkpoint_completion_target is used when determining progress
against checkpoint_timeout just fine, but the problem is that if you do
just the above, IsCheckpointOnSchedule() still won't consider consumed
WAL when it determines whether the restartpoint is "on time". So the
error is in the other direction: if you set max_wal_size to a small
value, and checkpoint_timeout to a large value, the restartpoint would
think that it has plenty of time to complete, and exceed max_wal_size.
We need to fix IsCheckpointOnSchedule() to also track progress against
max_wal_size during recovery.
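
(Back-of-the-envelope for the ~67% figure above, if I recall the 9.5 sizing
rule correctly:

    CheckPointSegments ~= max_wal_size / (2 + checkpoint_completion_target)

If restartpoints are triggered every CheckPointSegments of replayed WAL but
the completion-target spread goes unused, pg_xlog holds roughly two cycles'
worth of WAL, i.e. about 2 / (2 + checkpoint_completion_target) of
max_wal_size: about 80% at the default 0.5 and about 69% at 0.9, which is
the same ballpark as the figure quoted above.)
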
I came up with the attached patch as a first attempt. It makes the same
logic that decides whether a checkpoint is on schedule apply to
restartpoints during recovery as well. But there's a little problem (also
explained in a comment in the patch):

There is a large gap between a checkpoint's redo-pointer and the
checkpoint record itself (determined by checkpoint_completion_target).
When we're not in recovery, we set the redo-pointer for the current
checkpoint first, then start flushing data, and finally write the
checkpoint record. The logic in IsCheckpointOnSchedule() calculates a)
how much WAL has been generated since the beginning of the checkpoint,
i.e. its redo-pointer, and b) what fraction of shared_buffers has been
flushed to disk. But in recovery, we only start the restartpoint after
replaying the checkpoint record, so at the beginning of a restartpoint,
we're actually already behind schedule by the amount of WAL between the
redo-pointer and the record itself.
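
Schematically (not to scale):

    WAL:  ----R--------------------------------C------------------->
              ^ redo pointer                   ^ checkpoint record

On the primary the checkpoint starts at R and the flushing is spread out
until C is written; in recovery we only learn about the checkpoint when we
replay C, so the restartpoint starts already the R..C distance behind.
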
I'm not sure what to do about this. With the attached patch, you get the
same leisurely pacing with restartpoints as you get with checkpoints,
but you exceed max_wal_size during recovery, by the amount determined by
checkpoint_completion_target. Alternatively, we could try to perform
restartpoints faster than checkpoints, but then you'll get nasty
checkpoint I/O storms in recovery.

A bigger change would be to write a WAL record at the beginning of a
checkpoint. It wouldn't do anything else, but it would be a hint to
recovery that there's going to be a checkpoint record later whose
redo-pointer will point to that record. We could then start the
restartpoint at that record already, before seeing the checkpoint record
itself.
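
In other words, roughly (just a sketch of the idea, not actual code):

    primary, at checkpoint start:
        write a small "checkpoint starting" record; its LSN becomes the redo pointer
        flush dirty buffers, spread out over checkpoint_completion_target
        write the checkpoint record, with its redo pointer set to the marker's LSN

    standby, during replay:
        when the marker record is replayed, start the restartpoint right away,
        instead of waiting for the checkpoint record itself
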
I think the attached is better than nothing, but I'll take a look at
that beginning-of-checkpoint idea. It might be too big a change to do at
this point, but I'd really like to fix this properly for 9.5, since
we've changed the way checkpoints are scheduled anyway.

- Heikki

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..cc90ae6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10961,7 +10961,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
          * Request a restartpoint if we've replayed too much xlog since the
          * last one.
          */
-        if (StandbyModeRequested && bgwriterLaunched)
+        if (bgwriterLaunched)
         {
             if (XLogCheckpointNeeded(readSegNo))
             {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0dce6a8..bc49ee4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -475,10 +475,12 @@ CheckpointerMain(void)
 
             /*
              * Initialize checkpointer-private variables used during
-             * checkpoint
+             * checkpoint.
              */
             ckpt_active = true;
-            if (!do_restartpoint)
+            if (do_restartpoint)
+                ckpt_start_recptr = GetXLogReplayRecPtr(NULL);
+            else
                 ckpt_start_recptr = GetInsertRecPtr();
             ckpt_start_time = now;
             ckpt_cached_elapsed = 0;
@@ -720,7 +722,7 @@ CheckpointWriteDelay(int flags, double progress)
 
 /*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- * in time?
+ * (or restartpoint) in time?
  *
  * Compares the current progress against the time/segments elapsed since last
  * checkpoint, and returns true if the progress we've made this far is greater
@@ -757,17 +759,28 @@ IsCheckpointOnSchedule(double progress)
      * compares against RedoRecptr, so this is not completely accurate.
      * However, it's good enough for our purposes, we're only calculating an
      * estimate anyway.
+     *
+     * During recovery, we compare last replayed WAL record's location with
+     * the location computed before calling CreateRestartPoint. That maintains
+     * the same pacing as we have during checkpoints in normal operation, but
+     * we might exceed max_wal_size by a fair amount. That's because there can
+     * be a large gap between a checkpoint's redo-pointer and the checkpoint
+     * record itself, and we only start the restartpoint after we've seen the
+     * checkpoint record. (The gap is typically up to CheckPointSegments *
+     * checkpoint_completion_target where checkpoint_completion_target is the
+     * value that was in effect when the WAL was generated).
      */
-    if (!RecoveryInProgress())
-    {
+    if (RecoveryInProgress())
+        recptr = GetXLogReplayRecPtr(NULL);
+    else
         recptr = GetInsertRecPtr();
-        elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
+    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
-        if (progress < elapsed_xlogs)
-        {
-            ckpt_cached_elapsed = elapsed_xlogs;
-            return false;
-        }
+    if (progress < elapsed_xlogs)
+    {
+        elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f", progress, elapsed_xlogs);
+        ckpt_cached_elapsed = elapsed_xlogs;
+        return false;
     }
 
     /*
@@ -779,11 +792,13 @@ IsCheckpointOnSchedule(double progress)
 
     if (progress < elapsed_time)
     {
+        elog(LOG, "not on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
         ckpt_cached_elapsed = elapsed_time;
         return false;
     }
 
     /* It looks like we're on schedule. */
+    elog(LOG, "on schedule, progress: %f elapsed_xlogs: %f elapsed_time %f", progress, elapsed_xlogs, elapsed_time);
     return true;
 }
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers