Hi,

I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(

The attached patches are rebased to the master and additional one
mentioned below.

At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paqu...@gmail.com> 
wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkyv_45d...@mail.gmail.com>

> A couple of months back is has been reported to pgsql-bugs that WAL
> segments were always switched with a low value of archive_timeout even
> if a system is completely idle:
> http://www.postgresql.org/message-id/20151016203031.3019.72...@wrigleys.postgresql.org
> In short, a closer look at the problem has showed up that the logic in
> charge of checking if a checkpoint should be skipped or not is
> currently broken, because it completely ignores standby snapshots in
> its calculation of the WAL activity. So a checkpoint always occurs
> after checkpoint_timeout on an idle system since hot_standby has been
> introduced as wal_level. This did not get better from 9.4, since
> standby snapshots are logged every 15s by the background writer
> process. In 9.6, since wal_level = 'archive' and 'hot_standby'
> actually has the same meaning, the skip logic that worked with
> wal_level = 'archive' does not do its job anymore.
> 
> One solution that has been discussed is to track the progress of WAL
> activity when doing record insertion by being able to mark some
> records as not updating the progress of WAL. Standby snapshot records
> enter in this category, making the checkpoint skip logic more robust.
> Attached is a patch implementing a solution for it, by adding in

If I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoint and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoint as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the lastest state without
it. If it is not wrong, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.

> WALInsertLock a new field that gets updated for each record to track
> the LSN progress. This allows to reliably skip the generation of
> standby snapshots in the bgwriter or checkpoints on an idle system.

WALInsertLock doesn't seem to me to be a good place for
progressAt even considering the discussion about adding few
instructions (containing a branch) in the
hot-section. BackgroundWriterMain blocks all running
XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
replica, though). If this is correct, the Amit's suggestion below
will have more significance, and I rather agree with it. XLogCtl
is more suitable place for progressAt for the case.

https://www.postgresql.org/message-id/caa4ek1lb9hdq+f8lw8bgrqx6sw42xaikx1uq2dzk+auegbf...@mail.gmail.com
Amit> Taking and releasing locks again and again (which is done in patch)
Amit> matters much more than adding few instructions under it and I think
Amit> if you would have written the code in-a-way as in patch in some of
Amit> the critical path, it would have been regressed badly, but because
Amit> checkpoint doesn't happen often, reproducing regression is difficult.


https://www.postgresql.org/message-id/cab7npqso6hvj0t6lut84pcy+_ihitdt64ze2d+sjrhzy72y...@mail.gmail.com
> > Also, I think it is worth to once take the performance data for
> > write tests (using pgbench 30 minute run or some other way) with
> > minimum checkpoint_timeout (i.e 30s) to see if the additional locking
> > has any impact on performance.  I think taking locks at intervals
> > of 15s or 30s should not matter much, but it is better to be safe.
> 
> I don't think performance will be impacted, but there are no reasons
> to not do any measurements either. I'll try to get some graphs
> tomorrow with runs on my laptop, mainly looking for any effects of
> this patch on TPS when checkpoints show up.

I don't think the impact is measurable on a laptop, where 2 to 4
cores available at most.

> Per discussion with Andres at PGcon, we decided that this is an
> optimization, only for 9.7~ because this has been broken for a long
> time. I have also changed XLogIncludeOrigin() to use a more generic
> routine to set of status flags for a record being inserted:
> XLogSetFlags(). This routine can use two flags:
> - INCLUDE_ORIGIN to decide if the origin should be logged or not
> - NO_PROGRESS to decide at insertion if a record should update the LSN
> progress or not.
> Andres mentioned me that we'd want to have something similar to
> XLogIncludeOrigin, but while hacking I noticed that grouping both
> things under the same umbrella made more sense.

This looks reasonable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 686e4981c0d7ab3dd9e919f8b203aeb475f89a3b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 10:19:55 +0900
Subject: [PATCH 1/3] hs-checkpoints-v12-1

Rebased version of v11-1
---
 src/backend/access/heap/heapam.c          |  10 +--
 src/backend/access/transam/xact.c         |   2 +-
 src/backend/access/transam/xlog.c         | 127 +++++++++++++++++++++++++-----
 src/backend/access/transam/xloginsert.c   |  15 ++--
 src/backend/postmaster/bgwriter.c         |  22 +++---
 src/backend/postmaster/checkpointer.c     |   1 +
 src/backend/replication/logical/message.c |   2 +-
 src/backend/storage/ipc/standby.c         |   6 +-
 src/include/access/xlog.h                 |  12 ++-
 src/include/access/xloginsert.h           |   2 +-
 10 files changed, 154 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1b9a97..289d240 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,30 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values indicate the insertion progress used to determine
+ * WAL insertion activity since a previous checkpoint, which is aimed at
+ * finding out if a checkpoint should be skipped or not or if standby
+ * activity should be logged. Progress position is basically updated
+ * for all types of records, for the time being only snapshot logging
+ * is out of this scope to properly skip their logging on idle systems.
+ * Tracking the WAL activity directly in WALInsertLock has the advantage
+ * to not rely on taking an exclusive lock on all the WAL insertion locks,
+ * hence reducing the impact of the activity lookup. This takes also
+ * advantage to avoid 8-byte torn reads on some platforms by using the
+ * fact that each insert lock is located on the same cache line.
+ * XXX: There is still room for more improvements here, particularly
+ * WAL operations related to unlogged relations (INIT_FORKNUM) should not
+ * update the progress LSN as those relations are reset during crash
+ * recovery so enforcing buffers of such relations to be flushed for
+ * example in the case of a load only on unlogged relations is a waste
+ * of disk write.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -882,6 +901,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives more in-depth control on the record being inserted. As of
+ * now, this controls if the progress LSN positions are updated.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -894,7 +916,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -993,6 +1017,25 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the progress LSN positions. At least one WAL insertion lock
+	 * is already taken appropriately before doing that, and it is just more
+	 * simple to do that here where WAL record data and type is at hand.
+	 * The progress is set at the start position of the record tracked that
+	 * is being added, making easier checkpoint progress tracking as the
+	 * control file already saves the start LSN position of the last
+	 * checkpoint run. If an exclusive lock is taken for WAL insertion,
+	 * there is actually no need to update all the progression fields, so
+	 * just do it on the first one.
+	 */
+	if ((flags & XLOG_NO_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -4716,6 +4759,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7992,6 +8036,55 @@ GetFlushRecPtr(void)
 }
 
 /*
+ * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
+ * at the last significant WAL activity, or in other words any activity
+ * not referring to standby logging as of now. Finding the last activity
+ * position is done by scanning each WAL insertion lock by taking directly
+ * the light-weight lock associated to it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion. An exclusive lock is taken because currently the
+	 * locking logic for WAL insertion only expects such a level of locking.
+	 * Taking a lock is as well necessary to prevent potential torn reads
+	 * on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
  * Get the time of the last xlog segment switch
  */
 pg_time_t
@@ -8251,7 +8344,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8332,34 +8425,30 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks to shorten the locked
+	 * section waiting ahead.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been no
+	 * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
+	 * duplicate checkpoints when the system is idle. That wastes log space,
+	 * and more importantly it exposes us to possible loss of both current and
+	 * previous checkpoint records if the machine crashes just as we're writing
+	 * the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..23f1e67 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 status_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	status_flags = 0;
 	begininsert_called = false;
 }
 
@@ -387,10 +387,10 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
  * Should this record include the replication origin if one is set up?
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+	status_flags = flags;
 }
 
 /*
@@ -450,7 +450,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, status_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +701,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((status_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1002034..3a791eb 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,23 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * only log if enough time has passed, that some WAL activity
+			 * has happened since last checkpoint, and that some xlog record
+			 * has been inserted.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d702a48..a729a3d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -603,6 +603,7 @@ CheckArchiveTimeout(void)
 		XLogRecPtr	switchpoint;
 
 		/* OK, it's time to switch */
+		elog(LOG, "Request XLog Switch");
 		switchpoint = RequestXLogSwitch();
 
 		/*
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 547f1a8..9774155 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -963,7 +963,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -987,6 +988,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_NO_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1034,6 +1037,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_NO_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..dbd4cff 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the record currently inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the origin */
+#define XLOG_NO_PROGRESS	0x02	/* do not update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
-- 
2.9.2

>From 676ab7c15ccb99e4e18cb1aceef60795223ab569 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 16:21:54 +0900
Subject: [PATCH 2/3] hs-checkpoints-v12-2

Make XLOG_SWITCH NO_PROGRESS and manage log switching LSN to avoid
excessive log switching and checkpoints.
---
 src/backend/access/transam/xlog.c     |  2 ++
 src/backend/postmaster/checkpointer.c | 40 +++++++++++++++++++++++++----------
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 289d240..a582759 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9191,6 +9191,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_NO_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a729a3d..4b7ff4b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -600,20 +600,38 @@ CheckArchiveTimeout(void)
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		elog(LOG, "Request XLog Switch");
-		switchpoint = RequestXLogSwitch();
+		static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
 
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * switch segment only when any substantial progress have made from
+		 * the last segment switching by timeout. Segment switching by other
+		 * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
+		 * matter since it just causes possible one excessive segment switch.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_xlog_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			/* OK, it's time to switch */
+			elog(LOG, "Request XLog Switch");
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+
+			/*
+			 * This switchpoint is not the LSN for the next XLOG record but
+			 * just after this log switch record. But either will do for
+			 * comparing with GetProgressRecPtr().
+			 */
+			last_xlog_switch_lsn = switchpoint;
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
-- 
2.9.2

>From 6cf5801882e1206c1da11b4627175f64b5a4ca97 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Tue, 27 Sep 2016 10:20:12 +0900
Subject: [PATCH 3/3] hs-checkpoints-v12-3

Rebased version of v11-2. Several debugging logs.
---
 src/backend/access/transam/xlog.c | 10 +++++++++-
 src/backend/postmaster/bgwriter.c |  4 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a582759..3795037 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8448,8 +8448,12 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
+		elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X, ckpt %X/%X",
+			 (uint32) (progress_lsn >> 32), (uint32) progress_lsn,
+			 (uint32) (ControlFile->checkPoint >> 32), (uint32) ControlFile->checkPoint);
 		if (progress_lsn == ControlFile->checkPoint)
 		{
+			elog(LOG, "Checkpoint is skipped");
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -8616,7 +8620,11 @@ CreateCheckPoint(int flags)
 	 * recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
-		LogStandbySnapshot();
+	{
+		XLogRecPtr lsn = LogStandbySnapshot();
+		elog(LOG, "snapshot taken by checkpoint %X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
+	}
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3a791eb..7637a1d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -331,7 +331,9 @@ BackgroundWriterMain(void)
 				GetLastCheckpointRecPtr() < current_progress_lsn &&
 				last_progress_lsn < current_progress_lsn)
 			{
-				(void) LogStandbySnapshot();
+				XLogRecPtr lsn = LogStandbySnapshot();
+				elog(LOG, "snapshot taken by bgwriter %X/%X",
+					 (uint32) (lsn >> 32), (uint32) lsn);
 				last_snapshot_ts = now;
 				last_progress_lsn = current_progress_lsn;
 			}
-- 
2.9.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to