On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
>
>> + * This takes also
>> + * advantage to avoid 8-byte torn reads on some platforms by using the
>> + * fact that each insert lock is located on the same cache line.
>
> Something residing on the same cache line doesn't provide that guarantee
> on all platforms.

OK. Let's remove it then.
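
To illustrate, a rough standalone sketch (plain C with pthreads, made-up
names, not PostgreSQL code) of what taking the lock actually buys us here:
sharing a cache line by itself does not prevent torn 8-byte reads on
platforms without atomic 64-bit loads, so the reader has to take the same
lock as the writer.

#include <pthread.h>
#include <stdint.h>

typedef struct
{
	pthread_mutex_t lock;		/* stand-in for a WAL insertion lock */
	uint64_t	progressAt;		/* stand-in for an XLogRecPtr */
} insert_slot;

static uint64_t
read_progress(insert_slot *slot)
{
	uint64_t	val;

	pthread_mutex_lock(&slot->lock);
	val = slot->progressAt;		/* the writer updates this under the same lock */
	pthread_mutex_unlock(&slot->lock);
	return val;
}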

>> + * XXX: There is still room for more improvements here, particularly
>> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not
>> + * update the progress LSN as those relations are reset during crash
>> + * recovery so enforcing buffers of such relations to be flushed for
>> + * example in the case of a load only on unlogged relations is a waste
>> + * of disk write.
>
> Commit records still have to be written, everything else doesn't write
> WAL. So I'm doubtful this matters much?

Hm, okay. In most cases this may not matter... Let's rip that out then.

>> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr 
>> fpw_lsn)
>>               inserted = true;
>>       }
>>
>> +     /*
>> +      * Update the LSN progress positions. At least one WAL insertion lock
>> +      * is already taken appropriately before doing that, and it is simpler
>> +      * to do that here when the WAL record data and type are at hand.
>
> But we don't use the "WAL record data and type"?

Yes, at some point this patch did so...

>> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in
>> + * other words any activity not referring to standby logging or segment
>> + * switches. Finding the last activity position is done by scanning each
>> + * WAL insertion lock by taking directly the light-weight lock associated
>> + * to it.
>> + */
>
> I'd prefer not to list the specific records here - that's just
> guaranteed to get out of date. Why not say something "any activity not
> requiring a checkpoint to be triggered" or such?

OK. Makes sense to minimize maintenance.

>> +      * If this isn't a shutdown or forced checkpoint, and if there has 
>> been no
>> +      * WAL activity, skip the checkpoint.  The idea here is to avoid 
>> inserting
>> +      * duplicate checkpoints when the system is idle. That wastes log 
>> space,
>> +      * and more importantly it exposes us to possible loss of both current 
>> and
>> +      * previous checkpoint records if the machine crashes just as we're 
>> writing
>> +      * the update.
>
> Shouldn't this mention archiving and also that we also ignore some forms
> of WAL activity?

I have reworded that as:
"If this isn't a shutdown or forced checkpoint, and if there has been
no WAL activity requiring a checkpoint, skip it."

>> -/* Should the in-progress insertion log the origin? */
>> -static bool include_origin = false;
>> +/* status flags of the in-progress insertion */
>> +static uint8 status_flags = 0;
>
> that seems a bit too generic a name. 'curinsert_flags'?

OK.

>>                       /*
>> -                      * only log if enough time has passed and some xlog 
>> record has
>> -                      * been inserted.
>> +                      * Only log if enough time has passed, that some WAL 
>> activity
>> +                      * has happened since last checkpoint, and that some 
>> new WAL
>> +                      * records have been inserted since the last time we 
>> came here.
>
> I think that sentence needs some polish.

Let's do this better:
            /*
-            * only log if enough time has passed and some xlog record has
-            * been inserted.
+            * Only log if all of the following conditions are satisfied:
+            * - enough time has passed since the last standby snapshot,
+            * - WAL activity has happened since the last checkpoint,
+            * - new WAL records have been inserted since the last
+            *   standby snapshot.
             */

>>                        */
>>                       if (now >= timeout &&
>> -                             last_snapshot_lsn != GetXLogInsertRecPtr())
>> +                             GetLastCheckpointRecPtr() < 
>> current_progress_lsn &&
>> +                             last_progress_lsn < current_progress_lsn)
>>                       {
>
> Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
> Don't we need to do the comparisons here (and when doing the checkpoint
> itself) with the REDO pointer of the last checkpoint?

Hm? The progress pointer points to the start of the last record
inserted, which is not the REDO pointer but the position of the
checkpoint record itself. Comparing against the REDO pointer would be
moot, because once the checkpoint completes the progress LSN is updated
as well. Or do you mean that the progress LSN should *not* be updated
for a checkpoint record? It seems to me that it should,
but...
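
To make that concrete, here is a rough standalone model of the skip test
as this patch implements it (plain C, made-up names, uint64_t standing in
for XLogRecPtr, not the patch itself):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;			/* stand-in for XLogRecPtr */

/*
 * progress_lsn is the start position of the newest record counted as
 * activity; last_ckpt_start is ControlFile->checkPoint, the start LSN of
 * the last checkpoint record.  Because inserting a checkpoint record also
 * advances the progress LSN to its own start position, equality means
 * that nothing but that checkpoint has been written since, so the new
 * checkpoint can be skipped.
 */
static bool
checkpoint_skippable(lsn_t progress_lsn, lsn_t last_ckpt_start, bool forced)
{
	if (forced)					/* shutdown, end-of-recovery, or forced */
		return false;
	return progress_lsn == last_ckpt_start;
}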

>> diff --git a/src/backend/postmaster/checkpointer.c 
>> b/src/backend/postmaster/checkpointer.c
>> index 397267c..7ecc00e 100644
>> --- a/src/backend/postmaster/checkpointer.c
>> +++ b/src/backend/postmaster/checkpointer.c
>> @@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
>>
>>  static pg_time_t last_checkpoint_time;
>>  static pg_time_t last_xlog_switch_time;
>> +static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
>
> Hm. Is it a good idea to use a static for this? Did you consider
> checkpointer restarts?

Indeed, I forgot about that, and the current approach is not solid. The
best way to do things, then, is to track the LSN position of the last
segment switch in XLogCtl.
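
Roughly like the following standalone sketch (made-up names standing in
for XLogCtl and its WALWriteLock-protected accessors, not actual
PostgreSQL code), so that the value survives a checkpointer restart:

#include <stdint.h>
#include <time.h>

typedef uint64_t lsn_t;			/* stand-in for XLogRecPtr */

typedef struct
{
	time_t		lastSegSwitchTime;	/* time of last xlog segment switch */
	lsn_t		lastSegSwitchLSN;	/* flush position at that switch */
} shared_xlog_state;

/* writer side: called once a segment switch has completed, lock held */
static void
note_segment_switch(shared_xlog_state *shared, lsn_t flushed_lsn)
{
	shared->lastSegSwitchTime = time(NULL);
	shared->lastSegSwitchLSN = flushed_lsn;
}

/* reader side: the checkpointer fetches both values together */
static time_t
get_last_seg_switch(shared_xlog_state *shared, lsn_t *switch_lsn)
{
	*switch_lsn = shared->lastSegSwitchLSN;
	return shared->lastSegSwitchTime;
}
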
-- 
Michael
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..d2a8ec2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2826,17 +2826,16 @@ include_dir 'conf.d'
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
-        Note that archived files that are closed early
-        due to a forced switch are still the same length as completely full
-        files.  Therefore, it is unwise to use a very short
-        <varname>archive_timeout</> &mdash; it will bloat your archive
-        storage.  <varname>archive_timeout</> settings of a minute or so are
-        usually reasonable.  You should consider using streaming replication,
-        instead of archiving, if you want data to be copied off the master
-        server more quickly than that.
+        including a single checkpoint.  However, checkpoints can be skipped
+        if there is no database activity, making this parameter safe to use
+        in environments that are idle for long periods of time.
+        Note that archived files that are closed early due to a forced switch
+        are still the same length as completely full files.  Therefore, it is
+        unwise to use a very short <varname>archive_timeout</> &mdash; it will
+        bloat your archive storage.  <varname>archive_timeout</> settings of
+        a minute or so are usually reasonable.  You should consider using
+        streaming replication, instead of archiving, if you want data to
+        be copied off the master server more quickly than that.
         This parameter can only be set in the
         <filename>postgresql.conf</> file or on the server command line.
        </para>
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b019bc1..ac40731 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2507,7 +2507,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 							heaptup->t_len - SizeofHeapTupleHeader);
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, info);
 
@@ -2846,7 +2846,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 			XLogRegisterBufData(0, tupledata, totaldatalen);
 
 			/* filtering by origin on a row level is much more efficient */
-			XLogIncludeOrigin();
+			XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 			recptr = XLogInsert(RM_HEAP2_ID, info);
 
@@ -3308,7 +3308,7 @@ l1:
 		}
 
 		/* filtering by origin on a row level is much more efficient */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
@@ -6035,7 +6035,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
 		XLogBeginInsert();
 
 		/* We want the same filtering on this as on a plain insert */
-		XLogIncludeOrigin();
+		XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
@@ -7703,7 +7703,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	}
 
 	/* filtering by origin on a row level is much more efficient */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	recptr = XLogInsert(RM_HEAP_ID, info);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..9130816 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5232,7 +5232,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 		XLogRegisterData((char *) (&xl_origin), sizeof(xl_xact_origin));
 
 	/* we allow filtering by xacts */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_XACT_ID, info);
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..894596b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -442,11 +442,22 @@ typedef struct XLogwrtResult
  * the WAL record is just copied to the page and the lock is released. But
  * to avoid the deadlock-scenario explained above, the indicator is always
  * updated before sleeping while holding an insertion lock.
+ *
+ * The progressAt values track the insertion progress, which is used to
+ * determine the WAL activity since the previous checkpoint and hence
+ * whether a checkpoint can be skipped or whether standby activity should
+ * be logged.  The progress position is updated for all record types;
+ * for the time being only standby logging and segment switches are out
+ * of this scope, so that they can be skipped on idle systems.
+ * Tracking the WAL activity directly in each WALInsertLock has the
+ * advantage that looking up the activity does not require holding all
+ * the WAL insertion locks at once, reducing the impact of the lookup.
  */
 typedef struct
 {
 	LWLock		lock;
 	XLogRecPtr	insertingAt;
+	XLogRecPtr	progressAt;
 } WALInsertLock;
 
 /*
@@ -542,8 +553,9 @@ typedef struct XLogCtlData
 	XLogRecPtr	unloggedLSN;
 	slock_t		ulsn_lck;
 
-	/* Time of last xlog segment switch. Protected by WALWriteLock. */
+	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
+	XLogRecPtr	lastSegSwitchLSN;
 
 	/*
 	 * Protected by info_lck and WALWriteLock (you must hold either lock to
@@ -885,6 +897,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
  *
+ * 'flags' gives finer control over the record being inserted; see
+ * XLogSetFlags() for the accepted values.
+ *
  * The first XLogRecData in the chain must be for the record header, and its
  * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
  * xl_crc fields in the header, the rest of the header must already be filled
@@ -897,7 +912,9 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
+XLogInsertRecord(XLogRecData *rdata,
+				 XLogRecPtr fpw_lsn,
+				 uint8 flags)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	pg_crc32c	rdata_crc;
@@ -997,6 +1014,23 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
 		inserted = true;
 	}
 
+	/*
+	 * Update the LSN progress position; at least one WAL insertion lock
+	 * is already held at this point.  Progress is set to the start
+	 * position of the record being inserted, which makes checkpoint
+	 * progress tracking easier as the control file already stores the
+	 * start LSN of the last checkpoint.  If all the insertion locks are
+	 * held exclusively, only the first progress field needs to be
+	 * updated.
+	 */
+	if ((flags & XLOG_SKIP_PROGRESS) == 0)
+	{
+		if (holdingAllLocks)
+			WALInsertLocks[0].l.progressAt = StartPos;
+		else
+			WALInsertLocks[MyLockNo].l.progressAt = StartPos;
+	}
+
 	if (inserted)
 	{
 		/*
@@ -2333,6 +2367,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					XLogArchiveNotifySeg(openLogSegNo);
 
 				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
 				/*
 				 * Request a checkpoint if we've consumed too much xlog since
@@ -4720,6 +4755,7 @@ XLOGShmemInit(void)
 	{
 		LWLockInitialize(&WALInsertLocks[i].l.lock, LWTRANCHE_WAL_INSERT);
 		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+		WALInsertLocks[i].l.progressAt = InvalidXLogRecPtr;
 	}
 
 	/*
@@ -7436,8 +7472,9 @@ StartupXLOG(void)
 	 */
 	InRecovery = false;
 
-	/* start the archive_timeout timer running */
+	/* start the archive_timeout timer and LSN running */
 	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
 
 	/* also initialize latestCompletedXid, to nextXid - 1 */
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
@@ -7999,16 +8036,66 @@ GetFlushRecPtr(void)
 }
 
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL activity position, that is
+ * the position of the latest record inserted, ignoring any activity that
+ * does not require a checkpoint to be triggered.  The last activity
+ * position is found by scanning each WAL insertion lock, taking the
+ * light-weight lock associated with it.
+ */
+XLogRecPtr
+GetProgressRecPtr(void)
+{
+	XLogRecPtr	res = InvalidXLogRecPtr;
+	int			i;
+
+	/*
+	 * Look at the latest LSN position referring to the activity done by
+	 * WAL insertion.  An exclusive lock is taken because the locking
+	 * logic for WAL insertion currently only expects that level of
+	 * locking.  Taking a lock is also necessary to prevent potential
+	 * torn reads on some platforms.
+	 */
+	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
+	{
+		XLogRecPtr	progress_lsn;
+
+		LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+		progress_lsn = WALInsertLocks[i].l.progressAt;
+		LWLockRelease(&WALInsertLocks[i].l.lock);
+
+		if (res < progress_lsn)
+			res = progress_lsn;
+	}
+
+	return res;
+}
+
+/*
+ * GetLastCheckpointRecPtr -- Returns the last checkpoint insert position.
+ */
+XLogRecPtr
+GetLastCheckpointRecPtr(void)
+{
+	XLogRecPtr	ckpt_lsn;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	ckpt_lsn = ControlFile->checkPoint;
+	LWLockRelease(ControlFileLock);
+	return ckpt_lsn;
+}
+
+/*
+ * Get the time and LSN of the last xlog segment switch
  */
 pg_time_t
-GetLastSegSwitchTime(void)
+GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 {
 	pg_time_t	result;
 
 	/* Need WALWriteLock, but shared lock is sufficient */
 	LWLockAcquire(WALWriteLock, LW_SHARED);
 	result = XLogCtl->lastSegSwitchTime;
+	*lastSwitchLSN = XLogCtl->lastSegSwitchLSN;
 	LWLockRelease(WALWriteLock);
 
 	return result;
@@ -8258,7 +8345,7 @@ CreateCheckPoint(int flags)
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
 	XLogRecPtr	curInsert;
-	XLogRecPtr	prevPtr;
+	XLogRecPtr	progress_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
@@ -8339,35 +8426,32 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
+	 * Get progress before acquiring insert locks, to shorten the time
+	 * spent in the locked section that follows.
+	 */
+	progress_lsn = GetProgressRecPtr();
+
+	/*
 	 * We must block concurrent insertions while examining insert state to
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
 	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-	prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
 
 	/*
-	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
-	 * any XLOG records since the start of the last checkpoint, skip the
-	 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
-	 * when the system is idle. That wastes log space, and more importantly it
-	 * exposes us to possible loss of both current and previous checkpoint
-	 * records if the machine crashes just as we're writing the update.
-	 * (Perhaps it'd make even more sense to checkpoint only when the previous
-	 * checkpoint record is in a different xlog page?)
-	 *
-	 * If the previous checkpoint crossed a WAL segment, however, we create
-	 * the checkpoint anyway, to have the latest checkpoint fully contained in
-	 * the new segment. This is for a little bit of extra robustness: it's
-	 * better if you don't need to keep two WAL segments around to recover the
-	 * checkpoint.
+	 * If this isn't a shutdown or forced checkpoint, and if there has been
+	 * no WAL activity requiring a checkpoint, skip it.  The idea here is to
+	 * avoid inserting duplicate checkpoints when the system is idle. That
+	 * wastes log space, and more importantly it exposes us to possible loss
+	 * of both current and previous checkpoint records if the machine crashes
+	 * just as we're writing the update.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (prevPtr == ControlFile->checkPointCopy.redo &&
-			prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
+		if (progress_lsn == ControlFile->checkPoint)
 		{
+			ereport(DEBUG1, (errmsg("checkpoint skipped")));
 			WALInsertLockRelease();
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
@@ -9133,6 +9217,8 @@ RequestXLogSwitch(void)
 
 	/* XLOG SWITCH has no data */
 	XLogBeginInsert();
+
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 	RecPtr = XLogInsert(RM_XLOG_ID, XLOG_SWITCH);
 
 	return RecPtr;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3cd273b..720c754 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -73,8 +73,8 @@ static XLogRecData *mainrdata_head;
 static XLogRecData *mainrdata_last = (XLogRecData *) &mainrdata_head;
 static uint32 mainrdata_len;	/* total # of bytes in chain */
 
-/* Should the in-progress insertion log the origin? */
-static bool include_origin = false;
+/* status flags of the in-progress insertion */
+static uint8 curinsert_flags = 0;
 
 /*
  * These are used to hold the record header while constructing a record.
@@ -201,7 +201,7 @@ XLogResetInsertion(void)
 	max_registered_block_id = 0;
 	mainrdata_len = 0;
 	mainrdata_last = (XLogRecData *) &mainrdata_head;
-	include_origin = false;
+	curinsert_flags = 0;
 	begininsert_called = false;
 }
 
@@ -384,13 +384,20 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
 }
 
 /*
- * Should this record include the replication origin if one is set up?
+ * Set insert status flags for the upcoming WAL record.
+ *
+ * The flags that can be used here are:
+ * - XLOG_INCLUDE_ORIGIN, to determine if the replication origin should be
+ *   included in the record.
+ * - XLOG_SKIP_PROGRESS, to not update the WAL progress trackers when
+ *   inserting the record.
  */
 void
-XLogIncludeOrigin(void)
+XLogSetFlags(uint8 flags)
 {
 	Assert(begininsert_called);
-	include_origin = true;
+
+	curinsert_flags = flags;
 }
 
 /*
@@ -450,7 +457,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
 								 &fpw_lsn);
 
-		EndPos = XLogInsertRecord(rdt, fpw_lsn);
+		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags);
 	} while (EndPos == InvalidXLogRecPtr);
 
 	XLogResetInsertion();
@@ -701,7 +708,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	}
 
 	/* followed by the record's origin, if any */
-	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
+	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) != 0 &&
+		replorigin_session_origin != InvalidRepOriginId)
 	{
 		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index c3f3356..172129f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -78,12 +78,12 @@ int			BgWriterDelay = 200;
 #define LOG_SNAPSHOT_INTERVAL_MS 15000
 
 /*
- * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
- * doing so too often or repeatedly if there has been no other write activity
- * in the system.
+ * Last progress LSN and timestamp at which we last logged a standby
+ * snapshot, to avoid doing so too often or repeatedly if there has been
+ * no other write activity in the system.
  */
 static TimestampTz last_snapshot_ts;
-static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
+static XLogRecPtr last_progress_lsn = InvalidXLogRecPtr;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -308,7 +308,7 @@ BackgroundWriterMain(void)
 		 * check whether there has been any WAL inserted since the last time
 		 * we've logged a running xacts.
 		 *
-		 * We do this logging in the bgwriter as its the only process that is
+		 * We do this logging in the bgwriter as it is the only process that is
 		 * run regularly and returns to its mainloop all the time. E.g.
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
@@ -317,19 +317,25 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_progress_lsn = GetProgressRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
 
 			/*
-			 * only log if enough time has passed and some xlog record has
-			 * been inserted.
+			 * Only log if all of the following conditions are satisfied:
+			 * - enough time has passed since the last standby snapshot,
+			 * - WAL activity has happened since the last checkpoint,
+			 * - new WAL records have been inserted since the last
+			 *   standby snapshot.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn != GetXLogInsertRecPtr())
+				GetLastCheckpointRecPtr() < current_progress_lsn &&
+				last_progress_lsn < current_progress_lsn)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
+				(void) LogStandbySnapshot();
 				last_snapshot_ts = now;
+				last_progress_lsn = current_progress_lsn;
 			}
 		}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 397267c..e3feb17 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -580,6 +580,7 @@ CheckArchiveTimeout(void)
 {
 	pg_time_t	now;
 	pg_time_t	last_time;
+	XLogRecPtr	last_switch_lsn;
 
 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
 		return;
@@ -594,26 +595,37 @@ CheckArchiveTimeout(void)
 	 * Update local state ... note that last_xlog_switch_time is the last time
 	 * a switch was performed *or requested*.
 	 */
-	last_time = GetLastSegSwitchTime();
+	last_time = GetLastSegSwitchData(&last_switch_lsn);
 
 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
 
 	/* Now we can do the real check */
 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
 	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
 		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
+		 * Only switch segments if WAL has made some progress since the
+		 * last segment switch caused by a timeout.  Segment switches done
+		 * for other reasons, such as a manual pg_switch_xlog() call or
+		 * this automatic switch itself, do not count as WAL progress.
+		 * Note that RequestXLogSwitch() may return the beginning of a
+		 * segment, which is fine as it prevents unnecessary switches
+		 * from happening.
 		 */
-		if ((switchpoint % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
+		if (GetProgressRecPtr() > last_switch_lsn)
+		{
+			XLogRecPtr	switchpoint;
+
+			switchpoint = RequestXLogSwitch();
+
+			/*
+			 * If the returned pointer points exactly to a segment boundary,
+			 * assume nothing happened.
+			 */
+			if ((switchpoint % XLogSegSize) != 0)
+				ereport(DEBUG1,
+						(errmsg("transaction log switch forced (archive_timeout=%d)",
+								XLogArchiveTimeout)));
+		}
 
 		/*
 		 * Update state in any case, so we don't retry constantly when the
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 8f9dc2f..c2d2bd8 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -73,7 +73,7 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 	XLogRegisterData((char *) message, size);
 
 	/* allow origin filtering */
-	XLogIncludeOrigin();
+	XLogSetFlags(XLOG_INCLUDE_ORIGIN);
 
 	return XLogInsert(RM_LOGICALMSG_ID, XLOG_LOGICAL_MESSAGE);
 }
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 875dcec..04ef7dd 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -964,7 +964,8 @@ LogStandbySnapshot(void)
  * The definitions of RunningTransactionsData and xl_xact_running_xacts
  * are similar. We keep them separate because xl_xact_running_xacts
  * is a contiguous chunk of memory and never exists fully until it is
- * assembled in WAL.
+ * assembled in WAL. Progress of WAL activity is not updated when
+ * this record is logged.
  */
 static XLogRecPtr
 LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
@@ -988,6 +989,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		XLogRegisterData((char *) CurrRunningXacts->xids,
 					   (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId));
 
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
+
 	recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);
 
 	if (CurrRunningXacts->subxid_overflow)
@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
 	XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
+	XLogSetFlags(XLOG_SKIP_PROGRESS);
 
 	(void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..aba00e2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -184,6 +184,12 @@ extern bool XLOG_DEBUG;
 #define CHECKPOINT_CAUSE_XLOG	0x0040	/* XLOG consumption */
 #define CHECKPOINT_CAUSE_TIME	0x0080	/* Elapsed time */
 
+/*
+ * Flag bits for the WAL record being inserted.
+ */
+#define XLOG_INCLUDE_ORIGIN	0x01	/* include the replication origin */
+#define XLOG_SKIP_PROGRESS	0x02	/* skip update progress LSN */
+
 /* Checkpoint statistics */
 typedef struct CheckpointStatsData
 {
@@ -211,7 +217,9 @@ extern CheckpointStatsData CheckpointStats;
 
 struct XLogRecData;
 
-extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata, XLogRecPtr fpw_lsn);
+extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
+								   XLogRecPtr fpw_lsn,
+								   uint8 flags);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
@@ -262,6 +270,8 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetProgressRecPtr(void);
+extern XLogRecPtr GetLastCheckpointRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
 extern void RemovePromoteSignalFiles(void);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..b2a8d03 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -283,7 +283,7 @@ extern const RmgrData RmgrTable[];
 /*
  * Exported to support xlog switching from checkpointer
  */
-extern pg_time_t GetLastSegSwitchTime(void);
+extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cc0177e..3f10919 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -40,7 +40,7 @@
 
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
-extern void XLogIncludeOrigin(void);
+extern void XLogSetFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
 extern void XLogEnsureRecordSpace(int nbuffers, int ndatas);
 extern void XLogRegisterData(char *data, int len);
-- 