subject:"Add checkpoint and redo LSN to LogCheckpointEnd log message"

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-07-07 Thread Fujii Masao





On 2022/07/07 16:26, Kyotaro Horiguchi wrote:

+ * ControlFileLock is not required as we are the only
+* updator of these variables.

Isn't it better to add "at this time" or something at the end of the
comment because only we're not always updator of them? No?


Excluding initialization, (I believe) checkpointer is really the only
updator of the variables/members.  But I'm fine with the addition.


Ok, so I modified the patch slightly and pushed it. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-07-07 Thread Kyotaro Horiguchi

At Thu, 7 Jul 2022 01:11:33 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2022/03/16 10:29, Kyotaro Horiguchi wrote:
> > At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi
> >  wrote in
> >> In short, I split out the two topics other than checkpoint log to
> >> other threads.
> > So, this is about the main topic of this thread, adding LSNs to
> > checkpint log.  Other topics have moved to other treads [1], [2] ,
> > [3].
> > I think this is no longer controversial alone.  So this patch is now
> > really Read-for-Commiter and is waiting to be picked up.
> 
> +1
> 
> + * ControlFileLock is not required as we are the only
> +  * updator of these variables.
> 
> Isn't it better to add "at this time" or something at the end of the
> comment because only we're not always updator of them? No?

Excluding initialization, (I believe) checkpointer is really the only
updator of the variables/members.  But I'm fine with the addition.

> Barring any objection, I'm thinking to apply the above small change
> and commit the patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-07-06 Thread Fujii Masao





On 2022/03/16 10:29, Kyotaro Horiguchi wrote:

At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in

In short, I split out the two topics other than checkpoint log to
other threads.


So, this is about the main topic of this thread, adding LSNs to
checkpint log.  Other topics have moved to other treads [1], [2] ,
[3].

I think this is no longer controversial alone.  So this patch is now
really Read-for-Commiter and is waiting to be picked up.


+1

+* ControlFileLock is not 
required as we are the only
+* updator of these variables.

Isn't it better to add "at this time" or something at the end of the comment 
because only we're not always updator of them? No?

Barring any objection, I'm thinking to apply the above small change and commit 
the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-28 Thread Greg Stark

This patch is currently showing up with a test failure in the CFBot
however I do *not* believe this is a bug in the patch. I think it's a
bug in that test which is being discussed elsewhere.

It's also a very short and straightforward patch that a committer
could probably make a decision about whether it's a good idea or not
and then apply it quickly if so.

Just to give people a leg up and an idea how short the patch is...
Here's the entire patch:


diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index ed16f279b1..b85c76d8f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6121,7 +6121,8 @@ LogCheckpointEnd(bool restartpoint)
  "%d WAL file(s) added, %d removed, %d recycled; "
  "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
  "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
- "distance=%d kB, estimate=%d kB",
+ "distance=%d kB, estimate=%d kB; "
+ "lsn=%X/%X, redo lsn=%X/%X",
  CheckpointStats.ckpt_bufs_written,
  (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  CheckpointStats.ckpt_segs_added,
@@ -6134,14 +6135,21 @@ LogCheckpointEnd(bool restartpoint)
  longest_msecs / 1000, (int) (longest_msecs % 1000),
  average_msecs / 1000, (int) (average_msecs % 1000),
  (int) (PrevCheckPointDistance / 1024.0),
- (int) (CheckPointDistanceEstimate / 1024.0;
+ (int) (CheckPointDistanceEstimate / 1024.0),
+ /*
+ * ControlFileLock is not required as we are the only
+ * updator of these variables.
+ */
+ LSN_FORMAT_ARGS(ControlFile->checkPoint),
+ LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
  else
  ereport(LOG,
  (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
  "%d WAL file(s) added, %d removed, %d recycled; "
  "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
  "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
- "distance=%d kB, estimate=%d kB",
+ "distance=%d kB, estimate=%d kB; "
+ "lsn=%X/%X, redo lsn=%X/%X",
  CheckpointStats.ckpt_bufs_written,
  (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
  CheckpointStats.ckpt_segs_added,
@@ -6154,7 +6162,13 @@ LogCheckpointEnd(bool restartpoint)
  longest_msecs / 1000, (int) (longest_msecs % 1000),
  average_msecs / 1000, (int) (average_msecs % 1000),
  (int) (PrevCheckPointDistance / 1024.0),
- (int) (CheckPointDistanceEstimate / 1024.0;
+ (int) (CheckPointDistanceEstimate / 1024.0),
+ /*
+ * ControlFileLock is not required as we are the only
+ * updator of these variables.
+ */
+ LSN_FORMAT_ARGS(ControlFile->checkPoint),
+ LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 }

 /*
-- 
2.27.0

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi

At Wed, 16 Mar 2022 09:19:13 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> In short, I split out the two topics other than checkpoint log to
> other threads.

So, this is about the main topic of this thread, adding LSNs to
checkpint log.  Other topics have moved to other treads [1], [2] ,
[3].

I think this is no longer controversial alone.  So this patch is now
really Read-for-Commiter and is waiting to be picked up.

regards.


[1] 
https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com
[2] 
https://www.postgresql.org/message-id/20220316.102900.2003692961119672246.horikyota.ntt%40gmail.com
[3] 
https://www.postgresql.org/message-id/20220316.102509.785466054344164656.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4c3d40e7aaf29cb905f3561a291d35981d762456 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Mar 2022 13:22:41 +0900
Subject: [PATCH v13] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ed16f279b1..b85c76d8f8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6121,7 +6121,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6134,14 +6135,21 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updator of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6154,7 +6162,13 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updator of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 }
 
 /*
-- 
2.27.0

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi

Hello, this is a follow-up topic of [1] (add LSNs to checkpint logs).

Many user-facing texts contains wording like "WAL location" or such
like. The attached is WIP patches that fixes such wordings to "WAL
LSN" or alikes.

The attached patches is v13 but it is not changed at all from v12.
My lastest comments on this version are as follows.

https://www.postgresql.org/message-id/20220209.115204.1794224638476710282.horikyota@gmail.com

> The old 0003 (attached 0004):
> 
> 
> 
> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
> - appendStringInfo(buf, "redo %X/%X; "
> + appendStringInfo(buf, "redo lsn %X/%X; "
> 
> 
> 
> It is shown in the context of a checkpoint record, so I think it is
> not needed or rather lengthning the dump line uselessly. 
> 
> 
> 
> +++ b/src/backend/access/transam/xlog.c
> - (errmsg("request to flush past end of generated 
> WAL; request %X/%X, current position %X/%X",
> + (errmsg("request to flush past end of generated 
> WAL; request lsn %X/%X, current lsn %X/%X",
> 
> 
> 
> +++ b/src/backend/replication/walsender.c
> - (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush position of this server %X/%X",
> + (errmsg("requested starting point %X/%X 
> is ahead of the WAL flush LSN of this server %X/%X",
> 
> 
> 
> "WAL" is upper-cased. So it seems rather strange that the "lsn" is
> lower-cased.  In the first place the message doesn't look like a
> user-facing error message and I feel we don't need position or lsn
> there..
> 
> 
> 
> +++ b/src/bin/pg_rewind/pg_rewind.c
> - pg_log_info("servers diverged at WAL location %X/%X on timeline 
> %u",
> + pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",
> 
> 
> 
> I feel that we don't need "WAL" there.
> 
> 
> 
> +++ b/src/bin/pg_waldump/pg_waldump.c
> - printf(_("  -e, --end=RECPTR   stop reading at WAL location 
> RECPTR\n"));
> + printf(_("  -e, --end=RECPTR   stop reading at WAL LSN RECPTR\n"));
> 
> 
> 
> Mmm.. "WAL LSN RECPTR" looks strange to me.  In the first place I
> don't think "RECPTR" is a user-facing term. Doesn't something like the
> follows work?
> 
> 
> 
> + printf(_("  -e, --end=WAL-LSN   stop reading at WAL-LSN\n"));
> 
> 
> 
> In some changes in this patch shorten the main message text of
> fprintf-ish functions.  That makes the succeeding parameters can be
> inlined.
regards.

[1] 
https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From afed063f09b788bc07a0e2665499c3e3c3edc214 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v13 1/2] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's REDO location:%X/%X\n"),
+	printf(_("Latest checkpoint's REDO LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo));
 	printf(_("Latest checkpoint's REDO WAL file:%s\n"),
 		   xlogfilename);
@@ -274,13 +274,13 @@ main(int argc, char *argv[])
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->unloggedLSN));
-	printf(_("Minimum recovery ending location: %X/%X\n"),
+	printf(_("Minimum recovery ending LSN:  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
-	printf(_("Backup start location:%X/%X\n"),
+	printf(_("Backup start LSN: %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
-	printf(_("Backup end location:  %X/%X\n"),
+	printf(_("Backup end LSN:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:%s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
diff --git a/src/test/recovery/t/016_min_consistency.pl b/src/test/recovery/t/016_min_consistency.pl
index 5e0655c2a9..4ee20309cd 100644
--- a/src/test/recovery/t/016_min_consistency.pl
++

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi

At Tue, 15 Mar 2022 18:26:26 +0900, Michael Paquier  wrote 
in 
> On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
> >  wrote in 
> >> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
> >>  wrote:
> >> 0001 - I don't think you need to do this as UpdateControlFile
> >> (update_controlfile) will anyway update it, no?
> >> + /* Update control file using current time */
> >> + ControlFile->time = (pg_time_t) time(NULL);
> > 
> > Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
> > same copy-pasto..
> 
> This thread has shifted to an entirely different discussion,
> presenting patches that touch code paths unrelated to what was first
> stated.  Shouldn't you create a new thread with a proper $subject to
> attract a more correct audience?  

I felt the same since some messages ago.  I thought Fujii-san thought
that he wants to fix the CreateRestartPoint issue before the
checkpoint log patch but he looks like busy these days.

Since the CreateRestartPoint issue is orthogonal to the main patch,
I'll separate that part into anther thread.

This thread is discussing one other topic, wordings in user-facing
texts. This is also orthogonal (3-dimentionally?) to the two topics.

In short, I split out the two topics other than checkpoint log to
other threads.

Thanks for cueing me to do that!

regareds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Michael Paquier

On Tue, Mar 15, 2022 at 05:23:40PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 12:19:47 +0530, Bharath Rupireddy 
>  wrote in 
>> On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
>>  wrote:
>> 0001 - I don't think you need to do this as UpdateControlFile
>> (update_controlfile) will anyway update it, no?
>> + /* Update control file using current time */
>> + ControlFile->time = (pg_time_t) time(NULL);
> 
> Ugh.. Yes. It is a copy-pasto from older versions.  They may have the
> same copy-pasto..

This thread has shifted to an entirely different discussion,
presenting patches that touch code paths unrelated to what was first
stated.  Shouldn't you create a new thread with a proper $subject to
attract a more correct audience?  
--
Michael


signature.asc
Description: PGP signature

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-15 Thread Kyotaro Horiguchi

+		ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+		ControlFile->minRecoveryPointTLI = 0;
+
+		/* also update local copy */
+		LocalMinRecoveryPoint = InvalidXLogRecPtr;
+		LocalMinRecoveryPointTLI = 0;
+	}
+
+	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
@@ -7104,7 +7123,7 @@ CreateRestartPoint(int flags)
 	xtime = GetLatestXTime();
 	ereport((log_checkpoints ? LOG : DEBUG2),
 			(errmsg("recovery restart point at %X/%X",
-	LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+	LSN_FORMAT_ARGS(RedoRecPtr)),
 			 xtime ? errdetail("Last completed transaction was at log time %s.",
 			   timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

>From dc1db06e3638f6c200c86bf09fded562cef6701a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Mar 2022 13:22:41 +0900
Subject: [PATCH v12 2/5] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd962763cc..ce3815ea99 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6120,7 +6120,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6133,14 +6134,21 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updator of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6153,7 +6161,13 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/*
+		 * ControlFileLock is not required as we are the only
+		 * updator of these variables.
+		 */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 }
 
 /*
-- 
2.27.0

>From 1941cdf0d48194044ef8f30d28f532beb51526dc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 03:57:04 +
Subject: [PATCH v12 3/5] Change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..59f39267df 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:%X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_("Latest checkpoint's RE

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-14 Thread Bharath Rupireddy

On Fri, Mar 4, 2022 at 10:40 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 14 Feb 2022 14:52:15 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > In this version , 0001 gets one fix and two comment updates.
>
> While the disucssion on back-patching of 0001 proceeding on another
> branch, the main patch iself gets looks like as if rotten on
> CF-App. So I rebased v10 on the current master.  0001 is replaced by
> an adjusted patch based on the latest "control file update fix" patch.
>
> 0001: Fix possible incorrect controlfile update that leads to
>   unrecoverable database.

0001 - I don't think you need to do this as UpdateControlFile
(update_controlfile) will anyway update it, no?
+ /* Update control file using current time */
+ ControlFile->time = (pg_time_t) time(NULL);

> 0002: Add REDO/Checkpiont LSNs to checkpoinkt-end log message.
>   (The main patch in this thread)

0002 - If at all the intention is to say that no ControlFileLock is
required while reading ControlFile->checkPoint and
ControlFile->checkPointCopy.redo, let's say it, no? How about
something like "No ControlFileLock is required while reading
ControlFile->checkPoint and ControlFile->checkPointCopy.redo as there
can't be any other process updating them concurrently."?

+ /* we are the only updator of these variables */
+ LSN_FORMAT_ARGS(ControlFile->checkPoint),
+ LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;

> 0003: Replace (WAL-)location to LSN in pg_controldata.
>
> 0004: Replace (WAL-)location to LSN in user-facing texts.
>   (This doesn't reflect my recent comments.)

If you don't mind, can you please put the comments here?

> 0005: Unhyphenate the word archive-recovery and similars.

0005 - How about replacing "crash-recovery" to "crash recovery" in
postgres-ref.sgml too?

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-03-03 Thread Kyotaro Horiguchi

DB_IN_ARCHIVE_RECOVERY &&
-		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
 	{
-		ControlFile->checkPoint = lastCheckPointRecPtr;
-		ControlFile->checkPointCopy = lastCheckPoint;
-
-		/*
-		 * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-		 * this will have happened already while writing out dirty buffers,
-		 * but not necessarily - e.g. because no buffers were dirtied.  We do
-		 * this because a non-exclusive base backup uses minRecoveryPoint to
-		 * determine which WAL files must be included in the backup, and the
-		 * file (or files) containing the checkpoint record must be included,
-		 * at a minimum. Note that for an ordinary restart of recovery there's
-		 * no value in having the minimum recovery point any earlier than this
-		 * anyway, because redo will begin just after the checkpoint record.
-		 */
 		if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
 		{
 			ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -7026,8 +7031,25 @@ CreateRestartPoint(int flags)
 		}
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
 			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-		UpdateControlFile();
 	}
+	else
+	{
+		/* recovery mode is not supposed to end during shutdown restartpoint */
+		Assert((flags & CHECKPOINT_IS_SHUTDOWN) == 0);
+
+		/*
+		 * Aarchive recovery has ended. Crash recovery ever after should
+		 * always recover to the end of WAL
+		 */
+		ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+		ControlFile->minRecoveryPointTLI = 0;
+
+		/* also update local copy */
+		LocalMinRecoveryPoint = InvalidXLogRecPtr;
+		LocalMinRecoveryPointTLI = 0;
+	}
+
+	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
@@ -7104,7 +7126,7 @@ CreateRestartPoint(int flags)
 	xtime = GetLatestXTime();
 	ereport((log_checkpoints ? LOG : DEBUG2),
 			(errmsg("recovery restart point at %X/%X",
-	LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+	LSN_FORMAT_ARGS(RedoRecPtr)),
 			 xtime ? errdetail("Last completed transaction was at log time %s.",
 			   timestamptz_to_str(xtime)) : 0));
 
-- 
2.27.0

>From 1dc663152ba00903f1bd25e7b28540fd97c41bd7 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Mar 2022 13:22:41 +0900
Subject: [PATCH v11 2/5] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3987aa81de..2e1123fd42 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6120,7 +6120,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6133,14 +6134,18 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/* we are the only updator of these variables */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -6153,7 +6158,10 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int)

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-25 Thread Kyotaro Horiguchi

At Fri, 25 Feb 2022 16:52:30 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Ugh! Wait for a moment. Something's wrong.

Sorry, what is wrong was my working directory.  It was broken by my
bogus operation. All the files apply corresponding versions correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi

At Fri, 25 Feb 2022 16:47:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> So, this is the patches for pg12-10.  11 can share the same patch with
> 12.  10 has differences in two points.
> 
> 10 has ControlFile->prevCheckPoint.
> 
> The DETAILS of the "recovery restart point at" message is not
> capitalized.  But I suppose it is so close to EOL so that we don't
> want to "fix" it risking existing usecases.

Ugh! Wait for a moment. Something's wrong.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi

At Fri, 25 Feb 2022 16:06:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > While making patches for v12, I see a test failure of pg_rewind for
> > uncertain reason. I'm investigating that but I post this for
> > discussion.
> 
> Hmm. Too stupid.  Somehow I overly removed the latchet condition for
> minRecoveryPoint.  So the same patch worked for v12.

So, this is the patches for pg12-10.  11 can share the same patch with
12.  10 has differences in two points.

10 has ControlFile->prevCheckPoint.

The DETAILS of the "recovery restart point at" message is not
capitalized.  But I suppose it is so close to EOL so that we don't
want to "fix" it risking existing usecases.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c89e2b509723b68897f2af49a154af2a69f0747b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 25 Feb 2022 15:04:00 +0900
Subject: [PATCH v3] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 71 +++
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 885558f291..2b2568c475 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9334,7 +9334,7 @@ CreateRestartPoint(int flags)
 
/* Also update the info_lck-protected copy */
SpinLockAcquire(&XLogCtl->info_lck);
-   XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+   XLogCtl->RedoRecPtr = RedoRecPtr;
SpinLockRelease(&XLogCtl->info_lck);
 
/*
@@ -9350,7 +9350,10 @@ CreateRestartPoint(int flags)
if (log_checkpoints)
LogCheckpointStart(flags, true);
 
-   CheckPointGuts(lastCheckPoint.redo, flags);
+   CheckPointGuts(RedoRecPtr, flags);
+
+   /* Update pg_control */
+   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
/*
 * Remember the prior checkpoint's redo ptr for
@@ -9358,31 +9361,28 @@ CreateRestartPoint(int flags)
 */
PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+   Assert (PriorRedoPtr < RedoRecPtr);
+
+   ControlFile->checkPoint = lastCheckPointRecPtr;
+   ControlFile->checkPointCopy = lastCheckPoint;
+
+   /* Update control file using current time */
+   ControlFile->time = (pg_time_t) time(NULL);
+
/*
-* Update pg_control, using current time.  Check that it still shows
-* IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-* this is a quick hack to make sure nothing really bad happens if 
somehow
-* we get here after the end-of-recovery checkpoint.
+* Ensure minRecoveryPoint is past the checkpoint record while archive
+* recovery is still ongoing.  Normally, this will have happened already
+* while writing out dirty buffers, but not necessarily - e.g. because 
no
+* buffers were dirtied.  We do this because a non-exclusive base backup
+* uses minRecoveryPoint to determine which WAL files must be included 
in
+* the backup, and the file (or files) containing the checkpoint record
+* must be included, at a minimum. Note that for an ordinary restart of
+* recovery there's no value in having the minimum recovery point any
+* earlier than this anyway, because redo will begin just after the
+* checkpoint record.
 */
-   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-   ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
{
-   ControlFile->checkPoint = lastCheckPointRecPtr;
-   ControlFile->checkPointCopy = lastCheckPoint;
-   ControlFile->time = (pg_time_t) time(NULL);
-
-   /*
-* Ensure minRecoveryPoint is past the checkpoint record.  
Normally,
-* this will have happened already while writing out dirty 
buffers,
-* but not necessari

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi

At Fri, 25 Feb 2022 15:31:12 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> While making patches for v12, I see a test failure of pg_rewind for
> uncertain reason. I'm investigating that but I post this for
> discussion.

Hmm. Too stupid.  Somehow I overly removed the latchet condition for
minRecoveryPoint.  So the same patch worked for v12.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-24 Thread Kyotaro Horiguchi

At Tue, 22 Feb 2022 17:44:01 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao  
> wrote in 
> > 
> > > Of the following, I think we should do (a) and (b) to make future
> > > backpatchings easier.
> > > a) Use RedoRecPtr and PriorRedoPtr after they are assigned.
> > > b) Move assignment to PriorRedoPtr into the ControlFileLock section.
> > 
> > I failed to understand how (a) and (b) can make the backpatching
> > easier. How easy to backpatch seems the same whether we apply (a) and
> > (b) or not...
> 
> That premises that the patch applied to master contains (a) and (b).
> So if it doesn't, those are not need by older branches.

I was once going to remove them.  But according the discussion below,
the patch for back-patching is now quite close to that for the master
branch.  So I left them alone.

> > > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
> > >PriorRedoPtr.
> > 
> > But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a
> 
> I didn't believe that it happens. (So, it came from my
> convervativeness, or laziness, or both:p) The code dates from 2009 and
> StartupXLOG makes a concurrent checkpoint with bgwriter. But as of at
> least 9.5, StartupXLOG doesn't directly call CreateCheckPoint. So I
> think that won't happen.
> 
> So, in short, I agree to remove it or turn it into Assert().

It was a bit out of point.  If we assume RedoRecPtr is always larger
than PriorRedoPtr and then we don't need to check that there, we
should also remove the "if (PriorRedoPtr < RedoRecPtr)" branch just
above, which means the patch for back-branches gets very close to that
for the master.  Do we make such a large change on back branches?
Anyways this version once takes that way.

> > if (flags & CHECKPOINT_IS_SHUTDOWN)
> > ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> > 
> > Same as above. IMO it's safer and better to always update the state
> > (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if
> > CHECKPOINT_IS_SHUTDOWN flag is passed.
> 
> That means we may exit recovery mode after ShutdownXLOG called
> CreateRestartPoint. I don't think that may happen.  So I'd rather add
> Assert ((flags&CHECKPOINT_IS_SHUTDOWN) == 0) there instaed.

So this version for v14 gets updated in the following points.

Completely removed the code path for the case some other process runs
simultaneous checkpoint.

Removed the condition (RedoRecPtr > PriorRedoPtr) for
UpdateCheckPointDistanceEstimate() call.

Added an assertion to the recoery-end path.

# Honestly I feel this is a bit too much for back-patching, though.

While making patches for v12, I see a test failure of pg_rewind for
uncertain reason. I'm investigating that but I post this for
discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From e983f3d4c2dbeea742aed0ef1e209e7821f6687f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 14 Feb 2022 13:04:33 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Along with that fix, we remove a dead code path for the case some
other process ran a simultaneous checkpoint.  It seems like just a
preventive measure but it's no longer useful because we are sure that
checkpoint is performed only by checkpointer except single process
mode.
---
 src/backend/access/transam/xlog.c | 73 ---
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6208e123e5..ff4a90eacc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9587,6 +9587,9 @@ CreateRestartPoint(int flags)
XLogSegNo   _logSegNo;
TimestampTz xtime;
 
+   /* we don't assume concurrent checkpoint/restartpoint to run */
+   Assert (!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
/* Get a local copy of the last safe checkpoint record. */
SpinLockAcquire(&XLogCtl->info_lck);
lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr;
@@ -9653,7 +9656,7 @@ CreateRestartPoint(int flags)
 
/* Also update the info_lck-protected copy */
SpinLockAcquire(&XLogCtl->info_lck);
-   XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+   XLogCtl->RedoRecPtr = RedoRecPtr;
SpinLockRelease(&XLogCtl->info_lck);
 
/*
@@ -9672,7 +9675,10 @@ CreateResta

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-22 Thread Kyotaro Horiguchi

At Tue, 22 Feb 2022 01:59:45 +0900, Fujii Masao  
wrote in 
> 
> > Of the following, I think we should do (a) and (b) to make future
> > backpatchings easier.
> > a) Use RedoRecPtr and PriorRedoPtr after they are assigned.
> > b) Move assignment to PriorRedoPtr into the ControlFileLock section.
> 
> I failed to understand how (a) and (b) can make the backpatching
> easier. How easy to backpatch seems the same whether we apply (a) and
> (b) or not...

That premises that the patch applied to master contains (a) and (b).
So if it doesn't, those are not need by older branches.


> > c) Skip udpate of minRecoveryPoint only when the checkpoint gets old.
> 
> Yes.
> 
> 
> > d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
> >PriorRedoPtr.
> 
> But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a

I didn't believe that it happens. (So, it came from my
convervativeness, or laziness, or both:p) The code dates from 2009 and
StartupXLOG makes a concurrent checkpoint with bgwriter. But as of at
least 9.5, StartupXLOG doesn't directly call CreateCheckPoint. So I
think that won't happen.

So, in short, I agree to remove it or turn it into Assert().

> restartpoint is skipped at the beginning of CreateRestartPoint() in
> that case. If this understanding is right, the check of "RedoRecPtr <=
> PriorRedoPtr" is not necessary before calling
> UpdateCheckPointDistanceEstimate().
> 
> 
> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
> + ControlFile->minRecoveryPointTLI = 0;
> 
> Don't we need to update LocalMinRecoveryPoint and
> LocalMinRecoveryPointTLI after this? Maybe it's not necessary, but
> ISTM that it's safer and better to always update them whether the
> state is DB_IN_ARCHIVE_RECOVERY or not.

Agree that it's safer and tidy.

>   if (flags & CHECKPOINT_IS_SHUTDOWN)
>   ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> 
> Same as above. IMO it's safer and better to always update the state
> (whether the state is DB_IN_ARCHIVE_RECOVERY or not) if
> CHECKPOINT_IS_SHUTDOWN flag is passed.

That means we may exit recovery mode after ShutdownXLOG called
CreateRestartPoint. I don't think that may happen.  So I'd rather add
Assert ((flags&CHECKPOINT_IS_SHUTDOWN) == 0) there instaed.

I'll post the new version later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-21 Thread Fujii Masao





On 2022/02/14 14:40, Kyotaro Horiguchi wrote:

For backbranches, the attached for pg14 does part of the full patch.


Thanks for updating the patch!



Of the following, I think we should do (a) and (b) to make future
backpatchings easier.

a) Use RedoRecPtr and PriorRedoPtr after they are assigned.

b) Move assignment to PriorRedoPtr into the ControlFileLock section.


I failed to understand how (a) and (b) can make the backpatching easier. How 
easy to backpatch seems the same whether we apply (a) and (b) or not...



c) Skip udpate of minRecoveryPoint only when the checkpoint gets old.


Yes.



d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
   PriorRedoPtr.


But "RedoRecPtr <= PriorRedoPtr" will never happen, will it? Because a restartpoint is 
skipped at the beginning of CreateRestartPoint() in that case. If this understanding is right, the check 
of "RedoRecPtr <= PriorRedoPtr" is not necessary before calling 
UpdateCheckPointDistanceEstimate().


+   ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+   ControlFile->minRecoveryPointTLI = 0;

Don't we need to update LocalMinRecoveryPoint and LocalMinRecoveryPointTLI 
after this? Maybe it's not necessary, but ISTM that it's safer and better to 
always update them whether the state is DB_IN_ARCHIVE_RECOVERY or not.


if (flags & CHECKPOINT_IS_SHUTDOWN)
ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;

Same as above. IMO it's safer and better to always update the state (whether 
the state is DB_IN_ARCHIVE_RECOVERY or not) if CHECKPOINT_IS_SHUTDOWN flag is 
passed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-13 Thread Kyotaro Horiguchi

d,
-		 * at a minimum. Note that for an ordinary restart of recovery there's
-		 * no value in having the minimum recovery point any earlier than this
-		 * anyway, because redo will begin just after the checkpoint record.
-		 */
 		if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
 		{
 			ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9786,15 +9787,29 @@ CreateRestartPoint(int flags)
 		}
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
 			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-		UpdateControlFile();
 	}
+	else
+	{
+		/*
+		 * Archive recovery has ended. Crash recovery ever after should always
+		 * recover to the end of WAL.
+		 */
+		ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+		ControlFile->minRecoveryPointTLI = 0;
+	}
+	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
 	 */
-	if (PriorRedoPtr != InvalidXLogRecPtr)
+
+ 	/*
+ 	 * Update the average distance between checkpoints/restartpoints if the
+	 * prior checkpoint exists. The second term is just in case.
+ 	 */
+	if (PriorRedoPtr != InvalidXLogRecPtr && RedoRecPtr > PriorRedoPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
 	/*
@@ -9864,7 +9879,7 @@ CreateRestartPoint(int flags)
 	xtime = GetLatestXTime();
 	ereport((log_checkpoints ? LOG : DEBUG2),
 			(errmsg("recovery restart point at %X/%X",
-	LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+	LSN_FORMAT_ARGS(RedoRecPtr)),
 			 xtime ? errdetail("Last completed transaction was at log time %s.",
 			   timestamptz_to_str(xtime)) : 0));
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23f691cd47..8bd4c47b6b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -386,7 +386,6 @@ CheckpointerMain(void)
 
 			/* Check if we should perform a checkpoint or a restartpoint. */
 			do_restartpoint = RecoveryInProgress();
-
 			/*
 			 * Atomically fetch the request flags to figure out what kind of a
 			 * checkpoint we should perform, and increase the started-counter
-- 
2.27.0

>From 5a78c6a41996d0bc5c2fb21724caf789a0a83d46 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 04:34:54 +
Subject: [PATCH v10 2/5] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8c2882b49f..ba1b96bd1d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8923,7 +8923,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -8936,14 +8937,18 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/* we are the only updator of these variables */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -8956,7 +8961,10 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (ave

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-13 Thread Kyotaro Horiguchi

At Mon, 14 Feb 2022 14:40:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> It doesn't apply even on pg13 (due to LSN_FORMAT_ARGS).  I will make
> the per-version patches if you are fine with this.

Oops!  I forgot to rename the patch to avoid confusion on CF-bots.
I'll resend new version soon to avoid the confusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-13 Thread Kyotaro Horiguchi

At Fri, 11 Feb 2022 01:00:03 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2022/02/09 11:52, Kyotaro Horiguchi wrote:
> > 0001: The fix of CreateRestartPoint
> 
> This patch might be ok for the master branch. But since concurrent
> checkpoint and restartpoint can happen in v14 or before, we would need
> another patch based on that assumption, for the backport. How about
> fixing the bug all the branches at first, then apply this patch in the
> master to improve the code?

For backbranches, the attached for pg14 does part of the full patch.
Of the following, I think we should do (a) and (b) to make future
backpatchings easier.

a) Use RedoRecPtr and PriorRedoPtr after they are assigned.

b) Move assignment to PriorRedoPtr into the ControlFileLock section.

c) Skip udpate of minRecoveryPoint only when the checkpoint gets old.

d) Skip call to UpdateCheckPointDistanceEstimate() when RedoRecPtr <=
  PriorRedoPtr.

# Mmm. The v9-0001  contains a silly mistake here..

Still I'm not sure whether that case really happens and how checkpoint
behaves *after* that happenes, but at least it protects database from
the possible unrecoverable state due to the known issue here..

It doesn't apply even on pg13 (due to LSN_FORMAT_ARGS).  I will make
the per-version patches if you are fine with this.

regards.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From df516b487617c570332ea076b91773b810629a11 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 14 Feb 2022 13:04:33 +0900
Subject: [PATCH v2] Correctly update contfol file at the end of archive
 recovery

CreateRestartPoint runs WAL file cleanup basing on the checkpoint just
have finished in the function.  If the database has exited
DB_IN_ARCHIVE_RECOVERY state when the function is going to update
control file, the function refrains from updating the file at all then
proceeds to WAL cleanup having the latest REDO LSN, which is now
inconsistent with the control file.  As the result, the succeeding
cleanup procedure overly removes WAL files against the control file
and leaves unrecoverable database until the next checkpoint finishes.

Since all buffers have flushed out, we may safely regard that restart
point established regardless of the recovery state.  Thus we may and
should update checkpoint LSNs of checkpoint file in that case.  Still
we update minRecoveryPoint only during archive recovery but explicitly
clear if we have exited recovery.

Addition to that fix, this commit makes some cosmetic changes that
consist with the changes we are going to make on the master branch.
---
 src/backend/access/transam/xlog.c | 81 ---
 1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6208e123e5..28c3c4b7cf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9653,7 +9653,7 @@ CreateRestartPoint(int flags)

 	/* Also update the info_lck-protected copy */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->RedoRecPtr = lastCheckPoint.redo;
+	XLogCtl->RedoRecPtr = RedoRecPtr;
 	SpinLockRelease(&XLogCtl->info_lck);

 	/*
@@ -9672,7 +9672,10 @@ CreateRestartPoint(int flags)
 	/* Update the process title */
 	update_checkpoint_display(flags, true, false);

-	CheckPointGuts(lastCheckPoint.redo, flags);
+	CheckPointGuts(RedoRecPtr, flags);
+
+	/* Update pg_control */
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -9681,52 +9684,74 @@ CreateRestartPoint(int flags)
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;

 	/*
-	 * Update pg_control, using current time.  Check that it still shows
-	 * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
-	 * this is a quick hack to make sure nothing really bad happens if somehow
-	 * we get here after the end-of-recovery checkpoint.
+	 * Update pg_control, using current time if no later checkpoints have been
+	 * performed.
 	 */
-	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+	if (PriorRedoPtr < RedoRecPtr)
 	{
 		ControlFile->checkPoint = lastCheckPointRecPtr;
 		ControlFile->checkPointCopy = lastCheckPoint;
 		ControlFile->time = (pg_time_t) time(NULL);

 		/*
-		 * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-		 * this will have happened already while writing out dirty buffers,
-		 * but not necessarily - e.g. because no buffers were dirtied.  We do
-		 * this because a non-exclusive base backup uses minRecoveryPoint to
-		 * determine which WAL files must be included in the backup, and the
-		 * file (or files) containing the checkpoint record must be included,
-		 * at a minimum. Note that for an ordinary restart of recovery there's
-		 * no value in having the minimum recovery point any earlier than this
+		 * Ensure minRecoveryPoint is past the checkp

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-10 Thread Fujii Masao





On 2022/02/09 11:52, Kyotaro Horiguchi wrote:

0001: The fix of CreateRestartPoint


This patch might be ok for the master branch. But since concurrent checkpoint 
and restartpoint can happen in v14 or before, we would need another patch based 
on that assumption, for the backport. How about fixing the bug all the branches 
at first, then apply this patch in the master to improve the code?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-08 Thread Kyotaro Horiguchi

	ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
 	{
-		ControlFile->checkPoint = lastCheckPointRecPtr;
-		ControlFile->checkPointCopy = lastCheckPoint;
-
-		/*
-		 * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
-		 * this will have happened already while writing out dirty buffers,
-		 * but not necessarily - e.g. because no buffers were dirtied.  We do
-		 * this because a non-exclusive base backup uses minRecoveryPoint to
-		 * determine which WAL files must be included in the backup, and the
-		 * file (or files) containing the checkpoint record must be included,
-		 * at a minimum. Note that for an ordinary restart of recovery there's
-		 * no value in having the minimum recovery point any earlier than this
-		 * anyway, because redo will begin just after the checkpoint record.
-		 */
 		if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
 		{
 			ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
@@ -9786,15 +9787,26 @@ CreateRestartPoint(int flags)
 		}
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
 			ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
-		UpdateControlFile();
 	}
+	else
+	{
+		/*
+		 * Archive recovery have ended. Crash recovery ever after should always
+		 * recover to the end of WAL.
+		 */
+		ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
+		ControlFile->minRecoveryPointTLI = 0;
+	}
+	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
 	 */
-	if (PriorRedoPtr != InvalidXLogRecPtr)
+
+	/* the second term is just in case */
+	if (PriorRedoPtr != InvalidXLogRecPtr || RedoRecPtr > PriorRedoPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
 	/*
@@ -9864,7 +9876,7 @@ CreateRestartPoint(int flags)
 	xtime = GetLatestXTime();
 	ereport((log_checkpoints ? LOG : DEBUG2),
 			(errmsg("recovery restart point at %X/%X",
-	LSN_FORMAT_ARGS(lastCheckPoint.redo)),
+	LSN_FORMAT_ARGS(RedoRecPtr)),
 			 xtime ? errdetail("Last completed transaction was at log time %s.",
 			   timestamptz_to_str(xtime)) : 0));
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 23f691cd47..8bd4c47b6b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -386,7 +386,6 @@ CheckpointerMain(void)
 
 			/* Check if we should perform a checkpoint or a restartpoint. */
 			do_restartpoint = RecoveryInProgress();
-
 			/*
 			 * Atomically fetch the request flags to figure out what kind of a
 			 * checkpoint we should perform, and increase the started-counter
-- 
2.27.0

>From 0a8f18e4c13d67d246e39781fadc1717b4f8dcf4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 1 Feb 2022 04:34:54 +
Subject: [PATCH v9 2/4] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6c6e67f80..60da34433b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8923,7 +8923,8 @@ LogCheckpointEnd(bool restartpoint)
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
-		"distance=%d kB, estimate=%d kB",
+		"distance=%d kB, estimate=%d kB; "
+		"lsn=%X/%X, redo lsn=%X/%X",
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -8936,14 +8937,18 @@ LogCheckpointEnd(bool restartpoint)
 		longest_msecs / 1000, (int) (longest_msecs % 1000),
 		average_msecs / 1000, (int) (average_msecs % 1000),
 		(int) (PrevCheckPointDistance / 1024.0),
-		(int) (CheckPointDistanceEstimate / 1024.0;
+		(int) (CheckPointDistanceEstimate / 1024.0),
+		/* we are the only updator of these variables */
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo;
 	else
 		ereport(LOG,
 (errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-08 Thread Bharath Rupireddy

On Tue, Feb 8, 2022 at 2:10 PM Kyotaro Horiguchi
 wrote:
> > Thus.. the attached removes the ambiguity of of the proposed patch
> > about the LSNs in the restartpoint-ending log message.
>
> Thoughts?

Thanks for the patch. I have few comments on the
v1-0001-Get-rid-of-unused-path-to-handle-concurrent-check.patch

1) Can we have this Assert right after "skipping restartpoint, already
performed at %X/%X" error message block? Does it make any difference?
My point is that if at all, we were to assert this, why can't we do it
before CheckPointGuts?
+ /* We mustn't have a concurrent checkpoint that advances checkpoint LSN */
+ Assert(lastCheckPoint.redo > ControlFile->checkPointCopy.redo);
+
2) Related to the above Assert, do we really need an assertion or a FATAL error?
3) Let's be consistent with "crash recovery" - replace
"archive-recovery" with "archive recovery"?
+ * We have exited from archive-recovery mode after this restartpoint
+ * started. Crash recovery ever after should always recover to the end
4) Isn't it enough to say "Crash recovery should always recover to the
end of WAL."?
+ * started. Crash recovery ever after should always recover to the end
5) Is there a reliable test case covering this code? Please point me
if the test case is shared upthread somewhere.
6) So, with this patch, the v8 patch-set posted at [1] doesn't need
any changes IIUC. If that's the case, please feel free to post all the
patches together such that they get tested in cfbot.

[1] - 
https://www.postgresql.org/message-id/CALj2ACUtZhTb%3D2ENkF3BQ3wi137uaGi__qzvXC-qFYC0XwjALw%40mail.gmail.com

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-08 Thread Kyotaro Horiguchi

Mmm.. checkpoint and checkpointer are quite confusing in this context..

At Tue, 08 Feb 2022 16:58:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 7 Feb 2022 13:51:31 +0900, Fujii Masao  
> wrote in 
> > 
> > 
> > On 2022/02/07 12:02, Kyotaro Horiguchi wrote:
> > > - If any later checkpoint/restartpoint has been established, just skip
> > >remaining task then return false. (!chkpt_was_latest)
> > >(I'm not sure this can happen, though.)
> > > - we update control file only when archive recovery is still ongoing.
> > 
> > This comment seems to conflict with what your PoC patch does. Because
> > with the patch, ControlFile->checkPoint and
> > ControlFile->checkPointCopy seem to be updated even when
> > ControlFile->state != DB_IN_ARCHIVE_RECOVERY.
> 
> Ah, yeah, by "update" I meant that "move forward". Sorry for confusing
> wording.

Please ignore the "that".

> > I agree with what your PoC patch does for now. When we're not in
> > archive recovery state, checkpoint and REDO locations in pg_control
> > should be updated but min recovery point should be reset to invalid
> > one (which instruments that subsequent crash recovery should replay
> > all available WAL files).
> 
> Yes.  All buffers before the last recovery point's end have been
> flushed out so the recovery point is valid as a checkpoint.  On the
> other hand minRecoveryPoint is no longer needed and actually is
> ignored at the next crash recovery.  We can leave it alone but it is
> consistent that it is cleared.
> 
> > > - Otherwise reset minRecoveryPoint then continue.
> > > Do you have any thoughts or opinions?
> > 
> > Regarding chkpt_was_latest, whether the state is
> > DB_IN_ARCHIVE_RECOVERY or not, if checkpoint and redo locations in
> > pg_control are updated, IMO we don't need to skip the "remaining
> > tasks". Since those locations are updated and subsequent crash
> > recovery will start from that redo location, for example, ISTM that we
> > can safely delete old WAL files prior to the redo location as the
> > "remaining tasks". Thought?
> 
> If I read you correctly, the PoC works that way. It updates pg_control
> if the restart point is latest then performs the remaining cleanup
> tasks in that case. Recovery state doesn't affect this process.
> 
> I reexamined about the possibility of concurrent checkpoints.
> 
> Both CreateCheckPoint and CreateRestartPoint are called from
> checkpointer loop, shutdown handler of checkpointer and standalone
> process.  So I can't see a possibility of concurrent checkpoints.
> 
> In the past we had a time when startup process called CreateCheckPoint

- directly in the crash recovery case where checkpoint is not running
- but since 7ff23c6d27 checkpoint is started before startup process
+ directly in the crash recovery case where checkpointer is not running
+ but since 7ff23c6d27 checkpointer is launched before startup process

> starts.  So I conclude that that cannot happen.
> 
> So the attached takes away the path for the case where the restart
> point is overtaken by a concurrent checkpoint.
> 
> Thus.. the attached removes the ambiguity of of the proposed patch
> about the LSNs in the restartpoint-ending log message.

Thoughts?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-07 Thread Kyotaro Horiguchi

At Mon, 7 Feb 2022 13:51:31 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2022/02/07 12:02, Kyotaro Horiguchi wrote:
> > - If any later checkpoint/restartpoint has been established, just skip
> >remaining task then return false. (!chkpt_was_latest)
> >(I'm not sure this can happen, though.)
> > - we update control file only when archive recovery is still ongoing.
> 
> This comment seems to conflict with what your PoC patch does. Because
> with the patch, ControlFile->checkPoint and
> ControlFile->checkPointCopy seem to be updated even when
> ControlFile->state != DB_IN_ARCHIVE_RECOVERY.

Ah, yeah, by "update" I meant that "move forward". Sorry for confusing
wording.

> I agree with what your PoC patch does for now. When we're not in
> archive recovery state, checkpoint and REDO locations in pg_control
> should be updated but min recovery point should be reset to invalid
> one (which instruments that subsequent crash recovery should replay
> all available WAL files).

Yes.  All buffers before the last recovery point's end have been
flushed out so the recovery point is valid as a checkpoint.  On the
other hand minRecoveryPoint is no longer needed and actually is
ignored at the next crash recovery.  We can leave it alone but it is
consistent that it is cleared.

> > - Otherwise reset minRecoveryPoint then continue.
> > Do you have any thoughts or opinions?
> 
> Regarding chkpt_was_latest, whether the state is
> DB_IN_ARCHIVE_RECOVERY or not, if checkpoint and redo locations in
> pg_control are updated, IMO we don't need to skip the "remaining
> tasks". Since those locations are updated and subsequent crash
> recovery will start from that redo location, for example, ISTM that we
> can safely delete old WAL files prior to the redo location as the
> "remaining tasks". Thought?

If I read you correctly, the PoC works that way. It updates pg_control
if the restart point is latest then performs the remaining cleanup
tasks in that case. Recovery state doesn't affect this process.

I reexamined about the possibility of concurrent checkpoints.

Both CreateCheckPoint and CreateRestartPoint are called from
checkpointer loop, shutdown handler of checkpointer and standalone
process.  So I can't see a possibility of concurrent checkpoints.

In the past we had a time when startup process called CreateCheckPoint
directly in the crash recovery case where checkpoint is not running
but since 7ff23c6d27 checkpoint is started before startup process
starts.  So I conclude that that cannot happen.

So the attached takes away the path for the case where the restart
point is overtaken by a concurrent checkpoint.

Thus.. the attached removes the ambiguity of of the proposed patch
about the LSNs in the restartpoint-ending log message.

Thoughts?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4369846b6de00fbddb215300a8ff774bbc04 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 8 Feb 2022 16:42:53 +0900
Subject: [PATCH v1] Get rid of unused path to handle concurrent checkpoints

CreateRestartPoint considered the case a concurrent checkpoint is
running. But after 7ff23c6d27 we no longer launch multiple checkpoints
simultaneously.  That code path, if it is passed, may leave
unrecoverable database by removing WAL segments that are required by
the last established restartpoint.
---
 src/backend/access/transam/xlog.c | 53 +++
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 958220c495..01a345e8bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9752,29 +9752,30 @@ CreateRestartPoint(int flags)
PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
/*
-* Update pg_control, using current time.  Check that it still shows
-* DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do 
nothing;
-* this is a quick hack to make sure nothing really bad happens if 
somehow
-* we get here after the end-of-recovery checkpoint.
+* Update pg_control, using current time.
 */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-   ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+
+   /* We mustn't have a concurrent checkpoint that advances checkpoint LSN 
*/
+   Assert(lastCheckPoint.redo > ControlFile->checkPointCopy.redo);
+
+   ControlFile->checkPoint = lastCheckPointRecPtr;
+   ControlFile->checkPointCopy = lastCheckPoint;
+
+   /*
+* Ensure minRecoveryPoint is past the checkpoint record while archive
+* recovery is still ongoing.  Normally, this will have happened already
+* while writing out dirty buffers, but not necessarily - e.g. because 
no
+* buffers were dirtied.  We do this because a non-exclusive base backup
+* uses minRecoveryPoint to determine whic

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-06 Thread Fujii Masao





On 2022/02/07 12:02, Kyotaro Horiguchi wrote:

- If any later checkpoint/restartpoint has been established, just skip
   remaining task then return false. (!chkpt_was_latest)
   (I'm not sure this can happen, though.)

- we update control file only when archive recovery is still ongoing.


This comment seems to conflict with what your PoC patch does. Because with the patch, 
ControlFile->checkPoint and ControlFile->checkPointCopy seem to be updated even 
when ControlFile->state != DB_IN_ARCHIVE_RECOVERY.

I agree with what your PoC patch does for now. When we're not in archive 
recovery state, checkpoint and REDO locations in pg_control should be updated 
but min recovery point should be reset to invalid one (which instruments that 
subsequent crash recovery should replay all available WAL files).


- Otherwise reset minRecoveryPoint then continue.

Do you have any thoughts or opinions?


Regarding chkpt_was_latest, whether the state is DB_IN_ARCHIVE_RECOVERY or not, if checkpoint and 
redo locations in pg_control are updated, IMO we don't need to skip the "remaining 
tasks". Since those locations are updated and subsequent crash recovery will start from that 
redo location, for example, ISTM that we can safely delete old WAL files prior to the redo location 
as the "remaining tasks". Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-06 Thread Kyotaro Horiguchi

At Mon, 07 Feb 2022 10:16:34 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 4 Feb 2022 10:59:04 +0900, Fujii Masao  
> wrote in 
> > On 2022/02/03 15:50, Kyotaro Horiguchi wrote:
> > > By the way, restart point should start only while recoverying, and at
> > > the timeof the start both checkpoint.redo and checkpoint LSN are
> > > already past. We shouldn't update minRecovery point after promotion,
> > > but is there any reason for not updating the checkPoint and
> > > checkPointCopy?  If we update them after promotion, the
> > > which-LSN-to-show problem would be gone.
> > 
> > I tried to find the reason by reading the past discussion, but have
> > not found that yet.
> > 
> > If we update checkpoint and REDO LSN at pg_control in that case, we
> > also need to update min recovery point at pg_control? Otherwise the
> > min recovery point at pg_control still indicates the old LSN that
> > previous restart point set.
> 
> I had an assuption that the reason I think it shouldn't update
> minRecoveryPoint is that it has been or is going to be reset to
> invalid LSN by promotion and the checkpoint should refrain from
> touching it.

Hmm.. It doesn't seem to be the case. If a server crashes just after
promotion and before requesting post-promtion checkpoint,
minRecoveryPoint stays at a valid LSN.

(Promoted at 0/728)
Database cluster state:   in production
Latest checkpoint location:   0/660
Latest checkpoint's REDO location:0/628
Latest checkpoint's REDO WAL file:00010006
Minimum recovery ending location: 0/790
Min recovery ending loc's timeline:   2

minRecoveryPoint/TLI are ignored in any case where a server in
in-production state is started.  In other words, the values are
useless.  There's no clear or written reason for unrecording the last
ongoing restartpoint after promotion.

Before fast-promotion was introduced, we shouldn't get there after
end-of-recovery checkpoint (but somehow reached sometimes?) but it is
quite normal nowadays.  Or to the contrary, we're expecting it to
happen and it is regarded as a normal checkponit. So we should do
there nowadays are as the follows.

- If any later checkpoint/restartpoint has been established, just skip
  remaining task then return false. (!chkpt_was_latest)
  (I'm not sure this can happen, though.)

- we update control file only when archive recovery is still ongoing.

- Otherwise reset minRecoveryPoint then continue.

Do you have any thoughts or opinions?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 958220c495..ab8a4d9a1b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9658,6 +9658,7 @@ CreateRestartPoint(int flags)
XLogRecPtr  endptr;
XLogSegNo   _logSegNo;
TimestampTz xtime;
+   boolchkpt_was_latest = false;
 
/* Get a local copy of the last safe checkpoint record. */
SpinLockAcquire(&XLogCtl->info_lck);
@@ -9752,44 +9753,69 @@ CreateRestartPoint(int flags)
PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
/*
-* Update pg_control, using current time.  Check that it still shows
-* DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do 
nothing;
-* this is a quick hack to make sure nothing really bad happens if 
somehow
-* we get here after the end-of-recovery checkpoint.
+* Update pg_control, using current time if no later checkpoints have 
been
+* performed.
 */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-   if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
-   ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
+   if (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
{
+   chkpt_was_latest = true;
ControlFile->checkPoint = lastCheckPointRecPtr;
ControlFile->checkPointCopy = lastCheckPoint;
 
/*
-* Ensure minRecoveryPoint is past the checkpoint record.  
Normally,
-* this will have happened already while writing out dirty 
buffers,
-* but not necessarily - e.g. because no buffers were dirtied.  
We do
-* this because a non-exclusive base backup uses 
minRecoveryPoint to
-* determine which WAL files must be included in the backup, 
and the
-* file (or files) containing the checkpoint record must be 
included,
-* at a minimum. Note that for an ordinary restart of recovery 
there's
-* no value in having the minimum recovery point any earlier 
than this
+* Ensure minRecoveryPoint is past the checkpoint record while 
archive
+* recovery is still ongoing.  Normally, this will have happened
+* already while writing out dirty

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-06 Thread Kyotaro Horiguchi

At Fri, 4 Feb 2022 10:59:04 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2022/02/03 15:50, Kyotaro Horiguchi wrote:
> > On way to take. In that case should we log something like
> > "Restartpoint canceled" or something?
> 
> +1
> 
> 
> > By the way, restart point should start only while recoverying, and at
> > the timeof the start both checkpoint.redo and checkpoint LSN are
> > already past. We shouldn't update minRecovery point after promotion,
> > but is there any reason for not updating the checkPoint and
> > checkPointCopy?  If we update them after promotion, the
> > which-LSN-to-show problem would be gone.
> 
> I tried to find the reason by reading the past discussion, but have
> not found that yet.
> 
> If we update checkpoint and REDO LSN at pg_control in that case, we
> also need to update min recovery point at pg_control? Otherwise the
> min recovery point at pg_control still indicates the old LSN that
> previous restart point set.

I had an assuption that the reason I think it shouldn't update
minRecoveryPoint is that it has been or is going to be reset to
invalid LSN by promotion and the checkpoint should refrain from
touching it.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-03 Thread Fujii Masao





On 2022/02/03 15:50, Kyotaro Horiguchi wrote:

On way to take. In that case should we log something like
"Restartpoint canceled" or something?


+1



By the way, restart point should start only while recoverying, and at
the timeof the start both checkpoint.redo and checkpoint LSN are
already past. We shouldn't update minRecovery point after promotion,
but is there any reason for not updating the checkPoint and
checkPointCopy?  If we update them after promotion, the
which-LSN-to-show problem would be gone.


I tried to find the reason by reading the past discussion, but have not found 
that yet.

If we update checkpoint and REDO LSN at pg_control in that case, we also need 
to update min recovery point at pg_control? Otherwise the min recovery point at 
pg_control still indicates the old LSN that previous restart point set.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-02 Thread Kyotaro Horiguchi

At Thu, 3 Feb 2022 13:59:03 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2022/02/02 23:46, Bharath Rupireddy wrote:
> > On Tue, Feb 1, 2022 at 9:39 PM Fujii Masao
> >  wrote:
> >> I found that CreateRestartPoint() already reported the redo lsn as
> >> follows after emitting the restartpoint log message. To avoid
> >> duplicated logging of the same information, we should update this
> >> code?
> >>
> >>  ereport((log_checkpoints ? LOG : DEBUG2),
> >>  (errmsg("recovery restart point at %X/%X",
> >>  
> >> LSN_FORMAT_ARGS(lastCheckPoint.redo)),
> >>   xtime ? errdetail("Last completed transaction was
> >>   at log time %s.",
> >> 
> >> timestamptz_to_str(xtime))
> >> : 0));
> >>
> >> This code reports lastCheckPoint.redo as redo lsn. OTOH, with the
> >> patch, LogCheckpointEnd() reports
> >> ControlFile->checkPointCopy.redo. They may be different, for example,
> >> when the current DB state is not DB_IN_ARCHIVE_RECOVERY. In this case,
> >> which lsn should we report as redo lsn?
> > Do we ever reach CreateRestartPoint when ControlFile->stat !=
> > DB_IN_ARCHIVE_RECOVERY? Assert(ControlFile->state ==
> > DB_IN_ARCHIVE_RECOVERY); in CreateRestartPoint doesn't fail any
> > regression tests.
> 
> ISTM that CreateRestartPoint() can reach the condition
> ControlFile->state != DB_IN_ARCHIVE_RECOVERY. Please imagine the case
> where CreateRestartPoint() has already started and calls
> CheckPointGuts(). If the standby server is promoted while
> CreateRestartPoint() is flushing the data to disk at CheckPointGuts(),
> the state would be updated to DB_IN_PRODUCTION and
> CreateRestartPoint() can see the state != DB_IN_ARCHIVE_RECOVERY
> later.

By the way a comment there:
> * this is a quick hack to make sure nothing really bad happens if somehow
> * we get here after the end-of-recovery checkpoint.

now looks a bit wrong since now it's normal that a restartpoint ends
after promotion.

> As far as I read the code, this case seems to be able to make the
> server unrecoverable. If this case happens, since pg_control is not
> updated, pg_control still indicates the REDO LSN of last valid
> restartpoint. But CreateRestartPoint() seems to delete old WAL files
> based on its "current" REDO LSN not pg_control's REDO LSN. That is,
> WAL files required for the crash recovery starting from pg_control's
> REDO LSN would be removed.

Seems right. (I didn't confirm the behavior, though )

> If this understanding is right, to address this issue, probably we
> need to make CreateRestartPoint() do nothing (return immediately) when
> the state != DB_IN_ARCHIVE_RECOVERY?

On way to take. In that case should we log something like
"Restartpoint canceled" or something? 

By the way, restart point should start only while recoverying, and at
the timeof the start both checkpoint.redo and checkpoint LSN are
already past. We shouldn't update minRecovery point after promotion,
but is there any reason for not updating the checkPoint and
checkPointCopy?  If we update them after promotion, the
which-LSN-to-show problem would be gone.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-02 Thread Fujii Masao





On 2022/02/02 23:46, Bharath Rupireddy wrote:

On Tue, Feb 1, 2022 at 9:39 PM Fujii Masao  wrote:

I found that CreateRestartPoint() already reported the redo lsn as follows 
after emitting the restartpoint log message. To avoid duplicated logging of the 
same information, we should update this code?

 ereport((log_checkpoints ? LOG : DEBUG2),
 (errmsg("recovery restart point at %X/%X",
 LSN_FORMAT_ARGS(lastCheckPoint.redo)),
  xtime ? errdetail("Last completed transaction was at log 
time %s.",

timestamptz_to_str(xtime)) : 0));

This code reports lastCheckPoint.redo as redo lsn. OTOH, with the patch, 
LogCheckpointEnd() reports ControlFile->checkPointCopy.redo. They may be 
different, for example, when the current DB state is not DB_IN_ARCHIVE_RECOVERY. 
In this case, which lsn should we report as redo lsn?


Do we ever reach CreateRestartPoint when ControlFile->stat !=
DB_IN_ARCHIVE_RECOVERY? Assert(ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY); in CreateRestartPoint doesn't fail any
regression tests.


ISTM that CreateRestartPoint() can reach the condition ControlFile->state != 
DB_IN_ARCHIVE_RECOVERY. Please imagine the case where CreateRestartPoint() has 
already started and calls CheckPointGuts(). If the standby server is promoted 
while CreateRestartPoint() is flushing the data to disk at CheckPointGuts(), the 
state would be updated to DB_IN_PRODUCTION and CreateRestartPoint() can see the 
state != DB_IN_ARCHIVE_RECOVERY later.

As far as I read the code, this case seems to be able to make the server unrecoverable. 
If this case happens, since pg_control is not updated, pg_control still indicates the 
REDO LSN of last valid restartpoint. But CreateRestartPoint() seems to delete old WAL 
files based on its "current" REDO LSN not pg_control's REDO LSN. That is, WAL 
files required for the crash recovery starting from pg_control's REDO LSN would be 
removed.

If this understanding is right, to address this issue, probably we need to make 
CreateRestartPoint() do nothing (return immediately) when the state != 
DB_IN_ARCHIVE_RECOVERY?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-02 Thread Bharath Rupireddy

On Tue, Feb 1, 2022 at 9:39 PM Fujii Masao  wrote:
> I found that CreateRestartPoint() already reported the redo lsn as follows 
> after emitting the restartpoint log message. To avoid duplicated logging of 
> the same information, we should update this code?
>
> ereport((log_checkpoints ? LOG : DEBUG2),
> (errmsg("recovery restart point at %X/%X",
> LSN_FORMAT_ARGS(lastCheckPoint.redo)),
>  xtime ? errdetail("Last completed transaction was at 
> log time %s.",
>
> timestamptz_to_str(xtime)) : 0));
>
> This code reports lastCheckPoint.redo as redo lsn. OTOH, with the patch, 
> LogCheckpointEnd() reports ControlFile->checkPointCopy.redo. They may be 
> different, for example, when the current DB state is not 
> DB_IN_ARCHIVE_RECOVERY. In this case, which lsn should we report as redo lsn?

Do we ever reach CreateRestartPoint when ControlFile->stat !=
DB_IN_ARCHIVE_RECOVERY? Assert(ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY); in CreateRestartPoint doesn't fail any
regression tests.

Here's what can happen:
lastCheckPoint.redo is 100 and ControlFile->checkPointCopy.redo is
105, so, "skipping restartpoint, already performed at %X/%X"
LogCheckpointEnd isn't reached
lastCheckPoint.redo is 105 and ControlFile->checkPointCopy.redo is 100
and ControlFile->state == DB_IN_ARCHIVE_RECOVERY, then the control
file gets updated and LogCheckpointEnd prints the right redo lsn.
lastCheckPoint.redo is 105 and ControlFile->checkPointCopy.redo is 100
and ControlFile->state != DB_IN_ARCHIVE_RECOVERY, the the control file
doesn't get updated and LogCheckpointEnd just prints the control redo
lsn. Looks like this case is rare given Assert(ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY); doesn't fail any tests.

I think we should just let LogCheckpointEnd print the redo lsn from
the control file. We can just remove the above errmsg("recovery
restart point at %X/%X" message altogether or just print it only in
the rare scenario, something like below:

if (ControlFile->state != DB_IN_ARCHIVE_RECOVERY)
{
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("performed recovery restart point at %X/%X while
the database state is %s",
LSN_FORMAT_ARGS(lastCheckPoint.redo)),
getDBState(ControlFile->state)));
}

And the last commit/abort records's timestamp will always get logged
even before we reach here in the main redo loop (errmsg("last
completed transaction was at log time %s").

Or another way is to just pass the redo lsn to LogCheckpointEnd and
pass the lastCheckPoint.redo in if (ControlFile->state !=
DB_IN_ARCHIVE_RECOVERY) cases or when control file isn't updated but
restart point happened.

Thoughts?

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-01 Thread Stephen Frost

Greetings,

* Fujii Masao (masao.fu...@oss.nttdata.com) wrote:
> On 2022/02/01 22:03, Bharath Rupireddy wrote:
> >On Tue, Feb 1, 2022 at 11:58 AM Kyotaro Horiguchi
> > wrote:
> >>>Modified in v8.
> >>
> >>0001 looks good to me.
> 
> I found that CreateRestartPoint() already reported the redo lsn as follows 
> after emitting the restartpoint log message. To avoid duplicated logging of 
> the same information, we should update this code?
> 
>   ereport((log_checkpoints ? LOG : DEBUG2),
>   (errmsg("recovery restart point at %X/%X",
>   LSN_FORMAT_ARGS(lastCheckPoint.redo)),
>xtime ? errdetail("Last completed transaction was at 
> log time %s.",
>  
> timestamptz_to_str(xtime)) : 0));
> 
> This code reports lastCheckPoint.redo as redo lsn. OTOH, with the patch, 
> LogCheckpointEnd() reports ControlFile->checkPointCopy.redo. They may be 
> different, for example, when the current DB state is not 
> DB_IN_ARCHIVE_RECOVERY. In this case, which lsn should we report as redo lsn?
> 
> + "lsn=%X/%X, redo lsn=%X/%X",
> 
> Originally you proposed to use upper cases for "lsn". But the latest patch 
> uses the lower cases. Why? It seems better to use upper cases, i.e., LSN and 
> REDO LSN because LSN is basically used in other errmsg().

We do use 'lsn=' quite a bit in verify_nbtree.c already and lowercase is
also what's in the various functions and views in the catalog in the
database, of course.  I don't see even one usage of "LSN=" in the tree
today.  We also use 'lsn %X/%X' in replorigindesc.c, xactdesc.c,
xactdesc.c, tablesync.c, standby.c, parsexlog.c, then 'redo %X/%X' in
xactdesc.c.

xlog.c does have a number of "WAL location (LSN)", along with a bunch of
other usages.  logical.c uses both "LSN" and "lsn".  worker.c uses
"LSN".  slot.c uses "restart_lsn". pg_rewind.c uses "WAL location" while
pg_waldump.c uses, "lsn:", "WAL location", and "WAL record".

Overall, we don't seem to be super consistent, but I'd say that 'lsn='
looks the best, to my eyes anyway, and isn't out of place in the code
base.  Lowercase seems to generally be more common overall. 

> >Attaching the above changes 0003 (0001 and 0002 remain the same). If
> >the committer doesn't agree on the text or wording in 0003, I would
> >like the 0001 and 0002 to be taken here and I can start a new thread
> >for discussing 0003 separately.
> 
> Personally I'm ok with 001, but regarding 0002 and 0003 patches, I'm not sure 
> if it's really worth replacing "location" with "lsn" there. BTW, the similar 
> idea was proposed at [1] before, but seems "location" was left as it was.
> 
> [1]
> https://postgr.es/m/20487.1494514...@sss.pgh.pa.us

This discussion strikes me as sufficient reason to make the change, with
the prior comment not really having all that much weight.  When we're
actually pretty consistent with one term, having random places where we
are inconsistent leads people to be unsure about which way to go and
then we end up having to have this discussion.  Would be great to avoid
having to have it again in the future.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-01 Thread Fujii Masao





On 2022/02/01 22:03, Bharath Rupireddy wrote:

On Tue, Feb 1, 2022 at 11:58 AM Kyotaro Horiguchi
 wrote:

Modified in v8.


0001 looks good to me.


I found that CreateRestartPoint() already reported the redo lsn as follows 
after emitting the restartpoint log message. To avoid duplicated logging of the 
same information, we should update this code?

ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
LSN_FORMAT_ARGS(lastCheckPoint.redo)),
 xtime ? errdetail("Last completed transaction was at log 
time %s.",
   
timestamptz_to_str(xtime)) : 0));

This code reports lastCheckPoint.redo as redo lsn. OTOH, with the patch, 
LogCheckpointEnd() reports ControlFile->checkPointCopy.redo. They may be 
different, for example, when the current DB state is not DB_IN_ARCHIVE_RECOVERY. 
In this case, which lsn should we report as redo lsn?

+   "lsn=%X/%X, redo lsn=%X/%X",

Originally you proposed to use upper cases for "lsn". But the latest patch uses 
the lower cases. Why? It seems better to use upper cases, i.e., LSN and REDO LSN because 
LSN is basically used in other errmsg().


Attaching the above changes 0003 (0001 and 0002 remain the same). If
the committer doesn't agree on the text or wording in 0003, I would
like the 0001 and 0002 to be taken here and I can start a new thread
for discussing 0003 separately.


Personally I'm ok with 001, but regarding 0002 and 0003 patches, I'm not sure if it's really worth replacing 
"location" with "lsn" there. BTW, the similar idea was proposed at [1] before, but seems 
"location" was left as it was.

[1]
https://postgr.es/m/20487.1494514...@sss.pgh.pa.us

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-02-01 Thread Bharath Rupireddy

On Tue, Feb 1, 2022 at 11:58 AM Kyotaro Horiguchi
 wrote:
> > Modified in v8.
>
> 0001 looks good to me.
>
> I tend to agree to 0002.

Thanks.

> FWIW, I collected other user-facing usage of "location" as LSN.
>
> xlog.c:11298, 11300:  (in backup-label)
>   appendStringInfo(labelfile, "START WAL LOCATION: %X/%X (file %s)\n",
>   appendStringInfo(labelfile, "CHECKPOINT LOCATION: %X/%X\n",
>   (and corresponding reader-code)
>
> xlog,c:11791, 11793:  (in backup history file)
>   fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
>   fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
>   (and corresponding reader-code)

I tried to change the "location" to "lsn" in most of the user-facing
messages/text. I refrained from changing the bakup_label file content
(above) as it might break many applications/service layer code and
it's not good for backward compatibility.

Attaching the above changes 0003 (0001 and 0002 remain the same). If
the committer doesn't agree on the text or wording in 0003, I would
like the 0001 and 0002 to be taken here and I can start a new thread
for discussing 0003 separately.

Regards,
Bharath Rupireddy.

v8-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data

v8-0002-Change-location-to-lsn-in-pg_controldata.patch
Description: Binary data

v8-0003-Change-location-to-lsn-in-user-facing-text.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Kyotaro Horiguchi

At Tue, 1 Feb 2022 10:08:04 +0530, Bharath Rupireddy 
 wrote in 
> On Tue, Feb 1, 2022 at 9:49 AM Fujii Masao  
> wrote:
> >
> > My previous comment was confusing... Probably I understand why you tried to 
> > put this information in checkpoint log message. But I was suggesting to put 
> > that information at the end of log message instead of the beginning of it. 
> > Because ordinary users would be less interested in this LSN information 
> > than other ones like the number of buffers written.
> 
> Actually, there's no strong reason to put LSN info at the beginning of
> the message except that LSN/REDO LSN next to the
> checkpoint/restartpoint complete would make the users understand the
> LSN and REDO LSN belong to the checkpoint/restartpoint. Since this
> wasn't a strong reason, I agree to keep it at the end.
> 
> Modified in v8.
> 
> [1]
> 2022-02-01 04:34:17.657 UTC [3597073] LOG:  checkpoint complete: wrote
> 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=0.004 s, sync=0.008 s, total=0.031 s; sync files=18,
> longest=0.006 s, average=0.001 s; distance=77 kB, estimate=77 kB;
> lsn=0/14D5AF0, redo lsn=0/14D5AB8

0001 looks good to me.

I tend to agree to 0002.


FWIW, I collected other user-facing usage of "location" as LSN.

xlog.c:5965, 6128:
  (errmsg("recovery stopping before WAL location (LSN) \"%X/%X\"",

xlog.c:6718:
  (errmsg("control file contains invalid checkpoint location")));

xlog.c:6846:
  (errmsg("starting point-in-time recovery to WAL location (LSN) \"%X/%X\"",

xlog.c:6929:
  (errmsg("could not find redo location referenced by checkpoint record"),

xlog.c:11298, 11300:  (in backup-label)
  appendStringInfo(labelfile, "START WAL LOCATION: %X/%X (file %s)\n",
  appendStringInfo(labelfile, "CHECKPOINT LOCATION: %X/%X\n",
  (and corresponding reader-code)

xlog,c:11791, 11793:  (in backup history file)
  fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
  fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
  (and corresponding reader-code)

repl_scanner.l:151:
  yyerror("invalid streaming start location");

pg_proc.dat:
  many function descriptions use "location" as LSN.

pg_waldump.c:768,777,886,938,1029,1071,1083:
  printf(_("  -e, --end=RECPTR   stop reading at WAL location RECPTR\n"));
  printf(_("  -s, --start=RECPTR start reading at WAL location RECPTR\n"));
  pg_log_error("could not parse end WAL location \"%s\"",
  pg_log_error("could not parse start WAL location \"%s\"",
  pg_log_error("start WAL location %X/%X is not inside file \"%s\"",
  pg_log_error("end WAL location %X/%X is not inside file \"%s\"",
  pg_log_error("no start WAL location given");

pg_basebackup.c:476, 615: (confusing with file/directory path..)
  pg_log_error("could not parse write-ahead log location \"%s\"",
  pg_log_error("could not parse write-ahead log location \"%s\"",

pg_rewind.c:346:
  pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
pg_rewind/timeline.c:82:
  pg_log_error("Expected a write-ahead log switchpoint location.");

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Bharath Rupireddy

On Tue, Feb 1, 2022 at 9:49 AM Fujii Masao  wrote:
>
> My previous comment was confusing... Probably I understand why you tried to 
> put this information in checkpoint log message. But I was suggesting to put 
> that information at the end of log message instead of the beginning of it. 
> Because ordinary users would be less interested in this LSN information than 
> other ones like the number of buffers written.

Actually, there's no strong reason to put LSN info at the beginning of
the message except that LSN/REDO LSN next to the
checkpoint/restartpoint complete would make the users understand the
LSN and REDO LSN belong to the checkpoint/restartpoint. Since this
wasn't a strong reason, I agree to keep it at the end.

Modified in v8.

[1]
2022-02-01 04:34:17.657 UTC [3597073] LOG:  checkpoint complete: wrote
21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.004 s, sync=0.008 s, total=0.031 s; sync files=18,
longest=0.006 s, average=0.001 s; distance=77 kB, estimate=77 kB;
lsn=0/14D5AF0, redo lsn=0/14D5AB8

Regards,
Bharath Rupireddy.


v8-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data


v8-0002-Change-location-to-lsn-in-pg_controldata.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Fujii Masao





On 2022/02/01 13:01, Bharath Rupireddy wrote:

On Tue, Feb 1, 2022 at 9:10 AM Fujii Masao  wrote:

The order of arguments for LSN seems wrong. 
LSN_FORMAT_ARGS(ControlFile->checkPoint) should be specified ahead of 
LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)?


Thanks. Corrected.


Thanks!


Could you tell me why the information for LSN is reported earlierly in the log 
message? Since ordinally users would be more interested in the information 
about I/O by checkpoint, the information for LSN should be placed later? Sorry 
if this was already discussed.


It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more context
while analyzing checkpoint-related issues. The pg_controldata gives
the last checkpoint LSN and REDO LSN, but having this info alongside
the log message helps analyze issues that happened previously, connect
the dots and identify the root cause.


My previous comment was confusing... Probably I understand why you tried to put 
this information in checkpoint log message. But I was suggesting to put that 
information at the end of log message instead of the beginning of it. Because 
ordinary users would be less interested in this LSN information than other ones 
like the number of buffers written.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Bharath Rupireddy

On Tue, Feb 1, 2022 at 9:10 AM Fujii Masao  wrote:
> The order of arguments for LSN seems wrong. 
> LSN_FORMAT_ARGS(ControlFile->checkPoint) should be specified ahead of 
> LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)?

Thanks. Corrected.

> Could you tell me why the information for LSN is reported earlierly in the 
> log message? Since ordinally users would be more interested in the 
> information about I/O by checkpoint, the information for LSN should be placed 
> later? Sorry if this was already discussed.

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more context
while analyzing checkpoint-related issues. The pg_controldata gives
the last checkpoint LSN and REDO LSN, but having this info alongside
the log message helps analyze issues that happened previously, connect
the dots and identify the root cause.

Regards,
Bharath Rupireddy.

v7-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data

v7-0002-Change-location-to-lsn-in-pg_controldata.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Fujii Masao





On 2022/02/01 10:44, Nathan Bossart wrote:

On Tue, Feb 01, 2022 at 12:23:10AM +0530, Bharath Rupireddy wrote:

On Tue, Feb 1, 2022 at 12:00 AM Nathan Bossart  wrote:

I think the pg_controldata change needs some extra spaces for alignment,
but otherwise these patches seem reasonable to me.


Thanks. My bad it was. Changed in v6.


-   (errmsg("restartpoint complete: wrote %d buffers 
(%.1f%%); "
+   (errmsg("restartpoint complete: lsn=%X/%X, redo 
lsn=%X/%X; "
+   "wrote %d buffers (%.1f%%); "
"%d WAL file(s) added, %d removed, 
%d recycled; "
"write=%ld.%03d s, sync=%ld.%03d s, 
total=%ld.%03d s; "
"sync files=%d, longest=%ld.%03d s, 
average=%ld.%03d s; "
"distance=%d kB, estimate=%d 
kB",
+   
LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),
+   
LSN_FORMAT_ARGS(ControlFile->checkPoint),

The order of arguments for LSN seems wrong. 
LSN_FORMAT_ARGS(ControlFile->checkPoint) should be specified ahead of 
LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo)?

Could you tell me why the information for LSN is reported earlierly in the log 
message? Since ordinally users would be more interested in the information 
about I/O by checkpoint, the information for LSN should be placed later? Sorry 
if this was already discussed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Nathan Bossart

On Tue, Feb 01, 2022 at 12:23:10AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 1, 2022 at 12:00 AM Nathan Bossart  
> wrote:
>> I think the pg_controldata change needs some extra spaces for alignment,
>> but otherwise these patches seem reasonable to me.
> 
> Thanks. My bad it was. Changed in v6.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Bharath Rupireddy

On Tue, Feb 1, 2022 at 12:00 AM Nathan Bossart  wrote:
>
> On Mon, Jan 31, 2022 at 11:45:19PM +0530, Bharath Rupireddy wrote:
> > Thanks. Here are 2 patches, 0001 for adding checkpoint lsn and redo
> > lsn in the checkpoint completed message and 0002 for changing the
> > "location" to LSN in pg_controdata's output. With the 0002,
> > pg_control_checkpont, pg_controldata and checkpoint completed message
> > will all be in sync with the checkpoint lsn and redo lsn.
>
> I think the pg_controldata change needs some extra spaces for alignment,
> but otherwise these patches seem reasonable to me.

Thanks. My bad it was. Changed in v6.

Regards,
Bharath Rupireddy.


v6-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data


v6-0002-change-location-to-lsn-in-pg_controldata.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Nathan Bossart

On Mon, Jan 31, 2022 at 11:45:19PM +0530, Bharath Rupireddy wrote:
> Thanks. Here are 2 patches, 0001 for adding checkpoint lsn and redo
> lsn in the checkpoint completed message and 0002 for changing the
> "location" to LSN in pg_controdata's output. With the 0002,
> pg_control_checkpont, pg_controldata and checkpoint completed message
> will all be in sync with the checkpoint lsn and redo lsn.

I think the pg_controldata change needs some extra spaces for alignment,
but otherwise these patches seem reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Bharath Rupireddy

On Mon, Jan 31, 2022 at 8:41 PM Stephen Frost  wrote:
>
> > Thanks for your review. In summary, we have these options to choose
> > checkpoint LSN and last REDO LSN:
> >
> > 1)  "start=%X/%X, end=%X/%X" (ControlFile->checkPointCopy.redo,
> > ControlFile->checkPoint)
> > 2)  "lsn=%X/%X, redo lsn=%X/%X"
> > 3) "location=%X/%X, REDO location=%X/%X" -- similar to what
> > pg_controldata and pg_control_checkpoint shows currently.
> > 4) "location=%X/%X, REDO start location=%X/%X"
> >
> > I will leave that decision to the committer.
>
> I'd also vote for #2.  Regarding 3 and 4, I'd argue that those should
> have been changed when we changed a number of other things from the
> generic 'location' to be 'lsn' in system views and functions, and
> therefore we should go change those to also specify 'lsn' rather than
> just saying 'location'.

Thanks. Here are 2 patches, 0001 for adding checkpoint lsn and redo
lsn in the checkpoint completed message and 0002 for changing the
"location" to LSN in pg_controdata's output. With the 0002,
pg_control_checkpont, pg_controldata and checkpoint completed message
will all be in sync with the checkpoint lsn and redo lsn.

Regards,
Bharath Rupireddy.
From beafea457cd80266c6e517862ed819da7b3c8ec0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 31 Jan 2022 16:51:47 +
Subject: [PATCH v5] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN(end) and REDO LSN(start). It gives more
context while analyzing checkpoint-related issues. The pg_controldata
gives the last checkpoint LSN and REDO LSN, but having this info
alongside the log message helps analyze issues that happened
previously, connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..71a5239619 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8919,11 +8919,14 @@ LogCheckpointEnd(bool restartpoint)
 
 	if (restartpoint)
 		ereport(LOG,
-(errmsg("restartpoint complete: wrote %d buffers (%.1f%%); "
+(errmsg("restartpoint complete: lsn=%X/%X, redo lsn=%X/%X; "
+		"wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
 		"distance=%d kB, estimate=%d kB",
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -8939,11 +8942,14 @@ LogCheckpointEnd(bool restartpoint)
 		(int) (CheckPointDistanceEstimate / 1024.0;
 	else
 		ereport(LOG,
-(errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
+(errmsg("checkpoint complete: lsn=%X/%X, redo lsn=%X/%X; "
+		"wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
 		"distance=%d kB, estimate=%d kB",
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
-- 
2.25.1

From fd466fb14f16db15292974b475338335668148bf Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 31 Jan 2022 17:50:40 +
Subject: [PATCH v5] change "location" to "lsn" in pg_controldata

---
 src/bin/pg_controldata/pg_controldata.c| 10 +-
 src/test/recovery/t/016_min_consistency.pl |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..0c5ef7804c 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -235,9 +235,9 @@ main(int argc, char *argv[])
 		   dbState(ControlFile->state));
 	printf(_("pg_control last modified: %s\n"),
 		   pgctime_str);
-	printf(_("Latest checkpoint location:   %X/%X\n"),
+	printf(_("Latest checkpoint LSN:   %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->checkPoint));
-	printf(_(&qu

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-31 Thread Stephen Frost

Greetings,

* Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> On Fri, Jan 28, 2022 at 11:16 AM Nathan Bossart
>  wrote:
> > I know I voted for "start=%X/%X, end=%X/%X," but looking at this again, I
> > wonder if it could be misleading.  "start" is the redo location, and "end"
> > is the location of the checkpoint record, but I could understand why
> > someone might think that "start" is the location of the previous checkpoint
> > record and "end" is the redo location of the new one.  I think your
> > original idea of "lsn=%X/%X, redo lsn=%X/%X" could be a good alternative.
> >
> > Іn any case, this patch is small and otherwise looks reasonable to me, so I
> > am going to mark it as ready-for-committer.
> 
> Thanks for your review. In summary, we have these options to choose
> checkpoint LSN and last REDO LSN:
> 
> 1)  "start=%X/%X, end=%X/%X" (ControlFile->checkPointCopy.redo,
> ControlFile->checkPoint)
> 2)  "lsn=%X/%X, redo lsn=%X/%X"
> 3) "location=%X/%X, REDO location=%X/%X" -- similar to what
> pg_controldata and pg_control_checkpoint shows currently.
> 4) "location=%X/%X, REDO start location=%X/%X"
> 
> I will leave that decision to the committer.

I'd also vote for #2.  Regarding 3 and 4, I'd argue that those should
have been changed when we changed a number of other things from the
generic 'location' to be 'lsn' in system views and functions, and
therefore we should go change those to also specify 'lsn' rather than
just saying 'location'.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-27 Thread Bharath Rupireddy

On Fri, Jan 28, 2022 at 11:16 AM Nathan Bossart
 wrote:
>
> I know I voted for "start=%X/%X, end=%X/%X," but looking at this again, I
> wonder if it could be misleading.  "start" is the redo location, and "end"
> is the location of the checkpoint record, but I could understand why
> someone might think that "start" is the location of the previous checkpoint
> record and "end" is the redo location of the new one.  I think your
> original idea of "lsn=%X/%X, redo lsn=%X/%X" could be a good alternative.
>
> Іn any case, this patch is small and otherwise looks reasonable to me, so I
> am going to mark it as ready-for-committer.

Thanks for your review. In summary, we have these options to choose
checkpoint LSN and last REDO LSN:

1)  "start=%X/%X, end=%X/%X" (ControlFile->checkPointCopy.redo,
ControlFile->checkPoint)
2)  "lsn=%X/%X, redo lsn=%X/%X"
3) "location=%X/%X, REDO location=%X/%X" -- similar to what
pg_controldata and pg_control_checkpoint shows currently.
4) "location=%X/%X, REDO start location=%X/%X"

I will leave that decision to the committer.

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-27 Thread Nathan Bossart

On Fri, Jan 28, 2022 at 08:43:36AM +0530, Bharath Rupireddy wrote:
> 2022-01-28 03:06:10.213 UTC [2409486] LOG:  checkpoint starting:
> immediate force wait
> 2022-01-28 03:06:10.257 UTC [2409486] LOG:  checkpoint complete:
> start=0/14D9510, end=0/14D9548; wrote 4 buffers (0.0%); 0 WAL file(s)
> added, 0 removed, 0 recycled; write=0.007 s, sync=0.008 s, total=0.044
> s; sync files=3, longest=0.005 s, average=0.003 s; distance=0 kB,
> estimate=0 kB
> 
> 2022-01-28 03:06:42.254 UTC [2409486] LOG:  checkpoint starting:
> immediate force wait
> 2022-01-28 03:06:42.279 UTC [2409486] LOG:  checkpoint complete:
> start=0/14DBDB8, end=0/14DBDF0; wrote 2 buffers (0.0%); 0 WAL file(s)
> added, 0 removed, 0 recycled; write=0.004 s, sync=0.004 s, total=0.025
> s; sync files=2, longest=0.003 s, average=0.002 s; distance=10 kB,
> estimate=10 kB

I know I voted for "start=%X/%X, end=%X/%X," but looking at this again, I
wonder if it could be misleading.  "start" is the redo location, and "end"
is the location of the checkpoint record, but I could understand why
someone might think that "start" is the location of the previous checkpoint
record and "end" is the redo location of the new one.  I think your
original idea of "lsn=%X/%X, redo lsn=%X/%X" could be a good alternative.

Іn any case, this patch is small and otherwise looks reasonable to me, so I
am going to mark it as ready-for-committer.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-27 Thread Bharath Rupireddy

On Thu, Jan 20, 2022 at 8:30 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 20 Jan 2022 00:36:32 +, "Bossart, Nathan"  
> wrote in
> > On 1/3/22, 5:52 PM, "Kyotaro Horiguchi"  wrote:
> > > It seems to me "LSN" or just "location" is more confusing or
> > > mysterious than "REDO LSN" for the average user. If we want to avoid
> > > being technically too detailed, we would use just "start LSN=%X/%X,
> > > end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..
> >
> > My first instinct was that this should stay aligned with
> > pg_controldata, but that would mean using "location=%X/%X, REDO
> > location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
> > "checkpoint location" is the LSN of the WAL record for the checkpoint,
> > and the "checkpoint's REDO location" is the LSN where checkpoint
> > creation began (i.e., what you must retain for crash recovery).  My
> > vote is for "start=%X/%X, end=%X/%X."
>
> +1. Works for me.  %X/%X itself expresses it is an LSN.

Thanks for the review. Here's the v4 patch, please have a look.

Let's not attempt to change how pg_controldata (tool and core
functions) emit the start and end LSN as checkpoint_lsn/redo_lsn and
checkpoint location/checkpoint's REDO location.

[1]
2022-01-28 03:06:10.213 UTC [2409486] LOG:  checkpoint starting:
immediate force wait
2022-01-28 03:06:10.257 UTC [2409486] LOG:  checkpoint complete:
start=0/14D9510, end=0/14D9548; wrote 4 buffers (0.0%); 0 WAL file(s)
added, 0 removed, 0 recycled; write=0.007 s, sync=0.008 s, total=0.044
s; sync files=3, longest=0.005 s, average=0.003 s; distance=0 kB,
estimate=0 kB

2022-01-28 03:06:42.254 UTC [2409486] LOG:  checkpoint starting:
immediate force wait
2022-01-28 03:06:42.279 UTC [2409486] LOG:  checkpoint complete:
start=0/14DBDB8, end=0/14DBDF0; wrote 2 buffers (0.0%); 0 WAL file(s)
added, 0 removed, 0 recycled; write=0.004 s, sync=0.004 s, total=0.025
s; sync files=2, longest=0.003 s, average=0.002 s; distance=10 kB,
estimate=10 kB

Regards,
Bharath Rupireddy.


v4-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-27 Thread Nathan Bossart

On Thu, Jan 27, 2022 at 08:37:37PM +0530, Bharath Rupireddy wrote:
> I'm still not clear how the REDO location can be treated as a start
> LSN? Can someone throw some light one what this checkpoint's REDO
> location is?

It's the WAL insert location at the time the checkpoint began (i.e., where
you need to begin replaying WAL from after a crash).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-27 Thread Bharath Rupireddy

On Thu, Jan 20, 2022 at 6:06 AM Bossart, Nathan  wrote:
>
> On 1/3/22, 5:52 PM, "Kyotaro Horiguchi"  wrote:
> > It seems to me "LSN" or just "location" is more confusing or
> > mysterious than "REDO LSN" for the average user. If we want to avoid
> > being technically too detailed, we would use just "start LSN=%X/%X,
> > end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..
>
> My first instinct was that this should stay aligned with
> pg_controldata, but that would mean using "location=%X/%X, REDO
> location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
> "checkpoint location" is the LSN of the WAL record for the checkpoint,
> and the "checkpoint's REDO location" is the LSN where checkpoint
> creation began (i.e., what you must retain for crash recovery).  My
> vote is for "start=%X/%X, end=%X/%X."

I'm still not clear how the REDO location can be treated as a start
LSN? Can someone throw some light one what this checkpoint's REDO
location is?

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-19 Thread Kyotaro Horiguchi

At Thu, 20 Jan 2022 00:36:32 +, "Bossart, Nathan"  
wrote in 
> On 1/3/22, 5:52 PM, "Kyotaro Horiguchi"  wrote:
> > It seems to me "LSN" or just "location" is more confusing or
> > mysterious than "REDO LSN" for the average user. If we want to avoid
> > being technically too detailed, we would use just "start LSN=%X/%X,
> > end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..
> 
> My first instinct was that this should stay aligned with
> pg_controldata, but that would mean using "location=%X/%X, REDO
> location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
> "checkpoint location" is the LSN of the WAL record for the checkpoint,
> and the "checkpoint's REDO location" is the LSN where checkpoint
> creation began (i.e., what you must retain for crash recovery).  My
> vote is for "start=%X/%X, end=%X/%X."

+1. Works for me.  %X/%X itself expresses it is an LSN.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-19 Thread Bossart, Nathan

On 1/3/22, 5:52 PM, "Kyotaro Horiguchi"  wrote:
> It seems to me "LSN" or just "location" is more confusing or
> mysterious than "REDO LSN" for the average user. If we want to avoid
> being technically too detailed, we would use just "start LSN=%X/%X,
> end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..

My first instinct was that this should stay aligned with
pg_controldata, but that would mean using "location=%X/%X, REDO
location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
"checkpoint location" is the LSN of the WAL record for the checkpoint,
and the "checkpoint's REDO location" is the LSN where checkpoint
creation began (i.e., what you must retain for crash recovery).  My
vote is for "start=%X/%X, end=%X/%X."

Nathan

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-14 Thread Bharath Rupireddy

On Thu, Jan 13, 2022 at 11:50 AM Bharath Rupireddy
 wrote:
> Thanks.  IMO, the following format of logging is better, so attaching
> the v2-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch as
> .patch
>
> 2021-12-28 02:52:24.464 UTC [2394396] LOG:  checkpoint completed at
> location=0/212FFC8 with REDO start location=0/212FF90: wrote 451
> buffers (2.8%); 0 WAL file(s) added, 0 removed, 1 recycled;
> write=0.012 s, sync=0.032 s, total=0.071 s; sync files=6,
> longest=0.022 s, average=0.006 s; distance=6272 kB, estimate=6272 kB

One of the test cases was failing with the above style of the log
message, changing "checkpoint complete" to "checkpoint completed at
location" doesn't seem to be a better idea. It looks like the users or
the log monitoring tools might be using the same text "checkpoint
complete", therefore I don't want to break that. Here's the v3 patch
that I think will work better. Please review.

Regards,
Bharath Rupireddy.
From b6f2a53bcaea2d37e317bc3d9922893c10712d3a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 14 Jan 2022 12:55:30 +0000
Subject: [PATCH v3] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN and REDO LSN. It gives more context while
analyzing checkpoint-related issues. The pg_controldata gives the
last checkpoint LSN and REDO LSN, but having this info alongside
the log message helps analyze issues that happened previously,
connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..b2eabc72f9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8920,11 +8920,14 @@ LogCheckpointEnd(bool restartpoint)
 
 	if (restartpoint)
 		ereport(LOG,
-(errmsg("restartpoint complete: wrote %d buffers (%.1f%%); "
+(errmsg("restartpoint complete: location=%X/%X, REDO start location=%X/%X; "
+		"wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
 		"distance=%d kB, estimate=%d kB",
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
@@ -8940,11 +8943,14 @@ LogCheckpointEnd(bool restartpoint)
 		(int) (CheckPointDistanceEstimate / 1024.0;
 	else
 		ereport(LOG,
-(errmsg("checkpoint complete: wrote %d buffers (%.1f%%); "
+(errmsg("checkpoint complete: location=%X/%X, REDO start location=%X/%X; "
+		"wrote %d buffers (%.1f%%); "
 		"%d WAL file(s) added, %d removed, %d recycled; "
 		"write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
 		"sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
 		"distance=%d kB, estimate=%d kB",
+		LSN_FORMAT_ARGS(ControlFile->checkPoint),
+		LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),
 		CheckpointStats.ckpt_bufs_written,
 		(double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
 		CheckpointStats.ckpt_segs_added,
-- 
2.25.1

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-12 Thread Bharath Rupireddy

On Wed, Jan 12, 2022 at 11:39 AM Julien Rouhaud  wrote:
>
> Hi,
>
> On Tue, Dec 28, 2021 at 10:56 AM Bharath Rupireddy
>  wrote:
> >
> > attaching v1-0001-XXX from the initial mail again just for the sake of
> > completion:
>
> Unfortunately this breaks the cfbot as it tries to apply this patch
> too: http://cfbot.cputube.org/patch_36_3474.log.
>
> For this kind of situation I think that the usual solution is to use a
> .txt extension to make sure that the cfbot won't try to apply it.

Thanks.  IMO, the following format of logging is better, so attaching
the v2-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch as
.patch

2021-12-28 02:52:24.464 UTC [2394396] LOG:  checkpoint completed at
location=0/212FFC8 with REDO start location=0/212FF90: wrote 451
buffers (2.8%); 0 WAL file(s) added, 0 removed, 1 recycled;
write=0.012 s, sync=0.032 s, total=0.071 s; sync files=6,
longest=0.022 s, average=0.006 s; distance=6272 kB, estimate=6272 kB

Others are attached as .txt files.

Regards,
Bharath Rupireddy.
From c23010c67a66dd21e318ae4d475ee0d85c5c1d08 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 28 Dec 2021 02:52:49 +
Subject: [PATCH v2] Add checkpoint and redo LSN to LogCheckpointEnd log
 message

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN and REDO LSN. It gives more context while
analyzing checkpoint-related issues. The pg_controldata gives the
last checkpoint LSN and REDO LSN, but having this info alongside
the log message helps analyze issues that happened previously,
connect the dots and identify the root cause.
---
 src/backend/access/transam/xlog.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 1e1fbe957f..cd3fce6a2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8920,11 +8920,14 @@ LogCheckpointEnd(bool restartpoint)
 
if (restartpoint)
ereport(LOG,
-   (errmsg("restartpoint complete: wrote %d 
buffers (%.1f%%); "
+   (errmsg("restartpoint completed at 
location=%X/%X with REDO start location=%X/%X: "
+   "wrote %d buffers (%.1f%%); "
"%d WAL file(s) added, %d 
removed, %d recycled; "
"write=%ld.%03d s, 
sync=%ld.%03d s, total=%ld.%03d s; "
"sync files=%d, 
longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d 
kB",
+   
LSN_FORMAT_ARGS(ControlFile->checkPoint),
+   
LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),

CheckpointStats.ckpt_bufs_written,
(double) 
CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
@@ -8940,11 +8943,14 @@ LogCheckpointEnd(bool restartpoint)
(int) 
(CheckPointDistanceEstimate / 1024.0;
else
ereport(LOG,
-   (errmsg("checkpoint complete: wrote %d buffers 
(%.1f%%); "
+   (errmsg("checkpoint completed at location=%X/%X 
with REDO start location=%X/%X: "
+   "wrote %d buffers (%.1f%%); "
"%d WAL file(s) added, %d 
removed, %d recycled; "
"write=%ld.%03d s, 
sync=%ld.%03d s, total=%ld.%03d s; "
"sync files=%d, 
longest=%ld.%03d s, average=%ld.%03d s; "
"distance=%d kB, estimate=%d 
kB",
+   
LSN_FORMAT_ARGS(ControlFile->checkPoint),
+   
LSN_FORMAT_ARGS(ControlFile->checkPointCopy.redo),

CheckpointStats.ckpt_bufs_written,
(double) 
CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
CheckpointStats.ckpt_segs_added,
-- 
2.25.1



v2-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data
From 999cfd53ce4e16ccfff94c0022cd80fe8ff84be5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 28 Dec 2021 02:45:51 +
Subject: [PATCH

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud

On Wed, Jan 12, 2022 at 2:52 PM Thomas Munro  wrote:
>
> By way of documentation, I've just now tried to answer these question
> in the new FAQ at:
>
> https://wiki.postgresql.org/wiki/Cfbot

Great!  Thanks a lot!

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane

Thomas Munro  writes:
> By way of documentation, I've just now tried to answer these question
> in the new FAQ at:
> https://wiki.postgresql.org/wiki/Cfbot

Thanks!

regards, tom lane

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Thomas Munro

On Wed, Jan 12, 2022 at 7:37 PM Tom Lane  wrote:
> Julien Rouhaud  writes:
> > On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
> >> 2. You are attaching some random files, and would like to not
> >> displace the cfbot's idea of the latest patchset.
>
> > I'm assuming that someone wanting to send an additional patch to be applied 
> > on
> > top of the OP patchset is part of 2?
>
> AFAIK, if you're submitting a patch then you have to attach a complete
> patchset, or the cfbot will be totally lost.  Getting the bot to
> understand incremental patches would be useful for sure ... but it's
> outside the scope of what I'm asking for now, which is just clear
> documentation of what the bot can do already.

By way of documentation, I've just now tried to answer these question
in the new FAQ at:

https://wiki.postgresql.org/wiki/Cfbot

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud

On Wed, Jan 12, 2022 at 2:37 PM Tom Lane  wrote:
>
> AFAIK, if you're submitting a patch then you have to attach a complete
> patchset, or the cfbot will be totally lost.  Getting the bot to
> understand incremental patches would be useful for sure ... but it's
> outside the scope of what I'm asking for now, which is just clear
> documentation of what the bot can do already.

Right, but the use case I'm mentioning is a bit different: provide
another patch while *not* triggering the cfbot.  I've seen cases in
the past where people want to share some code to the OP and it seems
reasonable to allow that without risking the trigger the cfbot, at
least not without the OP validating or adapting the changes.

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane

Julien Rouhaud  writes:
> On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
>> 2. You are attaching some random files, and would like to not
>> displace the cfbot's idea of the latest patchset.

> I'm assuming that someone wanting to send an additional patch to be applied on
> top of the OP patchset is part of 2?

AFAIK, if you're submitting a patch then you have to attach a complete
patchset, or the cfbot will be totally lost.  Getting the bot to
understand incremental patches would be useful for sure ... but it's
outside the scope of what I'm asking for now, which is just clear
documentation of what the bot can do already.

regards, tom lane

Re: cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Julien Rouhaud

On Wed, Jan 12, 2022 at 01:19:22AM -0500, Tom Lane wrote:
> Julien Rouhaud  writes:
> > For this kind of situation I think that the usual solution is to use a
> > .txt extension to make sure that the cfbot won't try to apply it.
> 
> Yeah ... this has come up before.  Is there a documented way to
> attach files that the cfbot will ignore?

Not that I know of unfortunately.  I think the apply part is done by
https://github.com/macdice/cfbot/blob/master/cfbot_patchburner_chroot_ctl.sh#L93-L120,
which seems reasonable.  So basically any extension outside of those could be
used, excluding of course anything clearly suspicious that would be rejected by
many email providers.

> Two specific scenarios seem to be interesting:
> 
> 1. You are attaching patch(es) plus some non-patch files
> 
> 2. You are attaching some random files, and would like to not
> displace the cfbot's idea of the latest patchset.

I'm assuming that someone wanting to send an additional patch to be applied on
top of the OP patchset is part of 2?

cfbot wrangling (was Re: Add checkpoint and redo LSN to LogCheckpointEnd log message)

2022-01-11 Thread Tom Lane

Julien Rouhaud  writes:
> For this kind of situation I think that the usual solution is to use a
> .txt extension to make sure that the cfbot won't try to apply it.

Yeah ... this has come up before.  Is there a documented way to
attach files that the cfbot will ignore?

Two specific scenarios seem to be interesting:

1. You are attaching patch(es) plus some non-patch files

2. You are attaching some random files, and would like to not
displace the cfbot's idea of the latest patchset.

I don't know exactly how to do either of those.

regards, tom lane

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-11 Thread Julien Rouhaud

Hi,

On Tue, Dec 28, 2021 at 10:56 AM Bharath Rupireddy
 wrote:
>
> attaching v1-0001-XXX from the initial mail again just for the sake of
> completion:

Unfortunately this breaks the cfbot as it tries to apply this patch
too: http://cfbot.cputube.org/patch_36_3474.log.

For this kind of situation I think that the usual solution is to use a
.txt extension to make sure that the cfbot won't try to apply it.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2022-01-03 Thread Kyotaro Horiguchi

At Tue, 28 Dec 2021 08:26:28 +0530, Bharath Rupireddy 
 wrote in 
> On Fri, Dec 24, 2021 at 5:54 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Dec 24, 2021 at 5:42 PM Michael Paquier  wrote:
> > >
> > > On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote:
> > > > I thougt about something like the following, but your proposal may be
> > > > clearer.
> > >
> > > +"LSN=%X/%X, REDO LSN=%X/%X",
> > > This could be rather confusing for the average user, even if I agree
> > > that this is some information that only an advanced user could
> > > understand.  Could it be possible to define those fields in a more
> > > deterministic way?  For one, it is hard to understand the relationship
> > > between both fields without looking at the code, particulary if both
> > > share the same value.  It is at least rather..  Well, mostly, easy to
> > > guess what each other field means in this context, which is not the
> > > case of what you are proposing here.  One idea could be use also
> > > "start point" for REDO, for example.
> >
> > How about "location=%X/%X, REDO start location=%X/%X"? The entire log
> > message can look like below:
..
> Thoughts?

It seems to me "LSN" or just "location" is more confusing or
mysterious than "REDO LSN" for the average user. If we want to avoid
being technically too detailed, we would use just "start LSN=%X/%X,
end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-27 Thread Bharath Rupireddy

On Fri, Dec 24, 2021 at 5:54 PM Bharath Rupireddy
 wrote:
>
> On Fri, Dec 24, 2021 at 5:42 PM Michael Paquier  wrote:
> >
> > On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote:
> > > I thougt about something like the following, but your proposal may be
> > > clearer.
> >
> > +"LSN=%X/%X, REDO LSN=%X/%X",
> > This could be rather confusing for the average user, even if I agree
> > that this is some information that only an advanced user could
> > understand.  Could it be possible to define those fields in a more
> > deterministic way?  For one, it is hard to understand the relationship
> > between both fields without looking at the code, particulary if both
> > share the same value.  It is at least rather..  Well, mostly, easy to
> > guess what each other field means in this context, which is not the
> > case of what you are proposing here.  One idea could be use also
> > "start point" for REDO, for example.
>
> How about "location=%X/%X, REDO start location=%X/%X"? The entire log
> message can look like below:
>
> 2021-12-24 12:20:19.140 UTC [1977834] LOG:  checkpoint
> complete:location=%X/%X, REDO start location=%X/%X; wrote 7 buffers
> (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s,
> sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s,
> average=0.002 s; distance=293 kB, estimate=56584 kB
>
> Another variant:
> 2021-12-24 12:20:19.140 UTC [1977834] LOG:  checkpoint completed at
> location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%);
> 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007
> s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s;
> distance=293 kB, estimate=56584 kB
> 2021-12-24 12:20:19.140 UTC [1977834] LOG:  restartpoint completed at
> location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%);
> 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007
> s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s;
> distance=293 kB, estimate=56584 kB

Here are the 2 patches.

one(v2-1-XXX.patch) adding the info as:
2021-12-28 02:44:34.870 UTC [2384386] LOG:  checkpoint complete:
location=0/1B03040, REDO start location=0/1B03008; wrote 466 buffers
(2.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.014 s,
sync=0.038 s, total=0.072 s; sync files=21, longest=0.022 s,
average=0.002 s; distance=6346 kB, estimate=6346 kB

another(v2-2-XXX.patch) adding the info as:
2021-12-28 02:52:24.464 UTC [2394396] LOG:  checkpoint completed at
location=0/212FFC8 with REDO start location=0/212FF90: wrote 451
buffers (2.8%); 0 WAL file(s) added, 0 removed, 1 recycled;
write=0.012 s, sync=0.032 s, total=0.071 s; sync files=6,
longest=0.022 s, average=0.006 s; distance=6272 kB, estimate=6272 kB

attaching v1-0001-XXX from the initial mail again just for the sake of
completion:
2021-12-23 14:58:54.714 UTC [1965649] LOG:  checkpoint complete: wrote
0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB;
LSN=0/14D0AD0, REDO LSN=0/14D0AD0

Thoughts?

Regards,
Bharath Rupireddy.


v2-1-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data


v2-2-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data


v1-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-24 Thread Bharath Rupireddy

On Fri, Dec 24, 2021 at 5:42 PM Michael Paquier  wrote:
>
> On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote:
> > I thougt about something like the following, but your proposal may be
> > clearer.
>
> +"LSN=%X/%X, REDO LSN=%X/%X",
> This could be rather confusing for the average user, even if I agree
> that this is some information that only an advanced user could
> understand.  Could it be possible to define those fields in a more
> deterministic way?  For one, it is hard to understand the relationship
> between both fields without looking at the code, particulary if both
> share the same value.  It is at least rather..  Well, mostly, easy to
> guess what each other field means in this context, which is not the
> case of what you are proposing here.  One idea could be use also
> "start point" for REDO, for example.

How about "location=%X/%X, REDO start location=%X/%X"? The entire log
message can look like below:

2021-12-24 12:20:19.140 UTC [1977834] LOG:  checkpoint
complete:location=%X/%X, REDO start location=%X/%X; wrote 7 buffers
(0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s,
sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s,
average=0.002 s; distance=293 kB, estimate=56584 kB

Another variant:
2021-12-24 12:20:19.140 UTC [1977834] LOG:  checkpoint completed at
location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%);
0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007
s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s;
distance=293 kB, estimate=56584 kB
2021-12-24 12:20:19.140 UTC [1977834] LOG:  restartpoint completed at
location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%);
0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007
s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s;
distance=293 kB, estimate=56584 kB

Thoughts?

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-24 Thread Michael Paquier

On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote:
> I thougt about something like the following, but your proposal may be
> clearer.

+"LSN=%X/%X, REDO LSN=%X/%X",
This could be rather confusing for the average user, even if I agree
that this is some information that only an advanced user could
understand.  Could it be possible to define those fields in a more
deterministic way?  For one, it is hard to understand the relationship
between both fields without looking at the code, particulary if both
share the same value.  It is at least rather..  Well, mostly, easy to
guess what each other field means in this context, which is not the
case of what you are proposing here.  One idea could be use also
"start point" for REDO, for example. 
--
Michael


signature.asc
Description: PGP signature

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-24 Thread Bharath Rupireddy

On Fri, Dec 24, 2021 at 11:21 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 23 Dec 2021 20:35:54 +0530, Bharath Rupireddy 
>  wrote in
> > Hi,
> >
> > It is useful (for debugging purposes) if the checkpoint end message
> > has the checkpoint LSN and REDO LSN [1]. It gives more context while
> > analyzing checkpoint-related issues. The pg_controldata gives the last
> > checkpoint LSN and REDO LSN, but having this info alongside the log
> > message helps analyze issues that happened previously, connect the
> > dots and identify the root cause.
> >
> > Attaching a small patch herewith. Thoughts?
>
> A big +1 from me.  I thought about proposing the same in the past.

Thanks for the review. I've added a CF entry to not lose track -
https://commitfest.postgresql.org/36/3474/.

> > [1]
> > 2021-12-23 14:58:54.694 UTC [1965649] LOG:  checkpoint starting:
> > shutdown immediate
> > 2021-12-23 14:58:54.714 UTC [1965649] LOG:  checkpoint complete: wrote
> > 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> > write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0,
> > longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB;
> > LSN=0/14D0AD0, REDO LSN=0/14D0AD0
>
> I thougt about something like the following, but your proposal may be
> clearer.
>
> > WAL range=[0/14D0340, 0/14D0AD0]

Yeah the proposed in the v1 is clear saying checkpoint/restartpoint
LSN and REDO LSN.

Regards,
Bharath Rupireddy.

Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-23 Thread Kyotaro Horiguchi

At Thu, 23 Dec 2021 20:35:54 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> It is useful (for debugging purposes) if the checkpoint end message
> has the checkpoint LSN and REDO LSN [1]. It gives more context while
> analyzing checkpoint-related issues. The pg_controldata gives the last
> checkpoint LSN and REDO LSN, but having this info alongside the log
> message helps analyze issues that happened previously, connect the
> dots and identify the root cause.
> 
> Attaching a small patch herewith. Thoughts?

A big +1 from me.  I thought about proposing the same in the past.

> [1]
> 2021-12-23 14:58:54.694 UTC [1965649] LOG:  checkpoint starting:
> shutdown immediate
> 2021-12-23 14:58:54.714 UTC [1965649] LOG:  checkpoint complete: wrote
> 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB;
> LSN=0/14D0AD0, REDO LSN=0/14D0AD0

I thougt about something like the following, but your proposal may be
clearer.

> WAL range=[0/14D0340, 0/14D0AD0]

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Add checkpoint and redo LSN to LogCheckpointEnd log message

2021-12-23 Thread Bharath Rupireddy

Hi,

It is useful (for debugging purposes) if the checkpoint end message
has the checkpoint LSN and REDO LSN [1]. It gives more context while
analyzing checkpoint-related issues. The pg_controldata gives the last
checkpoint LSN and REDO LSN, but having this info alongside the log
message helps analyze issues that happened previously, connect the
dots and identify the root cause.

Attaching a small patch herewith. Thoughts?

[1]
2021-12-23 14:58:54.694 UTC [1965649] LOG:  checkpoint starting:
shutdown immediate
2021-12-23 14:58:54.714 UTC [1965649] LOG:  checkpoint complete: wrote
0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB;
LSN=0/14D0AD0, REDO LSN=0/14D0AD0

Regards,
Bharath Rupireddy.


v1-0001-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch
Description: Binary data

70 matches

Mail list logo