Re: cleanup patches for incremental backup

2024-03-20 Thread Nathan Bossart
On Wed, Mar 20, 2024 at 02:53:01PM -0400, Robert Haas wrote:
> On Wed, Mar 20, 2024 at 2:35 PM Nathan Bossart  
> wrote:
>> Committed.
> 
> Thanks. Sorry you had to clean up after me. :-(

No worries.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-03-20 Thread Robert Haas
On Wed, Mar 20, 2024 at 2:35 PM Nathan Bossart  wrote:
> On Tue, Mar 19, 2024 at 01:15:02PM -0500, Nathan Bossart wrote:
> > Assuming there are no objections or feedback, I plan to commit these two
> > patches within the next couple of days.
>
> Committed.

Thanks. Sorry you had to clean up after me. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-03-20 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 01:15:02PM -0500, Nathan Bossart wrote:
> Assuming there are no objections or feedback, I plan to commit these two
> patches within the next couple of days.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-03-19 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 08:52:55PM -0500, Nathan Bossart wrote:
> Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
>  test failures."

> Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

Assuming there are no objections or feedback, I plan to commit these two
patches within the next couple of days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-03-14 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 04:00:10PM -0500, Nathan Bossart wrote:
> Separately, I suppose it's probably time to revert the temporary debugging
> code adding by commit 5ddf997.  I can craft a patch for that, too.

As promised...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3923e30acb1d4e21ce311b25edbcd8b96b4223d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 14 Mar 2024 20:36:48 -0500
Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
 test failures."

This reverts commit 5ddf9973477729cf161b4ad0a1efd52f4fea9c88.
---
 src/backend/backup/walsummary.c   |  7 ---
 src/bin/pg_walsummary/t/002_blocks.pl | 14 --
 2 files changed, 21 deletions(-)

diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
index 4d047e1c02..322ae3c3ad 100644
--- a/src/backend/backup/walsummary.c
+++ b/src/backend/backup/walsummary.c
@@ -252,15 +252,8 @@ RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not stat file \"%s\": %m", path)));
-	/* XXX temporarily changed to debug buildfarm failures */
-#if 0
 	ereport(DEBUG2,
 			(errmsg_internal("removing file \"%s\"", path)));
-#else
-	ereport(LOG,
-			(errmsg_internal("removing file \"%s\" cutoff_time=%llu", path,
-			 (unsigned long long) cutoff_time)));
-#endif
 }
 
 /*
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index d4bae3d564..52d3bd8840 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -51,7 +51,6 @@ my $summarized_lsn = $node1->safe_psql('postgres', From 38dda89ee48736489d37e18dea186e90358468b0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 14 Mar 2024 20:46:02 -0500
Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

---
 src/backend/postmaster/walsummarizer.c | 4 ++--
 src/backend/utils/misc/guc_tables.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;
 
 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;
 
 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
 
-- 
2.25.1



Re: cleanup patches for incremental backup

2024-03-14 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 12:05:15PM -0600, Nathan Bossart wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
> 
>   /*
>* Files should only be removed if the last modification time precedes 
> the
>* cutoff time we compute here.
>*/
>   cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

I've attached a short patch for fixing this overflow risk.  Specifically,
it limits wal_summary_keep_time to INT_MAX / SECS_PER_MINUTE, just like
log_rotation_age.

I considering checking for overflow when we subtract the keep-time from the
result of time(2), but AFAICT there's only a problem if time_t is unsigned,
which Wikipedia leads me to believe is unusual [0], so I figured we might
be able to just wait this one out until 2038.

> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Separately, I suppose it's probably time to revert the temporary debugging
code adding by commit 5ddf997.  I can craft a patch for that, too.

[0] https://en.wikipedia.org/wiki/Unix_time#Representing_the_number

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;
 
 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;
 
 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
 


Re: cleanup patches for incremental backup

2024-01-31 Thread Robert Haas
On Tue, Jan 30, 2024 at 11:52 AM Robert Haas  wrote:
> Here's a patch for that. I now think
> a7097ca630a25dd2248229f21ebce4968d85d10a was actually misguided, and
> served only to mask some of the failures caused by waiting for the WAL
> summary file.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-30 Thread Robert Haas
On Tue, Jan 30, 2024 at 10:51 AM Robert Haas  wrote:
> I think the solution here is to find a better way to wait for the
> inserts to be summarized, one that actually does wait for that to
> happen.

Here's a patch for that. I now think
a7097ca630a25dd2248229f21ebce4968d85d10a was actually misguided, and
served only to mask some of the failures caused by waiting for the WAL
summary file.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Revise-pg_walsummary-s-002_blocks-test-to-avoid-s.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-30 Thread Robert Haas
On Mon, Jan 29, 2024 at 4:13 PM Nathan Bossart  wrote:
> On Mon, Jan 29, 2024 at 03:18:50PM -0500, Robert Haas wrote:
> > I'm wondering if what we need to do is run pg_walsummary on both
> > summary files in that case. If we just pick one or the other, how do
> > we know which one to pick?
>
> Even if we do that, isn't it possible that none of the summaries will
> include the change?  Presently, we get the latest summarized LSN, make a
> change, and then wait for the next summary file with a greater LSN than
> what we saw before the change.  But AFAICT there's no guarantee that means
> the change has been summarized yet, although the chances of that happening
> in a test are probably pretty small.
>
> Could we get the LSN before and after making the change and then inspect
> all summaries that include that LSN range?

The trick here is that each WAL summary file covers one checkpoint
cycle. The intent of the test is to load data into the table,
checkpoint, see what summaries exist, then update a row, checkpoint
again, and see what summaries now exist. We expect one new summary
because there's been one new checkpoint. When I was thinking about
this yesterday, I was imagining that we were somehow getting an extra
checkpoint in some cases. But it looks like it's actually an
off-by-one situation. In
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10
the new files that show up between "after insert" and "after new
summary" are:

00010152FAE0015AAAC8.summary (LSN distance ~500k)
00010152F7A80152FAE0.summary (LSN distance 824 bytes)

The checkpoint after the inserts says:

LOG:  checkpoint complete: wrote 14 buffers (10.9%); 0 WAL file(s)
added, 0 removed, 0 recycled; write=0.956 s, sync=0.929 s, total=3.059
s; sync files=39, longest=0.373 s, average=0.024 s; distance=491 kB,
estimate=491 kB; lsn=0/15AAB20, redo lsn=0/15AAAC8

And the checkpoint after the single-row update says:

LOG:  checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s)
added, 0 removed, 0 recycled; write=0.648 s, sync=0.355 s, total=2.798
s; sync files=3, longest=0.348 s, average=0.119 s; distance=11 kB,
estimate=443 kB; lsn=0/15AD770, redo lsn=0/15AD718

So both of the new WAL summary files that are appearing here are from
checkpoints that happened before the single-row update. The larger
file is the one covering the 400 inserts, and the smaller one is the
checkpoint before that. Which means that the "Wait for a new summary
to show up." code isn't actually waiting long enough, and then the
whole thing goes haywire. The problem is, I think, that this code
naively thinks it can just wait for summarized_lsn and everything will
be fine ... but that assumes we were caught up when we first measured
the summarized_lsn, and that need not be so, because it takes some
short but non-zero amount of time for the summarizer to catch up with
the WAL generated during initdb.

I think the solution here is to find a better way to wait for the
inserts to be summarized, one that actually does wait for that to
happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-29 Thread Nathan Bossart
On Mon, Jan 29, 2024 at 03:18:50PM -0500, Robert Haas wrote:
> I'm wondering if what we need to do is run pg_walsummary on both
> summary files in that case. If we just pick one or the other, how do
> we know which one to pick?

Even if we do that, isn't it possible that none of the summaries will
include the change?  Presently, we get the latest summarized LSN, make a
change, and then wait for the next summary file with a greater LSN than
what we saw before the change.  But AFAICT there's no guarantee that means
the change has been summarized yet, although the chances of that happening
in a test are probably pretty small.

Could we get the LSN before and after making the change and then inspect
all summaries that include that LSN range?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-29 Thread Robert Haas
On Mon, Jan 29, 2024 at 1:21 PM Nathan Bossart  wrote:
> > Ah, I think this query:
> >
> >   SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
> >   WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
> >
> > is returning more than one row in some cases.  I attached a quick sketch of
> > an easy way to reproduce the issue as well as one way to fix it.
>
> The buildfarm just caught a failure with the new logging in place:
>
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10
>
> I'm not totally sure my "fix" is correct, but I think this does confirm the
> theory.

Ah. The possibilities of ending up with TWO new WAL summaries never
occurred to me. Things that never occurred to the developer are a
leading cause of bugs, and so here.

I'm wondering if what we need to do is run pg_walsummary on both
summary files in that case. If we just pick one or the other, how do
we know which one to pick?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-29 Thread Nathan Bossart
On Sat, Jan 27, 2024 at 10:31:09AM -0600, Nathan Bossart wrote:
> On Sat, Jan 27, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
>> I'm discouraged by "\n1" in the file name and in the
>> "examining summary..." message.
>> regress_log_002_blocks from the following successful test run on the same
>> sungazer node contains:
>> [15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to 
>> 0/155E750
>> [15:21:58.925](0.001s) ok 1 - WAL summary file exists
> 
> Ah, I think this query:
> 
>   SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
>   WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
> 
> is returning more than one row in some cases.  I attached a quick sketch of
> an easy way to reproduce the issue as well as one way to fix it.

The buildfarm just caught a failure with the new logging in place:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10

I'm not totally sure my "fix" is correct, but I think this does confirm the
theory.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-27 Thread Nathan Bossart
On Sat, Jan 27, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
> 24.01.2024 20:46, Robert Haas wrote:
>> This is weird. There's a little more detail in the log file,
>> regress_log_002_blocks, e.g. from the first failure you linked:
>> 
>> [11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
>> [11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
>> [11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 
>> 0/155BAF0
>> # 1
>> ...
>> [11:18:21.349](0.000s) #  got: 'pg_walsummary: error: could
>> not open file 
>> "/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
>> # 1.summary": No such file or directory'
>> 
>> The "examining summary" line is generated based on the output of
>> pg_available_wal_summaries(). The way that works is that the server
>> calls readdir(), disassembles the filename into a TLI and two LSNs,
>> and returns the result.
> 
> I'm discouraged by "\n1" in the file name and in the
> "examining summary..." message.
> regress_log_002_blocks from the following successful test run on the same
> sungazer node contains:
> [15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to 
> 0/155E750
> [15:21:58.925](0.001s) ok 1 - WAL summary file exists

Ah, I think this query:

SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'

is returning more than one row in some cases.  I attached a quick sketch of
an easy way to reproduce the issue as well as one way to fix it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index 40908da8cb..5609b1bd14 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -64,10 +64,16 @@ SELECT EXISTS (
 )
 EOM
 
+$node1->safe_psql('postgres', safe_psql('postgres', < '$summarized_lsn'
+	WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn' ORDER BY end_lsn LIMIT 1
 EOM
 my ($tli, $start_lsn, $end_lsn) = split(/\|/, $details);
 note("examining summary for TLI $tli from $start_lsn to $end_lsn");


Re: cleanup patches for incremental backup

2024-01-27 Thread Alexander Lakhin

24.01.2024 20:46, Robert Haas wrote:

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) #  got: 'pg_walsummary: error: could
not open file 
"/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result.


I'm discouraged by "\n1" in the file name and in the
"examining summary..." message.
regress_log_002_blocks from the following successful test run on the same
sungazer node contains:
[15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to 0/155E750
[15:21:58.925](0.001s) ok 1 - WAL summary file exists

Best regards,
Alexander




Re: cleanup patches for incremental backup

2024-01-26 Thread Alexander Lakhin

Hello Robert,

26.01.2024 21:37, Robert Haas wrote:

On Fri, Jan 26, 2024 at 12:39 PM Nathan Bossart
  wrote:

On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:

Here is v2 with that addition.

Looks reasonable.

Thanks for the report & review. I have committed that version.


While trying to reproduce the 002_blocks test failure, I've encountered
another anomaly (or two):
make -s check -C src/bin/pg_walsummary/ PROVE_TESTS="t/002*" 
PROVE_FLAGS="--timer"
# +++ tap check in src/bin/pg_walsummary +++
[05:40:38] t/002_blocks.pl .. # poll_query_until timed out executing this query:
# SELECT EXISTS (
# SELECT * from pg_available_wal_summaries()
# WHERE tli = 0 AND end_lsn > '0/0'
# )
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[05:40:38] t/002_blocks.pl .. ok   266739 ms ( 0.00 usr  0.01 sys + 17.51 cusr 
26.79 csys = 44.31 CPU)
[05:45:05]
All tests successful.
Files=1, Tests=3, 267 wallclock secs ( 0.02 usr  0.02 sys + 17.51 cusr 26.79 
csys = 44.34 CPU)
Result: PASS

It looks like the test may call pg_get_wal_summarizer_state() when
WalSummarizerCtl->initialized is false yet, i. e. before the first
GetOldestUnsummarizedLSN() call.
I could reproduce the issue easily (within 10 test runs) with
pg_usleep(10L);
added inside WalSummarizerMain() just below:
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

But the fact that the test passes regardless of the timeout, make me
wonder, whether any test should fail when such timeout occurs?

Best regards,
Alexander

Re: cleanup patches for incremental backup

2024-01-26 Thread Robert Haas
On Fri, Jan 26, 2024 at 12:39 PM Nathan Bossart
 wrote:
> On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:
> > Here is v2 with that addition.
>
> Looks reasonable.

Thanks for the report & review. I have committed that version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-26 Thread Nathan Bossart
On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:
> Here is v2 with that addition.

Looks reasonable.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-26 Thread Robert Haas
On Thu, Jan 25, 2024 at 11:08 AM Nathan Bossart
 wrote:
> On Thu, Jan 25, 2024 at 10:06:41AM -0500, Robert Haas wrote:
> > On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart  
> > wrote:
> >> That seems like a reasonable starting point.  Even if it doesn't help
> >> determine the root cause, it should at least help rule out concurrent
> >> summary removal.
> >
> > Here is a patch for that.
>
> LGTM.  The only thing I might add is the cutoff_time in that LOG.

Here is v2 with that addition.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Temporary-patch-to-help-debug-pg_walsummary-test-.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-25 Thread Nathan Bossart
On Thu, Jan 25, 2024 at 10:06:41AM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart  
> wrote:
>> That seems like a reasonable starting point.  Even if it doesn't help
>> determine the root cause, it should at least help rule out concurrent
>> summary removal.
> 
> Here is a patch for that.

LGTM.  The only thing I might add is the cutoff_time in that LOG.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-25 Thread Robert Haas
On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart  wrote:
> That seems like a reasonable starting point.  Even if it doesn't help
> determine the root cause, it should at least help rule out concurrent
> summary removal.

Here is a patch for that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Temporary-patch-to-help-debug-pg_walsummary-test-.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 02:08:08PM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart  
> wrote:
>> Otherwise, I think we'll probably need to add some additional logging to
>> figure out what is happening...
> 
> Where, though? I suppose we could:
> 
> 1. Change the server code so that it logs each WAL summary file
> removed at a log level high enough to show up in the test logs.
> 
> 2. Change the TAP test so that it prints out readdir(WAL summary
> directory) at various points in the test.

That seems like a reasonable starting point.  Even if it doesn't help
determine the root cause, it should at least help rule out concurrent
summary removal.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart  wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
>
> /*
>  * Files should only be removed if the last modification time 
> precedes the
>  * cutoff time we compute here.
>  */
> cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
>
> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Where, though? I suppose we could:

1. Change the server code so that it logs each WAL summary file
removed at a log level high enough to show up in the test logs.

2. Change the TAP test so that it prints out readdir(WAL summary
directory) at various points in the test.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
On Wed, Jan 24, 2024 at 12:46:16PM -0500, Robert Haas wrote:
> The "examining summary" line is generated based on the output of
> pg_available_wal_summaries(). The way that works is that the server
> calls readdir(), disassembles the filename into a TLI and two LSNs,
> and returns the result. Then, a fraction of a second later, the test
> script reassembles those components into a filename and finds the file
> missing. If the logic to translate between filenames and TLIs & LSNs
> were incorrect, the test would fail consistently. So the only
> explanation that seems to fit the facts is the file disappearing out
> from under us. But that really shouldn't happen. We do have code to
> remove such files in MaybeRemoveOldWalSummaries(), but it's only
> supposed to be nuking files more than 10 days old.
> 
> So I don't really have a theory here as to what could be happening. :-(

There might be an overflow risk in the cutoff time calculation, but I doubt
that's the root cause of these failures:

/*
 * Files should only be removed if the last modification time precedes 
the
 * cutoff time we compute here.
 */
cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

Otherwise, I think we'll probably need to add some additional logging to
figure out what is happening...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart
 wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
> #   Failed test 'WAL summary file exists'
> #   at t/002_blocks.pl line 79.
>
> #   Failed test 'stdout shows block 0 modified'
> #   at t/002_blocks.pl line 85.
> #   ''
> # doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) #  got: 'pg_walsummary: error: could
not open file 
"/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server
calls readdir(), disassembles the filename into a TLI and two LSNs,
and returns the result. Then, a fraction of a second later, the test
script reassembles those components into a filename and finds the file
missing. If the logic to translate between filenames and TLIs & LSNs
were incorrect, the test would fail consistently. So the only
explanation that seems to fit the facts is the file disappearing out
from under us. But that really shouldn't happen. We do have code to
remove such files in MaybeRemoveOldWalSummaries(), but it's only
supposed to be nuking files more than 10 days old.

So I don't really have a theory here as to what could be happening. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-24 Thread Nathan Bossart
I'm seeing some recent buildfarm failures for pg_walsummary:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36

The signature looks nearly identical in each:

#   Failed test 'WAL summary file exists'
#   at t/002_blocks.pl line 79.

#   Failed test 'stdout shows block 0 modified'
#   at t/002_blocks.pl line 85.
#   ''
# doesn't match '(?^m:FORK main: block 0$)'

I haven't been able to reproduce the issue on my machine, and I haven't
figured out precisely what is happening yet, but I wanted to make sure
there is awareness.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: cleanup patches for incremental backup

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 4:50 AM Matthias van de Meent
 wrote:
> Sure, that's fine with me.

OK, committed that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-18 Thread Matthias van de Meent
On Wed, 17 Jan 2024 at 21:10, Robert Haas  wrote:
>
> On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent
>  wrote:
> > Sure, added in attached.
>
> I think this mostly looks good now but I just realized that I think
> this needs rephrasing:
>
> + To restore incremental backups the tool  linkend="app-pgcombinebackup"/>
> + is used, which combines incremental backups with a base backup and
> + WAL to restore a
> + database cluster to
> + a consistent state.
>
> The way this is worded, at least to me, it makes it sound like
> pg_combinebackup is going to do the WAL recovery for you, which it
> isn't. Maybe:
>
> To restore incremental backups the tool  linkend="app-pgcombinebackup"/> is used, which combines incremental
> backups with a base backup. Afterwards, recovery can use  linkend="glossary-wal">WAL to bring the  linkend="glossary-db-cluster">database cluster to a
> consistent state.

Sure, that's fine with me.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: cleanup patches for incremental backup

2024-01-17 Thread Robert Haas
On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent
 wrote:
> Sure, added in attached.

I think this mostly looks good now but I just realized that I think
this needs rephrasing:

+ To restore incremental backups the tool 
+ is used, which combines incremental backups with a base backup and
+ WAL to restore a
+ database cluster to
+ a consistent state.

The way this is worded, at least to me, it makes it sound like
pg_combinebackup is going to do the WAL recovery for you, which it
isn't. Maybe:

To restore incremental backups the tool  is used, which combines incremental
backups with a base backup. Afterwards, recovery can use WAL to bring the database cluster to a
consistent state.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-17 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 21:49, Robert Haas  wrote:
>
> On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent
>  wrote:
> + A special base 
> backup
> + that for some WAL-logged relations only contains the pages that were
> + modified since a previous backup, as opposed to the full relation data 
> of
> + normal base backups. Like base backups, it is generated by the tool
> + .
>
> Could we say "that for some files may contain only those pages that
> were modified since a previous backup, as opposed to the full contents
> of every file"?

Sure, added in attached.

> + To restore incremental backups the tool 
> + is used, which combines the incremental backups with a base backup and
> + [...]
> I wondered if this needed to be clearer that the chain of backups
> could have length > 2. But on further reflection, I think it's fine,
> unless you feel otherwise.

I removed "the" from the phrasing "the incremental backups", which
makes it a bit less restricted.

> The rest LGTM.

In the latest patch I also fixed the casing of "Incremental Backup" to
"... backup", to be in line with most other multi-word items.

Thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent
 wrote:
> > One other thought is that the incremental backup only replaces
> > relation files with incremental files, and it never does anything
> > about FSM files. So the statement that it only contains data that was
> > potentially changed isn't quite correct. It might be better to phrase
> > it the other way around i.e. it is like a full backup, except that
> > some files can be replaced by incremental files which omit blocks to
> > which no WAL-logged changes have been made.
>
> How about the attached?

I like the direction.

+ A special base backup
+ that for some WAL-logged relations only contains the pages that were
+ modified since a previous backup, as opposed to the full relation data of
+ normal base backups. Like base backups, it is generated by the tool
+ .

Could we say "that for some files may contain only those pages that
were modified since a previous backup, as opposed to the full contents
of every file"? My thoughts are (1) there's no hard guarantee that an
incremental backup will replace even one file with an incremental
file, although in practice it is probably almost always going to
happen and (2) pg_combinebackup would actually be totally fine with
any file at all being sent incrementally; it's only that the server
isn't smart enough to figure out how to do this with e.g. SLRU data
right now.

+ To restore incremental backups the tool 
+ is used, which combines the incremental backups with a base backup and
+ WAL to restore a
+ database cluster to
+ a consistent state.

I wondered if this needed to be clearer that the chain of backups
could have length > 2. But on further reflection, I think it's fine,
unless you feel otherwise.

The rest LGTM.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-16 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 16:39, Robert Haas  wrote:
>
> On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
>  wrote:
> > Off-list I was notified that the new WAL summarizer process was not
> > yet added to the glossary, so PFA a patch that does that.
> > In passing, it also adds "incremental backup" to the glossary, and
> > updates the documented types of backends in monitoring.sgml with the
> > new backend type, too.
>
> I wonder if it's possible that you sent the wrong version of this
> patch, because:
>
> (1) The docs don't build with this applied. I'm not sure if it's the
> only problem, but  the closing >.

That's my mistake, I didn't check install-world yet due to unrelated
issues building the docs. I've since sorted out these issues (this was
a good stick to get that done), so this issue is fixed in the attached
patch.

> (2) The changes to monitoring.sgml contain an unrelated change, about
> pg_stat_all_indexes.idx_scan.

Thanks for noticing, fixed in attached.

> Also, I think the "For more information, see  /> bit should have a period after the markup tag, as we seem to do in
> other cases.

Fixed.

> One other thought is that the incremental backup only replaces
> relation files with incremental files, and it never does anything
> about FSM files. So the statement that it only contains data that was
> potentially changed isn't quite correct. It might be better to phrase
> it the other way around i.e. it is like a full backup, except that
> some files can be replaced by incremental files which omit blocks to
> which no WAL-logged changes have been made.

How about the attached?


Kind regards,

Matthias van de Meent


v2-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-16 Thread Robert Haas
On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
 wrote:
> Off-list I was notified that the new WAL summarizer process was not
> yet added to the glossary, so PFA a patch that does that.
> In passing, it also adds "incremental backup" to the glossary, and
> updates the documented types of backends in monitoring.sgml with the
> new backend type, too.

I wonder if it's possible that you sent the wrong version of this
patch, because:

(1) The docs don't build with this applied. I'm not sure if it's the
only problem, but .

(2) The changes to monitoring.sgml contain an unrelated change, about
pg_stat_all_indexes.idx_scan.

Also, I think the "For more information, see  bit should have a period after the markup tag, as we seem to do in
other cases.

One other thought is that the incremental backup only replaces
relation files with incremental files, and it never does anything
about FSM files. So the statement that it only contains data that was
potentially changed isn't quite correct. It might be better to phrase
it the other way around i.e. it is like a full backup, except that
some files can be replaced by incremental files which omit blocks to
which no WAL-logged changes have been made.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-15 Thread Matthias van de Meent
On Mon, 15 Jan 2024 at 17:58, Robert Haas  wrote:
>
> On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> > I've found one more typo in the sgml:
> > summarized_pid
> > And one in a comment:
> > sumamry
> >
> > A trivial fix is attached.
>
> Thanks, committed.

Off-list I was notified that the new WAL summarizer process was not
yet added to the glossary, so PFA a patch that does that.
In passing, it also adds "incremental backup" to the glossary, and
updates the documented types of backends in monitoring.sgml with the
new backend type, too.

Kind regards,

Matthias van de Meent.


v1-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: cleanup patches for incremental backup

2024-01-15 Thread Robert Haas
On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin  wrote:
> I've found one more typo in the sgml:
> summarized_pid
> And one in a comment:
> sumamry
>
> A trivial fix is attached.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-13 Thread Alexander Lakhin

Hello Robert,

12.01.2024 17:50, Robert Haas wrote:

On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan
- FSIP)  wrote:

Thank you for developing the new tool. I have attached a patch that corrects 
the spelling of the --individual option in the SGML file.

Thanks, committed.


I've found one more typo in the sgml:
summarized_pid
And one in a comment:
sumamry

A trivial fix is attached.

Best regards,
Alexanderdiff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f7d409e60..210c7c0b02 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26578,7 +26578,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
 record that it has consumed, which must always be greater than or
 equal to summarized_lsn; if the WAL summarizer is
 not running, it will be equal to summarized_lsn.
-summarized_pid is the PID of the WAL summarizer
+summarizer_pid is the PID of the WAL summarizer
 process, if it is running, and otherwise NULL.

   
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index d473471bc7..d609d2c547 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -63,7 +63,7 @@ SELECT EXISTS (
 )
 EOM
 
-# Figure out the exact details for the new sumamry file.
+# Figure out the exact details for the new summary file.
 my $details = $node1->safe_psql('postgres', < '$summarized_lsn'


Re: cleanup patches for incremental backup

2024-01-12 Thread Robert Haas
On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan
- FSIP)  wrote:
> Thank you for developing the new tool. I have attached a patch that corrects 
> the spelling of the --individual option in the SGML file.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: cleanup patches for incremental backup

2024-01-11 Thread Shinoda, Noriyoshi (HPE Services Japan - FSIP)
Hi,
Thank you for developing the new tool. I have attached a patch that corrects 
the spelling of the --individual option in the SGML file.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Robert Haas  
Sent: Friday, January 12, 2024 3:13 AM
To: PostgreSQL Hackers ; Jakub Wartak 

Subject: Re: cleanup patches for incremental backup

On Tue, Jan 9, 2024 at 1:18 PM Robert Haas  wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are 
> no comments.

Done, with a minor adjustment to 0003.

--
Robert Haas
EDB: http://www.enterprisedb.com 




pg_walsummary_sgml_v1.diff
Description: pg_walsummary_sgml_v1.diff


Re: cleanup patches for incremental backup

2024-01-11 Thread Robert Haas
On Tue, Jan 9, 2024 at 1:18 PM Robert Haas  wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are
> no comments.

Done, with a minor adjustment to 0003.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cleanup patches for incremental backup

2024-01-09 Thread Robert Haas
Hello again,

I have now committed 0001.

I got some off-list review of 0004; specifically, Michael Paquier said
it looked OK, and Tom Lane found another bug. So I've added a fix for
that in what's now 0003.

Here's v2. I plan to commit the rest of this fairly soon if there are
no comments.

...Robert


v2-0002-Add-new-pg_walsummary-tool.patch
Description: Binary data


v2-0003-Repair-various-defects-in-dc212340058b4e7ecfc5a7a.patch
Description: Binary data


v2-0001-Add-new-function-pg_get_wal_summarizer_state.patch
Description: Binary data