Re: cleanup patches for incremental backup
On Wed, Mar 20, 2024 at 02:53:01PM -0400, Robert Haas wrote:
> On Wed, Mar 20, 2024 at 2:35 PM Nathan Bossart
> wrote:
>> Committed.
>
> Thanks. Sorry you had to clean up after me. :-(

No worries.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Wed, Mar 20, 2024 at 2:35 PM Nathan Bossart wrote:
> On Tue, Mar 19, 2024 at 01:15:02PM -0500, Nathan Bossart wrote:
> > Assuming there are no objections or feedback, I plan to commit these two
> > patches within the next couple of days.
>
> Committed.

Thanks. Sorry you had to clean up after me. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Tue, Mar 19, 2024 at 01:15:02PM -0500, Nathan Bossart wrote:
> Assuming there are no objections or feedback, I plan to commit these two
> patches within the next couple of days.

Committed.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Thu, Mar 14, 2024 at 08:52:55PM -0500, Nathan Bossart wrote:
> Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
>  test failures."
> Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

Assuming there are no objections or feedback, I plan to commit these two
patches within the next couple of days.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Thu, Mar 14, 2024 at 04:00:10PM -0500, Nathan Bossart wrote:
> Separately, I suppose it's probably time to revert the temporary debugging
> code added by commit 5ddf997. I can craft a patch for that, too.

As promised...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

>From 3923e30acb1d4e21ce311b25edbcd8b96b4223d2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 14 Mar 2024 20:36:48 -0500
Subject: [PATCH v1 1/2] Revert "Temporary patch to help debug pg_walsummary
 test failures."

This reverts commit 5ddf9973477729cf161b4ad0a1efd52f4fea9c88.
---
 src/backend/backup/walsummary.c       |  7 ---
 src/bin/pg_walsummary/t/002_blocks.pl | 14 --
 2 files changed, 21 deletions(-)

diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
index 4d047e1c02..322ae3c3ad 100644
--- a/src/backend/backup/walsummary.c
+++ b/src/backend/backup/walsummary.c
@@ -252,15 +252,8 @@ RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not stat file \"%s\": %m", path)));

-	/* XXX temporarily changed to debug buildfarm failures */
-#if 0
 	ereport(DEBUG2,
 			(errmsg_internal("removing file \"%s\"", path)));
-#else
-	ereport(LOG,
-			(errmsg_internal("removing file \"%s\" cutoff_time=%llu", path,
-							 (unsigned long long) cutoff_time)));
-#endif
 }

diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index d4bae3d564..52d3bd8840 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -51,7 +51,6 @@ my $summarized_lsn = $node1->safe_psql('postgres',

>From 38dda89ee48736489d37e18dea186e90358468b0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart
Date: Thu, 14 Mar 2024 20:46:02 -0500
Subject: [PATCH v1 2/2] Fix possible overflow in MaybeRemoveOldWalSummaries().

---
 src/backend/postmaster/walsummarizer.c | 4 ++--
 src/backend/utils/misc/guc_tables.c    | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;

 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;

 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
--
2.25.1
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 12:05:15PM -0600, Nathan Bossart wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
>
> 	/*
> 	 * Files should only be removed if the last modification time precedes the
> 	 * cutoff time we compute here.
> 	 */
> 	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

I've attached a short patch for fixing this overflow risk. Specifically, it
limits wal_summary_keep_time to INT_MAX / SECS_PER_MINUTE, just like
log_rotation_age. I considered checking for overflow when we subtract the
keep-time from the result of time(2), but AFAICT there's only a problem if
time_t is unsigned, which Wikipedia leads me to believe is unusual [0], so I
figured we might be able to just wait this one out until 2038.

> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Separately, I suppose it's probably time to revert the temporary debugging
code added by commit 5ddf997. I can craft a patch for that, too.

[0] https://en.wikipedia.org/wiki/Unix_time#Representing_the_number

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ec2874c18c..c820d1f9ed 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
  * GUC parameters
  */
 bool		summarize_wal = false;
-int			wal_summary_keep_time = 10 * 24 * 60;
+int			wal_summary_keep_time = 10 * HOURS_PER_DAY * MINS_PER_HOUR;

 static void WalSummarizerShutdown(int code, Datum arg);
 static XLogRecPtr GetLatestLSN(TimeLineID *tli);
@@ -1474,7 +1474,7 @@ MaybeRemoveOldWalSummaries(void)
 	 * Files should only be removed if the last modification time precedes the
 	 * cutoff time we compute here.
 	 */
-	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+	cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;

 	/* Get all the summaries that currently exist. */
 	wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 57d9de4dd9..1e71e7db4a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3293,9 +3293,9 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MIN,
 		},
 		&wal_summary_keep_time,
-		10 * 24 * 60,			/* 10 days */
+		10 * HOURS_PER_DAY * MINS_PER_HOUR, /* 10 days */
 		0,
-		INT_MAX,
+		INT_MAX / SECS_PER_MINUTE,
 		NULL, NULL, NULL
 	},
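As a minimal standalone sketch of the hazard being fixed here (illustrative
code, not PostgreSQL source; only the GUC name and the per-minute unit mirror
walsummarizer.c): wal_summary_keep_time is a plain int, so
60 * wal_summary_keep_time is evaluated in int arithmetic and can overflow
before it is ever widened for the subtraction from time(NULL). Capping the
GUC at INT_MAX / SECS_PER_MINUTE (35791394 minutes, roughly 68 years) keeps
the product representable:

/* Illustrative sketch, not PostgreSQL source. */
#include <limits.h>
#include <stdio.h>
#include <time.h>

#define SECS_PER_MINUTE 60

int
main(void)
{
    /* hypothetical GUC setting at the new, capped maximum */
    int     wal_summary_keep_time = INT_MAX / SECS_PER_MINUTE;
    time_t  cutoff_time;

    /*
     * The cap guarantees wal_summary_keep_time * SECS_PER_MINUTE fits in
     * an int (35791394 * 60 = 2147483640 <= INT_MAX), so the
     * multiplication below is well-defined.  With the old maximum of
     * INT_MAX, the same expression would overflow, and signed int
     * overflow is undefined behavior in C.
     */
    cutoff_time = time(NULL) - wal_summary_keep_time * SECS_PER_MINUTE;

    printf("keep time: %d minutes, cutoff: %lld\n",
           wal_summary_keep_time, (long long) cutoff_time);
    return 0;
}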
Re: cleanup patches for incremental backup
On Tue, Jan 30, 2024 at 11:52 AM Robert Haas wrote:
> Here's a patch for that. I now think
> a7097ca630a25dd2248229f21ebce4968d85d10a was actually misguided, and
> served only to mask some of the failures caused by waiting for the WAL
> summary file.

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Tue, Jan 30, 2024 at 10:51 AM Robert Haas wrote:
> I think the solution here is to find a better way to wait for the
> inserts to be summarized, one that actually does wait for that to
> happen.

Here's a patch for that. I now think
a7097ca630a25dd2248229f21ebce4968d85d10a was actually misguided, and served
only to mask some of the failures caused by waiting for the WAL summary
file.

--
Robert Haas
EDB: http://www.enterprisedb.com

v1-0001-Revise-pg_walsummary-s-002_blocks-test-to-avoid-s.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Mon, Jan 29, 2024 at 4:13 PM Nathan Bossart wrote:
> On Mon, Jan 29, 2024 at 03:18:50PM -0500, Robert Haas wrote:
> > I'm wondering if what we need to do is run pg_walsummary on both
> > summary files in that case. If we just pick one or the other, how do
> > we know which one to pick?
>
> Even if we do that, isn't it possible that none of the summaries will
> include the change? Presently, we get the latest summarized LSN, make a
> change, and then wait for the next summary file with a greater LSN than
> what we saw before the change. But AFAICT there's no guarantee that means
> the change has been summarized yet, although the chances of that happening
> in a test are probably pretty small.
>
> Could we get the LSN before and after making the change and then inspect
> all summaries that include that LSN range?

The trick here is that each WAL summary file covers one checkpoint cycle.
The intent of the test is to load data into the table, checkpoint, see what
summaries exist, then update a row, checkpoint again, and see what summaries
now exist. We expect one new summary because there's been one new
checkpoint.

When I was thinking about this yesterday, I was imagining that we were
somehow getting an extra checkpoint in some cases. But it looks like it's
actually an off-by-one situation. In
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10
the new files that show up between "after insert" and "after new summary"
are:

00010152FAE0015AAAC8.summary (LSN distance ~500k)
00010152F7A80152FAE0.summary (LSN distance 824 bytes)

The checkpoint after the inserts says:

LOG:  checkpoint complete: wrote 14 buffers (10.9%); 0 WAL file(s) added,
0 removed, 0 recycled; write=0.956 s, sync=0.929 s, total=3.059 s; sync
files=39, longest=0.373 s, average=0.024 s; distance=491 kB, estimate=491 kB;
lsn=0/15AAB20, redo lsn=0/15AAAC8

And the checkpoint after the single-row update says:

LOG:  checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s) added,
0 removed, 0 recycled; write=0.648 s, sync=0.355 s, total=2.798 s; sync
files=3, longest=0.348 s, average=0.119 s; distance=11 kB, estimate=443 kB;
lsn=0/15AD770, redo lsn=0/15AD718

So both of the new WAL summary files that are appearing here are from
checkpoints that happened before the single-row update. The larger file is
the one covering the 400 inserts, and the smaller one is from the checkpoint
before that. Which means that the "Wait for a new summary to show up." code
isn't actually waiting long enough, and then the whole thing goes haywire.

The problem is, I think, that this code naively thinks it can just wait for
summarized_lsn to advance and everything will be fine ... but that assumes
we were caught up when we first measured the summarized_lsn, and that need
not be so, because it takes some short but non-zero amount of time for the
summarizer to catch up with the WAL generated during initdb.

I think the solution here is to find a better way to wait for the inserts
to be summarized, one that actually does wait for that to happen.

--
Robert Haas
EDB: http://www.enterprisedb.com
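To make the race concrete, here is a minimal simulated sketch (standalone C;
neither the test code nor the eventual fix, and all values are made up): the
condition that is safe to wait on is "summarization has advanced past the
LSN of the change we made", since "some summary newer than our baseline
exists" can be satisfied by summaries of pre-change WAL whenever the
baseline itself was stale.

/* Simulated sketch, not PostgreSQL or TAP-test code. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* simulated summarizer progress; it starts behind because it has not yet
 * caught up with the WAL generated during initdb */
static XLogRecPtr progress = 0x14E09D0;

/* stand-in for asking the server how far summarization has gotten */
static XLogRecPtr
poll_summarized_lsn(void)
{
    progress += 0x10000;        /* the summarizer advances over time */
    return progress;
}

int
main(void)
{
    /* hypothetical LSN obtained right after making the change */
    XLogRecPtr change_lsn = 0x15AD718;

    /*
     * Robust wait: loop until summarization has provably covered the
     * change itself, rather than until "any summary newer than the
     * baseline" appears.
     */
    while (poll_summarized_lsn() < change_lsn)
        ;                       /* a real test would sleep between polls */

    printf("summarized through 0/%llX\n", (unsigned long long) progress);
    return 0;
}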
Re: cleanup patches for incremental backup
On Mon, Jan 29, 2024 at 03:18:50PM -0500, Robert Haas wrote:
> I'm wondering if what we need to do is run pg_walsummary on both
> summary files in that case. If we just pick one or the other, how do
> we know which one to pick?

Even if we do that, isn't it possible that none of the summaries will
include the change? Presently, we get the latest summarized LSN, make a
change, and then wait for the next summary file with a greater LSN than what
we saw before the change. But AFAICT there's no guarantee that means the
change has been summarized yet, although the chances of that happening in a
test are probably pretty small.

Could we get the LSN before and after making the change and then inspect all
summaries that include that LSN range?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Mon, Jan 29, 2024 at 1:21 PM Nathan Bossart wrote:
> > Ah, I think this query:
> >
> >     SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
> >         WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
> >
> > is returning more than one row in some cases. I attached a quick sketch of
> > an easy way to reproduce the issue as well as one way to fix it.
>
> The buildfarm just caught a failure with the new logging in place:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10
>
> I'm not totally sure my "fix" is correct, but I think this does confirm the
> theory.

Ah. The possibility of ending up with TWO new WAL summaries never occurred
to me. Things that never occurred to the developer are a leading cause of
bugs, and so here.

I'm wondering if what we need to do is run pg_walsummary on both summary
files in that case. If we just pick one or the other, how do we know which
one to pick?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Sat, Jan 27, 2024 at 10:31:09AM -0600, Nathan Bossart wrote:
> On Sat, Jan 27, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
>> I'm discouraged by "\n1" in the file name and in the
>> "examining summary..." message.
>> regress_log_002_blocks from the following successful test run on the same
>> sungazer node contains:
>> [15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to
>> 0/155E750
>> [15:21:58.925](0.001s) ok 1 - WAL summary file exists
>
> Ah, I think this query:
>
>     SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
>         WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
>
> is returning more than one row in some cases. I attached a quick sketch of
> an easy way to reproduce the issue as well as one way to fix it.

The buildfarm just caught a failure with the new logging in place:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-01-29%2018%3A09%3A10

I'm not totally sure my "fix" is correct, but I think this does confirm the
theory.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Sat, Jan 27, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
> 24.01.2024 20:46, Robert Haas wrote:
>> This is weird. There's a little more detail in the log file,
>> regress_log_002_blocks, e.g. from the first failure you linked:
>>
>> [11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
>> [11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
>> [11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
>> # 1
>> ...
>> [11:18:21.349](0.000s) # got: 'pg_walsummary: error: could not open file
>> "/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
>> # 1.summary": No such file or directory'
>>
>> The "examining summary" line is generated based on the output of
>> pg_available_wal_summaries(). The way that works is that the server
>> calls readdir(), disassembles the filename into a TLI and two LSNs,
>> and returns the result.
>
> I'm discouraged by "\n1" in the file name and in the
> "examining summary..." message.
> regress_log_002_blocks from the following successful test run on the same
> sungazer node contains:
> [15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to
> 0/155E750
> [15:21:58.925](0.001s) ok 1 - WAL summary file exists

Ah, I think this query:

    SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
        WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'

is returning more than one row in some cases. I attached a quick sketch of
an easy way to reproduce the issue as well as one way to fix it.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index 40908da8cb..5609b1bd14 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -64,10 +64,16 @@ SELECT EXISTS (
 )
 EOM

+$node1->safe_psql('postgres', <<EOM);
+...
+EOM
+
 my $details = $node1->safe_psql('postgres', <<EOM);
 SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
-    WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
+    WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
+    ORDER BY end_lsn LIMIT 1
 EOM
 my ($tli, $start_lsn, $end_lsn) = split(/\|/, $details);
 note("examining summary for TLI $tli from $start_lsn to $end_lsn");
Re: cleanup patches for incremental backup
24.01.2024 20:46, Robert Haas wrote:
> This is weird. There's a little more detail in the log file,
> regress_log_002_blocks, e.g. from the first failure you linked:
>
> [11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
> [11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
> [11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
> # 1
> ...
> [11:18:21.349](0.000s) # got: 'pg_walsummary: error: could not open file
> "/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
> # 1.summary": No such file or directory'
>
> The "examining summary" line is generated based on the output of
> pg_available_wal_summaries(). The way that works is that the server
> calls readdir(), disassembles the filename into a TLI and two LSNs,
> and returns the result.

I'm discouraged by "\n1" in the file name and in the
"examining summary..." message.

regress_log_002_blocks from the following successful test run on the same
sungazer node contains:

[15:21:58.924](0.106s) # examining summary for TLI 1 from 0/155BAE0 to 0/155E750
[15:21:58.925](0.001s) ok 1 - WAL summary file exists

Best regards,
Alexander
Re: cleanup patches for incremental backup
Hello Robert,

26.01.2024 21:37, Robert Haas wrote:
> On Fri, Jan 26, 2024 at 12:39 PM Nathan Bossart wrote:
>> On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:
>>> Here is v2 with that addition.
>>
>> Looks reasonable.
>
> Thanks for the report & review. I have committed that version.

While trying to reproduce the 002_blocks test failure, I've encountered
another anomaly (or two):

make -s check -C src/bin/pg_walsummary/ PROVE_TESTS="t/002*" PROVE_FLAGS="--timer"
# +++ tap check in src/bin/pg_walsummary +++
[05:40:38] t/002_blocks.pl ..
# poll_query_until timed out executing this query:
# SELECT EXISTS (
#     SELECT * from pg_available_wal_summaries()
#     WHERE tli = 0 AND end_lsn > '0/0'
# )
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[05:40:38] t/002_blocks.pl .. ok   266739 ms ( 0.00 usr  0.01 sys + 17.51 cusr 26.79 csys = 44.31 CPU)
[05:45:05]
All tests successful.
Files=1, Tests=3, 267 wallclock secs ( 0.02 usr  0.02 sys + 17.51 cusr 26.79 csys = 44.34 CPU)
Result: PASS

It looks like the test may call pg_get_wal_summarizer_state() when
WalSummarizerCtl->initialized is still false, i.e., before the first
GetOldestUnsummarizedLSN() call. I could reproduce the issue easily (within
10 test runs) with pg_usleep(10L); added inside WalSummarizerMain() just
below:
	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);

But the fact that the test passes regardless of the timeout makes me wonder
whether any test should fail when such a timeout occurs?

Best regards,
Alexander
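A minimal standalone sketch (C11; all names are illustrative, not the
PostgreSQL shared-memory code) of the hazard described above: a reporting
function has to treat the worker's shared state as absent until the worker
has set its initialized flag, or callers will observe meaningless zeroed
fields instead of real progress.

/* Illustrative C11 sketch, not PostgreSQL source. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    atomic_bool initialized;    /* set by the worker once setup is done */
    uint64_t    summarized_lsn;
} SummarizerSharedState;

/* stands in for a shared-memory segment; zero-initialized like one */
static SummarizerSharedState shared;

/* reader side, analogous in spirit to pg_get_wal_summarizer_state() */
static bool
get_state(uint64_t *lsn)
{
    if (!atomic_load(&shared.initialized))
        return false;           /* worker not started: report no state */
    *lsn = shared.summarized_lsn;
    return true;
}

int
main(void)
{
    uint64_t lsn;

    /* before the worker initializes, callers must get "no state", not 0 */
    printf("before init: %s\n", get_state(&lsn) ? "have state" : "no state");

    /* worker side: publish the data first, then flip the flag */
    shared.summarized_lsn = 0x155E750;
    atomic_store(&shared.initialized, true);

    if (get_state(&lsn))
        printf("after init: summarized_lsn = 0/%llX\n",
               (unsigned long long) lsn);
    return 0;
}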
Re: cleanup patches for incremental backup
On Fri, Jan 26, 2024 at 12:39 PM Nathan Bossart wrote:
> On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:
> > Here is v2 with that addition.
>
> Looks reasonable.

Thanks for the report & review. I have committed that version.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Fri, Jan 26, 2024 at 11:04:37AM -0500, Robert Haas wrote:
> Here is v2 with that addition.

Looks reasonable.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Thu, Jan 25, 2024 at 11:08 AM Nathan Bossart wrote:
> On Thu, Jan 25, 2024 at 10:06:41AM -0500, Robert Haas wrote:
> > On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart
> > wrote:
> >> That seems like a reasonable starting point. Even if it doesn't help
> >> determine the root cause, it should at least help rule out concurrent
> >> summary removal.
> >
> > Here is a patch for that.
>
> LGTM. The only thing I might add is the cutoff_time in that LOG.

Here is v2 with that addition.

--
Robert Haas
EDB: http://www.enterprisedb.com

v2-0001-Temporary-patch-to-help-debug-pg_walsummary-test-.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Thu, Jan 25, 2024 at 10:06:41AM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart
> wrote:
>> That seems like a reasonable starting point. Even if it doesn't help
>> determine the root cause, it should at least help rule out concurrent
>> summary removal.
>
> Here is a patch for that.

LGTM. The only thing I might add is the cutoff_time in that LOG.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 2:39 PM Nathan Bossart wrote:
> That seems like a reasonable starting point. Even if it doesn't help
> determine the root cause, it should at least help rule out concurrent
> summary removal.

Here is a patch for that.

--
Robert Haas
EDB: http://www.enterprisedb.com

v1-0001-Temporary-patch-to-help-debug-pg_walsummary-test-.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 02:08:08PM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart
> wrote:
>> Otherwise, I think we'll probably need to add some additional logging to
>> figure out what is happening...
>
> Where, though? I suppose we could:
>
> 1. Change the server code so that it logs each WAL summary file
> removed at a log level high enough to show up in the test logs.
>
> 2. Change the TAP test so that it prints out readdir(WAL summary
> directory) at various points in the test.

That seems like a reasonable starting point. Even if it doesn't help
determine the root cause, it should at least help rule out concurrent
summary removal.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 1:05 PM Nathan Bossart wrote:
> There might be an overflow risk in the cutoff time calculation, but I doubt
> that's the root cause of these failures:
>
> 	/*
> 	 * Files should only be removed if the last modification time precedes the
> 	 * cutoff time we compute here.
> 	 */
> 	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
>
> Otherwise, I think we'll probably need to add some additional logging to
> figure out what is happening...

Where, though? I suppose we could:

1. Change the server code so that it logs each WAL summary file removed at
a log level high enough to show up in the test logs.

2. Change the TAP test so that it prints out readdir(WAL summary directory)
at various points in the test.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 12:46:16PM -0500, Robert Haas wrote:
> The "examining summary" line is generated based on the output of
> pg_available_wal_summaries(). The way that works is that the server
> calls readdir(), disassembles the filename into a TLI and two LSNs,
> and returns the result. Then, a fraction of a second later, the test
> script reassembles those components into a filename and finds the file
> missing. If the logic to translate between filenames and TLIs & LSNs
> were incorrect, the test would fail consistently. So the only
> explanation that seems to fit the facts is the file disappearing out
> from under us. But that really shouldn't happen. We do have code to
> remove such files in MaybeRemoveOldWalSummaries(), but it's only
> supposed to be nuking files more than 10 days old.
>
> So I don't really have a theory here as to what could be happening. :-(

There might be an overflow risk in the cutoff time calculation, but I doubt
that's the root cause of these failures:

	/*
	 * Files should only be removed if the last modification time precedes the
	 * cutoff time we compute here.
	 */
	cutoff_time = time(NULL) - 60 * wal_summary_keep_time;

Otherwise, I think we'll probably need to add some additional logging to
figure out what is happening...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Wed, Jan 24, 2024 at 12:08 PM Nathan Bossart wrote:
> I'm seeing some recent buildfarm failures for pg_walsummary:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36
>
> The signature looks nearly identical in each:
>
> #   Failed test 'WAL summary file exists'
> #   at t/002_blocks.pl line 79.
>
> #   Failed test 'stdout shows block 0 modified'
> #   at t/002_blocks.pl line 85.
> #   ''
> #     doesn't match '(?^m:FORK main: block 0$)'
>
> I haven't been able to reproduce the issue on my machine, and I haven't
> figured out precisely what is happening yet, but I wanted to make sure
> there is awareness.

This is weird. There's a little more detail in the log file,
regress_log_002_blocks, e.g. from the first failure you linked:

[11:18:20.683](96.787s) # before insert, summarized TLI 1 through 0/14E09D0
[11:18:21.188](0.505s) # after insert, summarized TLI 1 through 0/14E0D08
[11:18:21.326](0.138s) # examining summary for TLI 1 from 0/14E0D08 to 0/155BAF0
# 1
...
[11:18:21.349](0.000s) # got: 'pg_walsummary: error: could not open file
"/home/nm/farm/gcc64/HEAD/pgsql.build/src/bin/pg_walsummary/tmp_check/t_002_blocks_node1_data/pgdata/pg_wal/summaries/0001014E0D08155BAF0
# 1.summary": No such file or directory'

The "examining summary" line is generated based on the output of
pg_available_wal_summaries(). The way that works is that the server calls
readdir(), disassembles the filename into a TLI and two LSNs, and returns
the result. Then, a fraction of a second later, the test script reassembles
those components into a filename and finds the file missing. If the logic to
translate between filenames and TLIs & LSNs were incorrect, the test would
fail consistently. So the only explanation that seems to fit the facts is
the file disappearing out from under us. But that really shouldn't happen.
We do have code to remove such files in MaybeRemoveOldWalSummaries(), but
it's only supposed to be nuking files more than 10 days old.

So I don't really have a theory here as to what could be happening. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com
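Since the argument above hinges on whether the filename <-> (TLI, LSNs)
translation can lose information, here is a minimal standalone round-trip
sketch. It assumes the 8+16+16 hex-digit layout of WAL summary filenames
(8 digits of timeline, then 16 each for the start and end LSNs); the sample
name and values are made up, and this is not the walsummary.c code itself.

/* Illustrative sketch, not PostgreSQL source. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* made-up but well-formed name: TLI, start LSN, end LSN in hex */
    const char *name = "0000000100000000015AAAC8000000000162B418.summary";
    uint32_t    tli;
    uint64_t    start_lsn;
    uint64_t    end_lsn;
    char        rebuilt[64];

    /* disassemble, as the server does after readdir() */
    if (sscanf(name, "%08" SCNx32 "%016" SCNx64 "%016" SCNx64,
               &tli, &start_lsn, &end_lsn) != 3)
        return 1;

    /* reassemble, as the test script does */
    snprintf(rebuilt, sizeof(rebuilt),
             "%08" PRIX32 "%016" PRIX64 "%016" PRIX64 ".summary",
             tli, start_lsn, end_lsn);

    printf("tli=%" PRIu32 " start=%" PRIX64 " end=%" PRIX64 " round-trip %s\n",
           tli, start_lsn, end_lsn,
           strcmp(name, rebuilt) == 0 ? "ok" : "broken");
    return 0;
}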
Re: cleanup patches for incremental backup
I'm seeing some recent buildfarm failures for pg_walsummary:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-01-14%2006%3A21%3A58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2024-01-17%2021%3A10%3A36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-20%2018%3A58%3A49
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-01-23%2002%3A46%3A57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-01-23%2020%3A23%3A36

The signature looks nearly identical in each:

#   Failed test 'WAL summary file exists'
#   at t/002_blocks.pl line 79.

#   Failed test 'stdout shows block 0 modified'
#   at t/002_blocks.pl line 85.
#   ''
#     doesn't match '(?^m:FORK main: block 0$)'

I haven't been able to reproduce the issue on my machine, and I haven't
figured out precisely what is happening yet, but I wanted to make sure there
is awareness.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: cleanup patches for incremental backup
On Thu, Jan 18, 2024 at 4:50 AM Matthias van de Meent wrote:
> Sure, that's fine with me.

OK, committed that way.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Wed, 17 Jan 2024 at 21:10, Robert Haas wrote:
> On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent
> wrote:
>> Sure, added in attached.
>
> I think this mostly looks good now but I just realized that I think
> this needs rephrasing:
>
> +     To restore incremental backups the tool <xref
> +     linkend="app-pgcombinebackup"/>
> +     is used, which combines incremental backups with a base backup and
> +     WAL to restore a
> +     database cluster to
> +     a consistent state.
>
> The way this is worded, at least to me, it makes it sound like
> pg_combinebackup is going to do the WAL recovery for you, which it
> isn't. Maybe:
>
> To restore incremental backups the tool <xref
> linkend="app-pgcombinebackup"/> is used, which combines incremental
> backups with a base backup. Afterwards, recovery can use <link
> linkend="glossary-wal">WAL</link> to bring the <link
> linkend="glossary-db-cluster">database cluster</link> to a
> consistent state.

Sure, that's fine with me.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Re: cleanup patches for incremental backup
On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent wrote:
> Sure, added in attached.

I think this mostly looks good now but I just realized that I think this
needs rephrasing:

+     To restore incremental backups the tool <xref
+     linkend="app-pgcombinebackup"/>
+     is used, which combines incremental backups with a base backup and
+     WAL to restore a
+     database cluster to
+     a consistent state.

The way this is worded, at least to me, it makes it sound like
pg_combinebackup is going to do the WAL recovery for you, which it isn't.
Maybe:

To restore incremental backups the tool <xref
linkend="app-pgcombinebackup"/> is used, which combines incremental backups
with a base backup. Afterwards, recovery can use <link
linkend="glossary-wal">WAL</link> to bring the <link
linkend="glossary-db-cluster">database cluster</link> to a consistent state.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Tue, 16 Jan 2024 at 21:49, Robert Haas wrote:
> On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent
> wrote:
>
> +     A special base backup
> +     that for some WAL-logged relations only contains the pages that were
> +     modified since a previous backup, as opposed to the full relation data of
> +     normal base backups. Like base backups, it is generated by the tool
> +     <xref linkend="app-pgbasebackup"/>.
>
> Could we say "that for some files may contain only those pages that
> were modified since a previous backup, as opposed to the full contents
> of every file"?

Sure, added in attached.

> +     To restore incremental backups the tool <xref
> +     linkend="app-pgcombinebackup"/>
> +     is used, which combines the incremental backups with a base backup and
> +     [...]
>
> I wondered if this needed to be clearer that the chain of backups
> could have length > 2. But on further reflection, I think it's fine,
> unless you feel otherwise.

I removed "the" from the phrasing "the incremental backups", which makes it
a bit less restricted.

> The rest LGTM.

In the latest patch I also fixed the casing of "Incremental Backup" to
"... backup", to be in line with most other multi-word items.

Thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

v3-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent wrote:
> > One other thought is that the incremental backup only replaces
> > relation files with incremental files, and it never does anything
> > about FSM files. So the statement that it only contains data that was
> > potentially changed isn't quite correct. It might be better to phrase
> > it the other way around i.e. it is like a full backup, except that
> > some files can be replaced by incremental files which omit blocks to
> > which no WAL-logged changes have been made.
>
> How about the attached?

I like the direction.

+     A special base backup
+     that for some WAL-logged relations only contains the pages that were
+     modified since a previous backup, as opposed to the full relation data of
+     normal base backups. Like base backups, it is generated by the tool
+     <xref linkend="app-pgbasebackup"/>.

Could we say "that for some files may contain only those pages that were
modified since a previous backup, as opposed to the full contents of every
file"? My thoughts are (1) there's no hard guarantee that an incremental
backup will replace even one file with an incremental file, although in
practice it is probably almost always going to happen and (2)
pg_combinebackup would actually be totally fine with any file at all being
sent incrementally; it's only that the server isn't smart enough to figure
out how to do this with e.g. SLRU data right now.

+     To restore incremental backups the tool <xref
+     linkend="app-pgcombinebackup"/>
+     is used, which combines the incremental backups with a base backup and
+     WAL to restore a
+     database cluster to
+     a consistent state.

I wondered if this needed to be clearer that the chain of backups could have
length > 2. But on further reflection, I think it's fine, unless you feel
otherwise.

The rest LGTM.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Tue, 16 Jan 2024 at 16:39, Robert Haas wrote:
> On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
> wrote:
>> Off-list I was notified that the new WAL summarizer process was not
>> yet added to the glossary, so PFA a patch that does that.
>> In passing, it also adds "incremental backup" to the glossary, and
>> updates the documented types of backends in monitoring.sgml with the
>> new backend type, too.
>
> I wonder if it's possible that you sent the wrong version of this
> patch, because:
>
> (1) The docs don't build with this applied. I'm not sure if it's the
> only problem, but the closing >.

That's my mistake, I didn't check install-world yet due to unrelated issues
building the docs. I've since sorted out these issues (this was a good stick
to get that done), so this issue is fixed in the attached patch.

> (2) The changes to monitoring.sgml contain an unrelated change, about
> pg_stat_all_indexes.idx_scan.

Thanks for noticing; fixed in attached.

> Also, I think the "For more information, see />" bit should have a
> period after the markup tag, as we seem to do in other cases.

Fixed.

> One other thought is that the incremental backup only replaces
> relation files with incremental files, and it never does anything
> about FSM files. So the statement that it only contains data that was
> potentially changed isn't quite correct. It might be better to phrase
> it the other way around i.e. it is like a full backup, except that
> some files can be replaced by incremental files which omit blocks to
> which no WAL-logged changes have been made.

How about the attached?

Kind regards,

Matthias van de Meent

v2-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent wrote:
> Off-list I was notified that the new WAL summarizer process was not
> yet added to the glossary, so PFA a patch that does that.
> In passing, it also adds "incremental backup" to the glossary, and
> updates the documented types of backends in monitoring.sgml with the
> new backend type, too.

I wonder if it's possible that you sent the wrong version of this patch,
because:

(1) The docs don't build with this applied. I'm not sure if it's the only
problem, but .

(2) The changes to monitoring.sgml contain an unrelated change, about
pg_stat_all_indexes.idx_scan.

Also, I think the "For more information, see " bit should have a period
after the markup tag, as we seem to do in other cases.

One other thought is that the incremental backup only replaces relation
files with incremental files, and it never does anything about FSM files. So
the statement that it only contains data that was potentially changed isn't
quite correct. It might be better to phrase it the other way around i.e. it
is like a full backup, except that some files can be replaced by incremental
files which omit blocks to which no WAL-logged changes have been made.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
On Mon, 15 Jan 2024 at 17:58, Robert Haas wrote:
> On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin wrote:
>> I've found one more typo in the sgml:
>> summarized_pid
>> And one in a comment:
>> sumamry
>>
>> A trivial fix is attached.
>
> Thanks, committed.

Off-list I was notified that the new WAL summarizer process was not yet
added to the glossary, so PFA a patch that does that.

In passing, it also adds "incremental backup" to the glossary, and updates
the documented types of backends in monitoring.sgml with the new backend
type, too.

Kind regards,

Matthias van de Meent

v1-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data
Re: cleanup patches for incremental backup
On Sat, Jan 13, 2024 at 1:00 PM Alexander Lakhin wrote:
> I've found one more typo in the sgml:
> summarized_pid
> And one in a comment:
> sumamry
>
> A trivial fix is attached.

Thanks, committed.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
Hello Robert,

12.01.2024 17:50, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan -
> FSIP) wrote:
>> Thank you for developing the new tool. I have attached a patch that
>> corrects the spelling of the --individual option in the SGML file.
>
> Thanks, committed.

I've found one more typo in the sgml:
summarized_pid
And one in a comment:
sumamry

A trivial fix is attached.

Best regards,
Alexander

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f7d409e60..210c7c0b02 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26578,7 +26578,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
         record that it has consumed, which must always be greater than or
         equal to summarized_lsn; if the WAL summarizer is
         not running, it will be equal to summarized_lsn.
-        summarized_pid is the PID of the WAL summarizer
+        summarizer_pid is the PID of the WAL summarizer
         process, if it is running, and otherwise NULL.
diff --git a/src/bin/pg_walsummary/t/002_blocks.pl b/src/bin/pg_walsummary/t/002_blocks.pl
index d473471bc7..d609d2c547 100644
--- a/src/bin/pg_walsummary/t/002_blocks.pl
+++ b/src/bin/pg_walsummary/t/002_blocks.pl
@@ -63,7 +63,7 @@ SELECT EXISTS (
 )
 EOM

-# Figure out the exact details for the new sumamry file.
+# Figure out the exact details for the new summary file.
 my $details = $node1->safe_psql('postgres', <<EOM);
 SELECT tli, start_lsn, end_lsn from pg_available_wal_summaries()
     WHERE tli = $summarized_tli AND end_lsn > '$summarized_lsn'
Re: cleanup patches for incremental backup
On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan -
FSIP) wrote:
> Thank you for developing the new tool. I have attached a patch that
> corrects the spelling of the --individual option in the SGML file.

Thanks, committed.

--
Robert Haas
EDB: http://www.enterprisedb.com
RE: cleanup patches for incremental backup
Hi,

Thank you for developing the new tool. I have attached a patch that corrects
the spelling of the --individual option in the SGML file.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Robert Haas
Sent: Friday, January 12, 2024 3:13 AM
To: PostgreSQL Hackers; Jakub Wartak
Subject: Re: cleanup patches for incremental backup

On Tue, Jan 9, 2024 at 1:18 PM Robert Haas wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are
> no comments.

Done, with a minor adjustment to 0003.

--
Robert Haas
EDB: http://www.enterprisedb.com

pg_walsummary_sgml_v1.diff
Description: pg_walsummary_sgml_v1.diff
Re: cleanup patches for incremental backup
On Tue, Jan 9, 2024 at 1:18 PM Robert Haas wrote:
> Here's v2. I plan to commit the rest of this fairly soon if there are
> no comments.

Done, with a minor adjustment to 0003.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: cleanup patches for incremental backup
Hello again,

I have now committed 0001.

I got some off-list review of 0004; specifically, Michael Paquier said it
looked OK, and Tom Lane found another bug. So I've added a fix for that in
what's now 0003.

Here's v2. I plan to commit the rest of this fairly soon if there are no
comments.

...Robert

v2-0002-Add-new-pg_walsummary-tool.patch
Description: Binary data

v2-0003-Repair-various-defects-in-dc212340058b4e7ecfc5a7a.patch
Description: Binary data

v2-0001-Add-new-function-pg_get_wal_summarizer_state.patch
Description: Binary data