Re: Flush pgstats file during checkpoints
On Mon, Sep 02, 2024 at 05:08:03PM +0300, Heikki Linnakangas wrote:
> Hmm, I'm a bit disappointed this doesn't address replication. It makes sense
> that scans are counted separately on a standby, but it would be nice if
> stats like last_vacuum were propagated from primary to standbys. I guess
> that can be handled separately later.

Yes, it's not something that I'm planning to tackle for this thread.
Speaking of which, the design I have in mind for this area is not "that"
complicated:
- Add a new RMGR for all the stats.
- Add a first callback for stats kinds for WAL inserts, giving each stats
  kind the possibility to pass down data inserted into the record, as we
  want to replicate a portion of the data depending on the kind dealt with.
- Add a second callback for recovery, called depending on the kind ID.

I have not looked into the details yet, but stats to replicate should be
grouped in a single record on transaction commit, or depending on the
flush timing for fixed-numbered stats. Or should we just add them to
commit records?

> Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:
>
> There are various race conditions where a stats entry can be leaked in the
> pgstats file. I.e. a relation is dropped, but its stats entry is retained
> in the stats file after a crash. In the worst case, such leaked entries
> can accumulate until the stats file is manually removed, which resets all
> stats again. Perhaps that's acceptable - it's still possible to leak the
> actual relation file for a new relation on crash, after all, which is much
> worse (I'm glad Horiguchi-san is working on that [1]).

Yeah, that's not an easy issue. We don't really have any protection
against that today, either. Backends can also refer to stats entries in
their shutdown callback that have been dropped concurrently. See some
details about that at https://commitfest.postgresql.org/49/5045/.

> Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
> entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or
> can we make it more watertight so that there are no leaks?

Not sure about this part, TBH. Doing that again in autovacuum does not
excite me much, as it has a cost.

> I think it's failing to flush the stats file at the end-of-recovery
> checkpoint.

Missed that, oops. I'll double-check this area.
--
Michael
Re: Flush pgstats file during checkpoints
> > It'd not be such an issue if we updated stats during recovery, but I
> > don't think we're doing that. Perhaps we should, which might also help
> > on replicas - no idea if it's feasible, though.
>
> Stats on replicas are considered an independent thing AFAIU (scans are
> counted for example in read-only queries). If we were to do that we may
> want to split stats handling between nodes in standby state and crash
> recovery. Not sure if that's worth the complication. First, the stats
> exist at node level.

Hmm, I'm a bit disappointed this doesn't address replication. It makes
sense that scans are counted separately on a standby, but it would be nice
if stats like last_vacuum were propagated from primary to standbys. I
guess that can be handled separately later.

Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:

There are various race conditions where a stats entry can be leaked in the
pgstats file. I.e. a relation is dropped, but its stats entry is retained
in the stats file after a crash. In the worst case, such leaked entries
can accumulate until the stats file is manually removed, which resets all
stats again. Perhaps that's acceptable - it's still possible to leak the
actual relation file for a new relation on crash, after all, which is much
worse (I'm glad Horiguchi-san is working on that [1]). For example:

1. BEGIN; CREATE TABLE foo (); ANALYZE foo;
2. CHECKPOINT;
3. pg_ctl restart -m immediate

This is the same scenario where we leak the relfile, but now you can have
it with e.g. function statistics too.

Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or
can we make it more watertight so that there are no leaks?

If you do this:

    pg_ctl start -D data
    pg_ctl stop -D data -m immediate
    pg_ctl start -D data
    pg_ctl stop -D data -m immediate

you get this in the log:

2024-09-02 16:28:37.874 EEST [1397281] WARNING: found incorrect redo LSN 0/160A3C8 (expected 0/160A440)

I think it's failing to flush the stats file at the end-of-recovery
checkpoint.

[1] https://www.postgresql.org/message-id/20240901.010925.656452225144636594.horikyota.ntt%40gmail.com

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Flush pgstats file during checkpoints
Hi,

On Mon, Aug 26, 2024 at 01:56:40PM +0900, Michael Paquier wrote:
> On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:
> > Applied 0003 for now to add the redo LSN to the pgstats file, adding
> > the redo LSN to the two DEBUG2 entries when reading and writing while
> > on it, that I forgot. (It was not 01:57 where I am now.)
> >
> > Attached is the last one.
>
> The CF bot has been complaining in injection_points as an effect of
> the stats remaining after a crash, so rebased to adapt to that.

Thanks!

Checking the v7 diffs as compared to v4:

1. In pgstat_write_statsfile():

-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));

2. and the ones in injection_points/t/001_stats.pl:

+# On crash the stats are still there.
 $node->stop('immediate');
 $node->start;
 $numcalls = $node->safe_psql('postgres',
 	"SELECT injection_points_stats_numcalls('stats-notice');");
-is($numcalls, '', 'number of stats after crash');
+is($numcalls, '3', 'number of stats after crash');
 $fixedstats = $node->safe_psql('postgres',
 	"SELECT * FROM injection_points_stats_fixed();");
-is($fixedstats, '0|0|0|0|0', 'fixed stats after crash');
+is($fixedstats, '1|0|2|1|1', 'fixed stats after crash');

They both LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Flush pgstats file during checkpoints
On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:
> Applied 0003 for now to add the redo LSN to the pgstats file, adding
> the redo LSN to the two DEBUG2 entries when reading and writing while
> on it, that I forgot. (It was not 01:57 where I am now.)
>
> Attached is the last one.

The CF bot has been complaining in injection_points as an effect of
the stats remaining after a crash, so rebased to adapt to that.
--
Michael

From 2874c96f0325ca8e7495366ea08376de6568f438 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 26 Aug 2024 13:55:28 +0900
Subject: [PATCH v7] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
that it is possible to keep some statistics around in the event of a
crash. Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as there may be stats to flush before we are completely done with the
shutdown checkpoint. Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches the point we are replaying from. The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash before the next checkpoint
that would update the stats.
--- src/include/pgstat.h | 4 +- src/backend/access/transam/xlog.c | 30 + src/backend/utils/activity/pgstat.c | 65 --- .../modules/injection_points/t/001_stats.pl | 6 +- src/test/recovery/t/029_stats_restart.pl | 26 ++-- .../recovery/t/030_stats_cleanup_replica.pl | 2 +- 6 files changed, 60 insertions(+), 73 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index f63159c55c..87a4d2e4f7 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -495,9 +495,9 @@ typedef struct PgStat_PendingWalStats extern Size StatsShmemSize(void); extern void StatsShmemInit(void); -/* Functions called during server startup / shutdown */ +/* Functions called during server startup / checkpoint / shutdown */ extern void pgstat_restore_stats(XLogRecPtr redo); -extern void pgstat_discard_stats(void); +extern void pgstat_flush_stats(XLogRecPtr redo); extern void pgstat_before_server_shutdown(int code, Datum arg); /* Functions for backend initialization */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ee0fb0e28f..52d43436aa 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5402,7 +5402,6 @@ StartupXLOG(void) XLogCtlInsert *Insert; CheckPoint checkPoint; bool wasShutdown; - bool didCrash; bool haveTblspcMap; bool haveBackupLabel; XLogRecPtr EndOfLog; @@ -5522,10 +5521,7 @@ StartupXLOG(void) { RemoveTempXlogFiles(); SyncDataDirectory(); - didCrash = true; } - else - didCrash = false; /* * Prepare for WAL recovery if needed. @@ -5646,14 +5642,8 @@ StartupXLOG(void) * * NB: Restoring replication slot stats relies on slot state to have * already been restored from disk. - * - * TODO: With a bit of extra work we could just start with a pgstat file - * associated with the checkpoint redo location we're starting from. 
*/ - if (didCrash) - pgstat_discard_stats(); - else - pgstat_restore_stats(checkPoint.redo); + pgstat_restore_stats(checkPoint.redo); lastFullPageWrites = checkPoint.fullPageWrites; @@ -7251,6 +7241,15 @@ CreateCheckPoint(int flags) */ END_CRIT_SECTION(); + /* + * The control file update is done, hence it is time to write some fresh + * statistics. Stats are flushed at shutdown by the checkpointer in a + * dedicated before_shmem_exit callback, combined with sanity checks + * related to the shutdown of pgstats. + */ + if (!shutdown) + pgstat_flush_stats(RedoRecPtr); + /* * WAL summaries end when the next XLOG_CHECKPOINT_REDO or * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point @@ -7720,6 +7719,15 @@ CreateRestartPoint(int flags) } LWLockRelease(ControlFileLock); + /* + * The control file update is done, hence it is time to write some fresh + * statistics. Stats are flushed at shutdown by the checkpointer in a + * dedicated before_shmem_exit callback, combined with sanity checks + * related to the shutdown of pgstats. + */ + if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0) + pgstat_flush_stats(RedoRecPtr); + /* * Update the average distance between checkpoints/restartpoints if the * prior checkpoint exists. diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index b2ca3f39b7..1f0fcc39c4 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -10,9 +10,10 @@ * statistics. * * Statis
Re: Flush pgstats file during checkpoints
On Tue, Jul 30, 2024 at 08:53:48AM +, Bertrand Drouvot wrote:
> Did a quick check and still LGTM.

Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.
--
Michael

From 8f1f7a9ca9c19dae88057a5593b0f9c0cd748a88 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v6] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
that it is possible to keep some statistics around in the event of a
crash. Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as there may be stats to flush before we are completely done with the
shutdown checkpoint. Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches the point we are replaying from. The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash before the next checkpoint
that would update the stats.
--- src/include/pgstat.h | 4 +- src/backend/access/transam/xlog.c | 30 + src/backend/utils/activity/pgstat.c | 65 --- src/test/recovery/t/029_stats_restart.pl | 26 ++-- .../recovery/t/030_stats_cleanup_replica.pl | 2 +- 5 files changed, 57 insertions(+), 70 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 043d39a565..f780b56cc3 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats extern Size StatsShmemSize(void); extern void StatsShmemInit(void); -/* Functions called during server startup / shutdown */ +/* Functions called during server startup / checkpoint / shutdown */ extern void pgstat_restore_stats(XLogRecPtr redo); -extern void pgstat_discard_stats(void); +extern void pgstat_flush_stats(XLogRecPtr redo); extern void pgstat_before_server_shutdown(int code, Datum arg); /* Functions for backend initialization */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6499eabe4d..f058d68fc3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5401,7 +5401,6 @@ StartupXLOG(void) XLogCtlInsert *Insert; CheckPoint checkPoint; bool wasShutdown; - bool didCrash; bool haveTblspcMap; bool haveBackupLabel; XLogRecPtr EndOfLog; @@ -5521,10 +5520,7 @@ StartupXLOG(void) { RemoveTempXlogFiles(); SyncDataDirectory(); - didCrash = true; } - else - didCrash = false; /* * Prepare for WAL recovery if needed. @@ -5645,14 +5641,8 @@ StartupXLOG(void) * * NB: Restoring replication slot stats relies on slot state to have * already been restored from disk. - * - * TODO: With a bit of extra work we could just start with a pgstat file - * associated with the checkpoint redo location we're starting from. 
*/ - if (didCrash) - pgstat_discard_stats(); - else - pgstat_restore_stats(checkPoint.redo); + pgstat_restore_stats(checkPoint.redo); lastFullPageWrites = checkPoint.fullPageWrites; @@ -7244,6 +7234,15 @@ CreateCheckPoint(int flags) */ END_CRIT_SECTION(); + /* + * The control file update is done, hence it is time to write some fresh + * statistics. Stats are flushed at shutdown by the checkpointer in a + * dedicated before_shmem_exit callback, combined with sanity checks + * related to the shutdown of pgstats. + */ + if (!shutdown) + pgstat_flush_stats(RedoRecPtr); + /* * WAL summaries end when the next XLOG_CHECKPOINT_REDO or * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point @@ -7713,6 +7712,15 @@ CreateRestartPoint(int flags) } LWLockRelease(ControlFileLock); + /* + * The control file update is done, hence it is time to write some fresh + * statistics. Stats are flushed at shutdown by the checkpointer in a + * dedicated before_shmem_exit callback, combined with sanity checks + * related to the shutdown of pgstats. + */ + if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0) + pgstat_flush_stats(RedoRecPtr); + /* * Update the average distance between checkpoints/restartpoints if the * prior checkpoint exists. diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 81484222cf..0b02353359 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -10,9 +10,10 @@ * statistics. * * Statistics are loaded from the filesystem during startup (by the startup - * process), unless preceded by a crash, in which case all stats are - * discarded. They are written out by the
Re: Flush pgstats file during checkpoints
Hi,

On Tue, Jul 30, 2024 at 03:25:31PM +0900, Michael Paquier wrote:
> On Mon, Jul 29, 2024 at 04:46:17AM +, Bertrand Drouvot wrote:
> > Thanks! 0001 attached is
> > v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
> > so I guess you did not attach the right one.
>
> I did attach the right set of patches, please ignore 0001 entirely:
> the patch series is made of three patches, beginning with 0002 :)

Yeah, saw that ;-)

> > Looking at 0002:
> >
> >    if (!read_chunk(fpin, ptr, info->shared_data_len))
> > +  {
> > +      elog(WARNING, "could not read data of stats kind %d for entry of type %c",
> > +           kind, t);
> >
> > Nit: what about including the "info->shared_data_len" value in the
> > WARNING?
>
> Good idea, so added.

Thanks!

> > Looking at 0003: LGTM
> > Looking at 0004: LGTM
>
> Thanks. Attached are the two remaining patches, for now. I'm
> planning to get back to these in a few days.

Did a quick check and still LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Flush pgstats file during checkpoints
+ pgstat_write_statsfile(GetRedoRecPtr()); } } @@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_write_statsfile(void) +pgstat_write_statsfile(XLogRecPtr redo) { FILE *fpout; int32 format_id; @@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void) format_id = PGSTAT_FILE_FORMAT_ID; write_chunk_s(fpout, &format_id); + /* Write the redo LSN, used to cross check the file read */ + write_chunk_s(fpout, &redo); + /* Write various stats structs for fixed number of objects */ for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++) { @@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_read_statsfile(void) +pgstat_read_statsfile(XLogRecPtr redo) { FILE *fpin; int32 format_id; bool found; const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME; PgStat_ShmemControl *shmem = pgStatLocal.shmem; + XLogRecPtr file_redo; /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); @@ -1550,6 +1555,22 @@ pgstat_read_statsfile(void) goto error; } + /* + * Read the redo LSN stored in the file. + */ + if (!read_chunk_s(fpin, &file_redo)) + { + elog(WARNING, "could not read redo LSN"); + goto error; + } + + if (file_redo != redo) + { + elog(WARNING, "found incorrect redo LSN %X/%X (expected %X/%X)", + LSN_FORMAT_ARGS(file_redo), LSN_FORMAT_ARGS(redo)); + goto error; + } + /* * We found an existing statistics file. Read it and put all the stats * data into place. -- 2.45.2 From d9dd1206b2895fe96c0766244cd65a1b9861e783 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 17 Jul 2024 12:42:43 +0900 Subject: [PATCH v5 2/2] Flush pgstats file during checkpoints The flushes are done for non-shutdown checkpoints and restart points, so as it is possible to keep some statistics around in the event of a crash. 
Keeping the before_shmem_exit() callback to flush the pgstats file in a shutdown sequence is a bit easier than doing so at checkpoint, as they may be stats to flush before we are completely done with the shutdown checkpoint. Keeping the callback also keeps the shutdown of single-user mode simpler, where the stats also need to be flushed. At the beginning of recovery, the stats file is loaded when the redo LSN it stores matches with the point we are replaying from. The file is not removed anymore after being read, to ensure that there is still a source of stats to feed on should the system crash until the next checkpoint that would update the stats. --- src/include/pgstat.h | 4 +- src/backend/access/transam/xlog.c | 30 +--- src/backend/utils/activity/pgstat.c | 68 +-- src/test/recovery/t/029_stats_restart.pl | 26 +-- .../recovery/t/030_stats_cleanup_replica.pl | 2 +- 5 files changed, 59 insertions(+), 71 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 043d39a565..f780b56cc3 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats extern Size StatsShmemSize(void); extern void StatsShmemInit(void); -/* Functions called during server startup / shutdown */ +/* Functions called during server startup / checkpoint / shutdown */ extern void pgstat_restore_stats(XLogRecPtr redo); -extern void pgstat_discard_stats(void); +extern void pgstat_flush_stats(XLogRecPtr redo); extern void pgstat_before_server_shutdown(int code, Datum arg); /* Functions for backend initialization */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b7bb4f9a31..40422a3f3f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5401,7 +5401,6 @@ StartupXLOG(void) XLogCtlInsert *Insert; CheckPoint checkPoint; bool wasShutdown; - bool didCrash; bool haveTblspcMap; bool haveBackupLabel; XLogRecPtr EndOfLog; @@ -5521,10 +5520,7 @@ 
StartupXLOG(void) { RemoveTempXlogFiles(); SyncDataDirectory(); - didCrash = true; } - else - didCrash = false; /* * Prepare for WAL recovery if needed. @@ -5645,14 +5641,8 @@ StartupXLOG(void) * * NB: Restoring replication slot stats relies on slot state to have * already been restored from disk. - * - * TODO: With a bit of extra work we could just start with a pgstat file - * associated with the checkpoint redo location we're starting from. */ - if (didCrash) - pgstat_discard_stats(); - else - pgstat_restore_stats(checkPoint.redo); + pgstat_restore_stats(checkPoint.redo); lastFullPageWrites = checkPoint.fullPageWrites; @@ -7244,6 +7234,15 @@ CreateCheckPoint(int flags) */ END_CRIT_SECTION(); + /* + * The control file update is done, hence it is time to write some fresh + * statistics.
Re: Flush pgstats file during checkpoints
Hi,

On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:
> On Mon, Jul 22, 2024 at 07:01:41AM +, Bertrand Drouvot wrote:
> > 3 ===
> >
> > +   /*
> > +    * Read the redo LSN stored in the file.
> > +    */
> > +   if (!read_chunk_s(fpin, &file_redo) ||
> > +       file_redo != redo)
> > +       goto error;
> >
> > I wonder if it would make sense to have dedicated error messages for
> > "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That
> > would make it easier to diagnose why the stats file is discarded.
>
> Yep. This has been itching me quite a bit, and that's a bit more than
> just the format ID or the redo LSN: it relates to all the read_chunk()
> callers. I've taken a shot at this with patch 0001, implemented on
> top of the rest.

Thanks! 0001 attached is
v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attach the right one.

> Attaching a new v4 series, with all these comments addressed.

Thanks!

Looking at 0002:

1 ===

    if (!read_chunk(fpin, ptr, info->shared_data_len))
+   {
+       elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+            kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?

2 ===

    if (!read_chunk_s(fpin, &name))
+   {
+       elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+            kind, t);
        goto error;
+   }

    if (!pgstat_is_kind_valid(kind))
+   {
+       elog(WARNING, "invalid stats kind %d for entry of type %c",
+            kind, t);
        goto error;
+   }

Shouldn't we swap those two tests so that we check that the kind is valid
right after this one?

    if (!read_chunk_s(fpin, &kind))
+   {
+       elog(WARNING, "could not read stats kind for entry of type %c", t);
        goto error;
+   }

Looking at 0003: LGTM

Looking at 0004: LGTM

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Flush pgstats file during checkpoints
; } @@ -1638,8 +1688,8 @@ pgstat_read_statsfile(void) if (found) { dshash_release_lock(pgStatLocal.shared_hash, p); - elog(WARNING, "found duplicate stats entry %d/%u/%u", - key.kind, key.dboid, key.objoid); + elog(WARNING, "found duplicate stats entry %d/%u/%u of type %c", + key.kind, key.dboid, key.objoid, t); goto error; } @@ -1649,7 +1699,11 @@ pgstat_read_statsfile(void) if (!read_chunk(fpin, pgstat_get_entry_data(key.kind, header), pgstat_get_entry_len(key.kind))) + { + elog(WARNING, "could not read data for entry %d/%u/%u of type %c", + key.kind, key.dboid, key.objoid, t); goto error; + } break; } @@ -1660,7 +1714,10 @@ pgstat_read_statsfile(void) * file */ if (fgetc(fpin) != EOF) +{ + elog(WARNING, "could not read end-of-file"); goto error; +} goto done; -- 2.45.2 From 5a4108f1c23619558f205033964a4257545af6a4 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 23 Jul 2024 12:40:08 +0900 Subject: [PATCH v4 3/4] Add redo LSN to pgstats file This is used in the startup process to check that the file we are loading is the one referring to the latest checkpoint. Bump PGSTAT_FILE_FORMAT_ID. 
--- src/include/pgstat.h| 5 +++-- src/backend/access/transam/xlog.c | 2 +- src/backend/utils/activity/pgstat.c | 35 +++-- 3 files changed, 32 insertions(+), 10 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 6b99bb8aad..043d39a565 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -11,6 +11,7 @@ #ifndef PGSTAT_H #define PGSTAT_H +#include "access/xlogdefs.h" #include "datatype/timestamp.h" #include "portability/instr_time.h" #include "postmaster/pgarch.h" /* for MAX_XFN_CHARS */ @@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus * */ -#define PGSTAT_FILE_FORMAT_ID 0x01A5BCAD +#define PGSTAT_FILE_FORMAT_ID 0x01A5BCAE typedef struct PgStat_ArchiverStats { @@ -466,7 +467,7 @@ extern Size StatsShmemSize(void); extern void StatsShmemInit(void); /* Functions called during server startup / shutdown */ -extern void pgstat_restore_stats(void); +extern void pgstat_restore_stats(XLogRecPtr redo); extern void pgstat_discard_stats(void); extern void pgstat_before_server_shutdown(int code, Datum arg); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d70ba67bac..fa076050f4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5642,7 +5642,7 @@ StartupXLOG(void) if (didCrash) pgstat_discard_stats(); else - pgstat_restore_stats(); + pgstat_restore_stats(checkPoint.redo); lastFullPageWrites = checkPoint.fullPageWrites; diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 4a1724d342..593763eb4c 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -94,6 +94,7 @@ #include #include "access/xact.h" +#include "access/xlog.h" #include "lib/dshash.h" #include "pgstat.h" #include "port/atomics.h" @@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry * -- */ -static void pgstat_write_statsfile(void); -static void pgstat_read_statsfile(void); +static void pgstat_write_statsfile(XLogRecPtr 
redo); +static void pgstat_read_statsfile(XLogRecPtr redo); static void pgstat_reset_after_failure(void); @@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { * Should only be called by the startup process or in single user mode. */ void -pgstat_restore_stats(void) +pgstat_restore_stats(XLogRecPtr redo) { - pgstat_read_statsfile(); + pgstat_read_statsfile(redo); } /* @@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg) if (code == 0) { pgStatLocal.shmem->is_shutdown = true; - pgstat_write_statsfile(); + pgstat_write_statsfile(GetRedoRecPtr()); } } @@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_write_statsfile(void) +pgstat_write_statsfile(XLogRecPtr redo) { FILE *fpout; int32 format_id; @@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void) format_id = PGSTAT_FILE_FORMAT_ID; write_chunk_s(fpout, &format_id); + /* Write the redo LSN, used to cross check the file read */ + write_chunk_s(fpout, &redo); + /* Write various stats structs for fixed number of objects */ for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++) { @@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_read_statsfile(v
Re: Flush pgstats file during checkpoints
Hi,

On Wed, Jul 17, 2024 at 12:52:12PM +0900, Michael Paquier wrote:
> On Tue, Jul 16, 2024 at 10:37:39AM +0900, Michael Paquier wrote:
> > On Fri, Jul 12, 2024 at 01:01:19PM +, Bertrand Drouvot wrote:
> >> Instead of removing the stats file, should we keep it around until the
> >> first call to pgstat_write_statsfile()?
> >
> > Oops. You are right, I have somewhat missed the unlink() once we are
> > done reading the stats file with a correct redo location.
>
> The durable_rename() part has been applied. Please find attached a
> rebase of the patch set with all the other comments addressed.

Thanks!

Looking at 0001:

1 ===

-	pgstat_write_statsfile();
+	pgstat_write_statsfile(GetRedoRecPtr());

Not related to your patch, but this comment in the GetRedoRecPtr()
function:

 * grabbed a WAL insertion lock to read the authoritative value in
 * Insert->RedoRecPtr

sounds weird. Shouldn't that be s/Insert/XLogCtl/?

2 ===

+	/* Write the redo LSN, used to cross check the file loaded */

Nit: s/loaded/read/?

3 ===

+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo) ||
+		file_redo != redo)
+		goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That
would make it easier to diagnose why the stats file is discarded.

Looking at 0002: LGTM

Looking at 0003:

4 ===

@@ -5638,10 +5634,7 @@ StartupXLOG(void)
  * TODO: With a bit of extra work we could just start with a pgstat file
  * associated with the checkpoint redo location we're starting from.
  */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo)

remove the TODO comment?

5 ===

+ * process) if the stats file has a redo LSN that matches with the .

unfinished sentence?

6 ===

- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {

Would it make sense to add an Assert in pgstat_flush_stats()? (checking
what the above comment states).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Flush pgstats file during checkpoints
error; + /* + * Read the redo LSN stored in the file. + */ + if (!read_chunk_s(fpin, &file_redo) || + file_redo != redo) + goto error; + /* * We found an existing statistics file. Read it and put all the stats * data into place. -- 2.45.2 From 511205ae260d02dc2542a8594daf16fbbe7184d6 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 11:10:19 +0900 Subject: [PATCH v3 2/3] Add some DEBUG2 information about the redo LSN of the stats file This is useful for.. Debugging. How surprising. --- src/backend/utils/activity/pgstat.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 1b0e64bf82..e739c9823b 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -1367,7 +1367,8 @@ pgstat_write_statsfile(XLogRecPtr redo) /* we're shutting down, so it's ok to just override this */ pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE; - elog(DEBUG2, "writing stats file \"%s\"", statfile); + elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Open the statistics temp file to write out the current values. @@ -1517,7 +1518,8 @@ pgstat_read_statsfile(XLogRecPtr redo) /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); - elog(DEBUG2, "reading stats file \"%s\"", statfile); + elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Try to open the stats file. If it doesn't exist, the backends simply -- 2.45.2 From 52c6b52a5b795f25f7f638fabaa67875f0b08e04 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 17 Jul 2024 12:42:43 +0900 Subject: [PATCH v3 3/3] Flush pgstats file during checkpoints The flushes are done for non-shutdown checkpoints and restart points, so as it is possible to keep some statistics around in the event of a crash. 
Keeping the before_shmem_exit() callback to flush the pgstats file in a shutdown sequence is a bit easier than doing so at checkpoint, as they may be stats to flush before we are completely done with the shutdown checkpoint. Keeping the callback also keeps the shutdown of single-user mode simpler, where the stats also need to be flushed. At the beginning of recovery, the stats file is loaded when the redo LSN it stores matches with the point we are replaying from. The file is not removed anymore after being read, to ensure that there is still a source of stats to feed on should the system crash until the next checkpoint that would update the stats. --- src/include/pgstat.h | 4 +- src/backend/access/transam/xlog.c | 27 +--- src/backend/utils/activity/pgstat.c | 65 --- src/test/recovery/t/029_stats_restart.pl | 26 ++-- .../recovery/t/030_stats_cleanup_replica.pl | 2 +- 5 files changed, 57 insertions(+), 67 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 043d39a565..f780b56cc3 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats extern Size StatsShmemSize(void); extern void StatsShmemInit(void); -/* Functions called during server startup / shutdown */ +/* Functions called during server startup / checkpoint / shutdown */ extern void pgstat_restore_stats(XLogRecPtr redo); -extern void pgstat_discard_stats(void); +extern void pgstat_flush_stats(XLogRecPtr redo); extern void pgstat_before_server_shutdown(int code, Datum arg); /* Functions for backend initialization */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f137551e8d..9727b0f20c 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -5390,7 +5390,6 @@ StartupXLOG(void) XLogCtlInsert *Insert; CheckPoint checkPoint; bool wasShutdown; - bool didCrash; bool haveTblspcMap; bool haveBackupLabel; XLogRecPtr EndOfLog; @@ -5510,10 +5509,7 @@ 
StartupXLOG(void) { RemoveTempXlogFiles(); SyncDataDirectory(); - didCrash = true; } - else - didCrash = false; /* * Prepare for WAL recovery if needed. @@ -5638,10 +5634,7 @@ StartupXLOG(void) * TODO: With a bit of extra work we could just start with a pgstat file * associated with the checkpoint redo location we're starting from. */ - if (didCrash) - pgstat_discard_stats(); - else - pgstat_restore_stats(checkPoint.redo); + pgstat_restore_stats(checkPoint.redo); lastFullPageWrites = checkPoint.fullPageWrites; @@ -7213,6 +7206,15 @@ CreateCheckPoint(int flags) */ END_CRIT_SECTION(); + /* + * The control file update is done, hence it is time to write some fresh + * statistics. Stats are flushed at shutdown by the checkpo
Re: Flush pgstats file during checkpoints
On Fri, Jul 12, 2024 at 12:10:26PM +, Bertrand Drouvot wrote: > Looking at 0001: > > + /* error logged already */ > > Maybe mention it's already logged by durable_rename() (like it's done in > InstallXLogFileSegment(), BaseBackup() for example). > > Except this nit, 0001 LGTM. Tweaked the comment, and applied 0001 for durable_rename(). Thanks for the review. -- Michael
Re: Flush pgstats file during checkpoints
On Fri, Jul 12, 2024 at 01:01:19PM +, Bertrand Drouvot wrote: > Instead of removing the stat file, should we keep it around until the first > call > to pgstat_write_statsfile()? Oops. You are right, I have somewhat missed the unlink() once we are done reading the stats file with a correct redo location. -- Michael
Re: Flush pgstats file during checkpoints
Hi, On Fri, Jul 12, 2024 at 12:10:26PM +, Bertrand Drouvot wrote: > Need to spend more time and thoughts on 0002+. I think there is a corner case, say:

1. shutdown checkpoint at LSN1
2. startup -> reads the stat file (contains LSN1) -> all good -> stat file is removed
3. crash (no checkpoint occurred between 2. and 3.)
4. startup (last checkpoint is still LSN1) -> no stat file (as removed in 2.)

In that case we start with empty stats. Instead of removing the stat file, should we keep it around until the first call to pgstat_write_statsfile()? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
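The failure mode in this corner case can be sketched as a toy state model (purely illustrative C with made-up names, not PostgreSQL code): removing the file right after reading it means a second crash from the same checkpoint starts with empty stats, while keeping it until the next write does not.

```c
#include <stdbool.h>

/* Toy model of the four steps above (all names invented for
 * illustration).  One startup cycle: load stats if the file exists,
 * then optionally remove the file, which is what the pre-fix code
 * did after a successful read. */
typedef struct
{
    bool    file_exists;    /* is the on-disk stats file present? */
    bool    stats_loaded;   /* did this startup restore any stats? */
} StatsModel;

static void
startup_cycle(StatsModel *m, bool remove_after_read)
{
    m->stats_loaded = m->file_exists;
    if (m->file_exists && remove_after_read)
        m->file_exists = false;     /* old behavior: unlink after read */
}
```

With remove_after_read=true, the second startup (after a crash with no intervening checkpoint) finds no file and starts with empty stats; keeping the file avoids that.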
Re: Flush pgstats file during checkpoints
Hi, On Fri, Jul 12, 2024 at 03:42:21PM +0900, Michael Paquier wrote: > On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote: > > On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote: > >> I think those are two independent issues - knowing that the snapshot is > >> from the last checkpoint, and knowing that it's correct (not corrupted). > >> And yeah, we should be careful about fsync/durable_rename. > > > > Yeah, that's bugging me as well. I don't really get why we would not > > want durability at shutdown for this data. So how about switching the > > end of pgstat_write_statsfile() to use durable_rename()? Sounds like > > an independent change, worth on its own. > > Please find attached a rebased patch set with the durability point > addressed in 0001. There were also some conflicts. Thanks! Looking at 0001: + /* error logged already */ Maybe mention it's already logged by durable_rename() (like it's done in InstallXLogFileSegment(), BaseBackup() for example). Except this nit, 0001 LGTM. Need to spend more time and thoughts on 0002+. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Flush pgstats file during checkpoints
Hi, On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote: > On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote: > > I think those are two independent issues - knowing that the snapshot is > > from the last checkpoint, and knowing that it's correct (not corrupted). > > And yeah, we should be careful about fsync/durable_rename. > > Yeah, that's bugging me as well. I don't really get why we would not > want durability at shutdown for this data. So how about switching the > end of pgstat_write_statsfile() to use durable_rename()? Sounds like > an independent change, worth on its own. > > > Yeah, I was wondering about the same thing - can't this mean we fail to > > start autovacuum? Let's say we delete a significant fraction of a huge > > table, but then we crash before the next checkpoint. With this patch we > > restore the last stats snapshot, which can easily have > > > > n_dead_tup | 0 > > n_mod_since_analyze | 0 > > > > for the table. And thus we fail to notice the table needs autovacuum. > > AFAIK we run autovacuum on all tables with missing stats (or am I > > wrong?). That's what's happening on replicas after switchover/failover > > too, right? > > That's the opposite, please look at relation_needs_vacanalyze(). If a > table does not have any stats found in pgstats, like on HEAD after a > crash, then autoanalyze is skipped and autovacuum happens only for the > anti-wrap case. > > For the database case, rebuild_database_list() uses > pgstat_fetch_stat_dbentry() three times, discards entirely databases > that have no stats. Again, there should be a stats entry post a > crash upon a reconnection. > > So there's an argument in saying that the situation does not get worse > here and that we actually may improve odds by keeping a trace of the > previous stats, no? 
I agree, we still could get autoanalyze/autovacuum skipped due to obsolete/stale stats being restored from the last checkpoint but that's better than having them always skipped (current HEAD). > In most cases, there would be a stats entry > post-crash as an INSERT or a scan would have re-created it, Yeap. > but the > stats would reflect the state registered since the last crash recovery > (even on HEAD, a bulk delete bumping the dead tuple counter would not > be detected post-crash). Right. > The case of spiky workloads may impact the > decision-making, of course, but at least we'd allow autovacuum to take > some decision over giving up entirely based on some previous state of > the stats. That sounds like a win for me with steady workloads, less > for bulby workloads.. I agree and it is not worse (though not ideal) than currently on HEAD. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
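To make the trade-off concrete, the threshold side of this can be sketched standalone (assumed, simplified names; this is not the actual relation_needs_vacanalyze() code): a missing stats entry disables the check entirely, whereas a stale restored entry still lets the usual dead-tuple threshold fire.

```c
#include <stdbool.h>

/* Hypothetical standalone sketch of the decision discussed above.
 * The threshold formula mirrors autovacuum's documented one:
 * vacthresh = base_threshold + scale_factor * reltuples. */
typedef struct
{
    bool    has_stats;      /* was a pgstats entry found at all? */
    double  n_dead_tup;     /* dead tuples since the last vacuum */
    double  reltuples;      /* estimated live tuples */
} RelStatsSketch;

static bool
needs_autovacuum_sketch(const RelStatsSketch *st,
                        double vac_base_thresh,     /* cf. autovacuum_vacuum_threshold */
                        double vac_scale_factor)    /* cf. autovacuum_vacuum_scale_factor */
{
    double  vacthresh;

    /* No stats entry (e.g. after a crash on HEAD): the threshold check
     * cannot fire, leaving only the anti-wraparound case. */
    if (!st->has_stats)
        return false;

    vacthresh = vac_base_thresh + vac_scale_factor * st->reltuples;
    return st->n_dead_tup > vacthresh;
}
```

A stale entry with n_dead_tup still at 0 also skips the vacuum, which is the "not ideal" part; but unlike the missing-entry case, any counted dead tuples keep the check alive.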
Re: Flush pgstats file during checkpoints
single user mode. */ void -pgstat_restore_stats(void) +pgstat_restore_stats(XLogRecPtr redo) { - pgstat_read_statsfile(); + pgstat_read_statsfile(redo); } /* @@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg) if (code == 0) { pgStatLocal.shmem->is_shutdown = true; - pgstat_write_statsfile(); + pgstat_write_statsfile(GetRedoRecPtr()); } } @@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_write_statsfile(void) +pgstat_write_statsfile(XLogRecPtr redo) { FILE *fpout; int32 format_id; @@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void) format_id = PGSTAT_FILE_FORMAT_ID; write_chunk_s(fpout, &format_id); + /* Write the redo LSN, used to cross check the file loaded */ + write_chunk_s(fpout, &redo); + /* Write various stats structs for fixed number of objects */ for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++) { @@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_read_statsfile(void) +pgstat_read_statsfile(XLogRecPtr redo) { FILE *fpin; int32 format_id; bool found; const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME; PgStat_ShmemControl *shmem = pgStatLocal.shmem; + XLogRecPtr file_redo; /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); @@ -1541,6 +1546,13 @@ pgstat_read_statsfile(void) format_id != PGSTAT_FILE_FORMAT_ID) goto error; + /* + * Read the redo LSN stored in the file. + */ + if (!read_chunk_s(fpin, &file_redo) || + file_redo != redo) + goto error; + /* * We found an existing statistics file. Read it and put all the stats * data into place. -- 2.45.2 From 54e5253ee94229ea7c0c39960c9f10d305c1ef21 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 11:10:19 +0900 Subject: [PATCH v2 3/4] Add some DEBUG2 information about the redo LSN of the stats file This is useful for.. Debugging. How surprising. 
--- src/backend/utils/activity/pgstat.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 2952604a51..148dbf2c4a 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -1367,7 +1367,8 @@ pgstat_write_statsfile(XLogRecPtr redo) /* we're shutting down, so it's ok to just override this */ pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE; - elog(DEBUG2, "writing stats file \"%s\"", statfile); + elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Open the statistics temp file to write out the current values. @@ -1517,7 +1518,8 @@ pgstat_read_statsfile(XLogRecPtr redo) /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); - elog(DEBUG2, "reading stats file \"%s\"", statfile); + elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Try to open the stats file. If it doesn't exist, the backends simply -- 2.45.2 From beda72b0ee48dd62ed60205a9cb4ca1496b45da5 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 14:49:33 +0900 Subject: [PATCH v2 4/4] Flush pgstats file during checkpoints The flushes are done for non-shutdown checkpoints and restart points, so as it is possible to keep some statistics around in the event of a crash. Keeping the before_shmem_exit() callback to flush the pgstats file in a shutdown sequence is a bit easier than doing so at checkpoint, as they may be stats to flush before we are completely done with the shutdown checkpoint. Keeping the callback also keeps the shutdown of single-user mode simpler, where the stats also need to be flushed. At the beginning of recovery, the stats file is loaded when the redo LSN it stores matches with the point we are replaying from. 
--- src/include/pgstat.h | 4 +- src/backend/access/transam/xlog.c| 27 --- src/backend/utils/activity/pgstat.c | 62 ++-- src/test/recovery/t/029_stats_restart.pl | 22 +++-- 4 files changed, 54 insertions(+), 61 deletions(-) diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 043d39a565..f780b56cc3 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats extern Size StatsShmemSize(void); extern void StatsShmemInit(void); -/* Functions called during server startup / shutdown */ +/* Functions called during server startup / checkpoint / shutdown */ extern void pgstat_restore_stats(XLogRecPtr redo); -extern void pgstat_discard_stats(void); +extern void pgstat_flush_stats(XLogRecPtr redo); ext
Re: Flush pgstats file during checkpoints
On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote: > I think those are two independent issues - knowing that the snapshot is > from the last checkpoint, and knowing that it's correct (not corrupted). > And yeah, we should be careful about fsync/durable_rename. Yeah, that's bugging me as well. I don't really get why we would not want durability at shutdown for this data. So how about switching the end of pgstat_write_statsfile() to use durable_rename()? Sounds like an independent change, worth on its own. > Yeah, I was wondering about the same thing - can't this mean we fail to > start autovacuum? Let's say we delete a significant fraction of a huge > table, but then we crash before the next checkpoint. With this patch we > restore the last stats snapshot, which can easily have > > n_dead_tup | 0 > n_mod_since_analyze | 0 > > for the table. And thus we fail to notice the table needs autovacuum. > AFAIK we run autovacuum on all tables with missing stats (or am I > wrong?). That's what's happening on replicas after switchover/failover > too, right? That's the opposite, please look at relation_needs_vacanalyze(). If a table does not have any stats found in pgstats, like on HEAD after a crash, then autoanalyze is skipped and autovacuum happens only for the anti-wrap case. For the database case, rebuild_database_list() uses pgstat_fetch_stat_dbentry() three times, discards entirely databases that have no stats. Again, there should be a stats entry post a crash upon a reconnection. So there's an argument in saying that the situation does not get worse here and that we actually may improve odds by keeping a trace of the previous stats, no? In most cases, there would be a stats entry post-crash as an INSERT or a scan would have re-created it, but the stats would reflect the state registered since the last crash recovery (even on HEAD, a bulk delete bumping the dead tuple counter would not be detected post-crash). 
The case of spiky workloads may impact the decision-making, of course, but at least we'd allow autovacuum to take some decision over giving up entirely based on some previous state of the stats. That sounds like a win for me with steady workloads, less for bulby workloads.. > It'd not be such an issue if we updated stats during recovery, but I > don't think we're doing that. Perhaps we should, which might also help > on replicas - no idea if it's feasible, though. Stats on replicas are considered an independent thing AFAIU (scans are counted for example in read-only queries). If we were to do that we may want to split stats handling between nodes in standby state and crash recovery. Not sure if that's worth the complication. First, the stats exist at node-level. -- Michael
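As a rough illustration of the durable_rename() idea proposed above, here is a minimal POSIX sketch (simplified error handling; the real PostgreSQL durable_rename() differs in detail): fsync the file, rename it into place, then fsync the parent directory so the rename itself survives a crash.

```c
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    char    dirbuf[1024];
    int     fd;

    /* Flush the file's contents to stable storage before renaming. */
    fd = open(oldpath, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(oldpath, newpath) != 0)
        return -1;

    /* Make the directory entry change durable too; dirname() may
     * modify its argument, so work on a copy of the path. */
    strncpy(dirbuf, newpath, sizeof(dirbuf) - 1);
    dirbuf[sizeof(dirbuf) - 1] = '\0';
    fd = open(dirname(dirbuf), O_RDONLY);
    if (fd < 0)
        return -1;
    (void) fsync(fd);   /* some filesystems reject fsync on directories */
    close(fd);
    return 0;
}
```

A plain rename() alone leaves both the data and the directory entry at the mercy of the OS cache, which is the corruption concern raised earlier in the thread.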
Re: Flush pgstats file during checkpoints
On 6/28/24 09:32, Konstantin Knizhnik wrote: > > On 18/06/2024 9:01 am, Michael Paquier wrote: >> Hi all, >> >> On HEAD, xlog.c has the following comment, which has been on my own >> TODO list for a couple of weeks now: >> >> * TODO: With a bit of extra work we could just start with a >> pgstat file >> * associated with the checkpoint redo location we're starting from. >> >> Please find a patch series to implement that, giving the possibility >> to keep statistics after a crash rather than discard them. I have >> been looking at the code for a while, before settling down on: >> - Forcing the flush of the pgstats file to happen during non-shutdown >> checkpoint and restart points, after updating the control file's redo >> LSN and the critical sections in the area. >> - Leaving the current before_shmem_exit() callback around, as a matter >> of delaying the flush of the stats for as long as possible in a >> shutdown sequence. This also makes the single-user mode shutdown >> simpler. >> - Adding the redo LSN to the pgstats file, with a bump of >> PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change >> is independently useful on its own when loading stats after a clean >> startup, as well. >> - The crash recovery case is simplified, as there is no more need for >> the "discard" code path. >> - Not using a logic where I would need to stick a LSN into the stats >> file name, implying that we would need a potential lookup at the >> contents of pg_stat/ to clean up past files at crash recovery. These >> should not be costly, but I'd rather not add more of these. >> >> 7ff23c6d277d, that has removed the last call of CreateCheckPoint() >> from the startup process, is older than 5891c7a8ed8f, still it seems >> to me that pgstats relies on some areas of the code that don't make >> sense on HEAD (see locking mentioned at the top of the write routine >> for example). 
The logic gets straight-forward to think about as >> restart points and checkpoints always run from the checkpointer, >> implying that pgstat_write_statsfile() is already called only from the >> postmaster in single-user mode or the checkpointer itself, at >> shutdown. >> >> Attached is a patch set, with the one being the actual feature, with >> some stuff prior to that: >> - 0001 adds the redo LSN to the pgstats file flushed. >> - 0002 adds an assertion in pgstat_write_statsfile(), to check from >> where it is called. >> - 0003 with more debugging. >> - 0004 is the meat of the thread. >> >> I am adding that to the next CF. Thoughts and comments are welcome. >> Thanks, >> -- >> Michael > > Hi Michael. > > I am working mostly on the same problem - persisting pgstat state in > Neon (because of separation of storage and compute it has no local files). > I have two questions concerning this PR and the whole strategy for > saving pgstat state between sessions. > > 1. Right now pgstat.stat is discarded after abnormal Postgres > termination. And in your PR we are storing LSN in pgstat.staty to check > that it matches checkpoint redo LSN. I wonder if having outdated version > of pgstat.stat is worse than not having it at all? Comment in xlog.c > says: "When starting with crash recovery, reset pgstat data - it might > not be valid." But why it may be invalid? We are writing it first to > temp file and then rename. May be fsync() should be added here and > durable_rename() should be used instead of rename(). But it seems to be > better than loosing statistics. And should not add some significant > overhead (because it happens only at shutdown). In your case we are > checking LSN of pgstat.stat file. But once again - why it is better to > discard file than load version from previous checkpoint? > I think those are two independent issues - knowing that the snapshot is from the last checkpoint, and knowing that it's correct (not corrupted). 
And yeah, we should be careful about fsync/durable_rename. > 2. Second question is also related with 1). So we saved pgstat.stat on > checkpoint, then did some updates and then Postgres crashed. As far as I > understand with your patch we will successfully restore pgstats.stat > file. But it is not actually up-to-date: it doesn't reflect information > about recent updates. If it was so critical to keep pgstat.stat > up-to-date, then why do we allow to restore state on most recent > checkpoint? > Yeah, I was wondering about the same thing - can't this mean we fail to start autovacuum? Let's say we delete a significant fraction of a huge table, but then we crash before the next checkpoint. With this patch we restore the last stats snapshot, which can easily have n_dead_tup | 0 n_mod_since_analyze | 0 for the table. And thus we fail to notice the table needs autovacuum. AFAIK we run autovacuum on all tables with missing stats (or am I wrong?). That's what's happening on replicas after switchover/failover too, right? It'd not be such an issue if we updated stats during recovery, but I don't think we're doing that. Perhaps we should, which might also help on replicas - no idea if it's feasible, though.
Re: Flush pgstats file during checkpoints
On 18/06/2024 9:01 am, Michael Paquier wrote: Hi all, On HEAD, xlog.c has the following comment, which has been on my own TODO list for a couple of weeks now: * TODO: With a bit of extra work we could just start with a pgstat file * associated with the checkpoint redo location we're starting from. Please find a patch series to implement that, giving the possibility to keep statistics after a crash rather than discard them. I have been looking at the code for a while, before settling down on: - Forcing the flush of the pgstats file to happen during non-shutdown checkpoint and restart points, after updating the control file's redo LSN and the critical sections in the area. - Leaving the current before_shmem_exit() callback around, as a matter of delaying the flush of the stats for as long as possible in a shutdown sequence. This also makes the single-user mode shutdown simpler. - Adding the redo LSN to the pgstats file, with a bump of PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change is independently useful on its own when loading stats after a clean startup, as well. - The crash recovery case is simplified, as there is no more need for the "discard" code path. - Not using a logic where I would need to stick a LSN into the stats file name, implying that we would need a potential lookup at the contents of pg_stat/ to clean up past files at crash recovery. These should not be costly, but I'd rather not add more of these. 7ff23c6d277d, that has removed the last call of CreateCheckPoint() from the startup process, is older than 5891c7a8ed8f, still it seems to me that pgstats relies on some areas of the code that don't make sense on HEAD (see locking mentioned at the top of the write routine for example). 
The logic gets straight-forward to think about as restart points and checkpoints always run from the checkpointer, implying that pgstat_write_statsfile() is already called only from the postmaster in single-user mode or the checkpointer itself, at shutdown. Attached is a patch set, with the one being the actual feature, with some stuff prior to that: - 0001 adds the redo LSN to the pgstats file flushed. - 0002 adds an assertion in pgstat_write_statsfile(), to check from where it is called. - 0003 with more debugging. - 0004 is the meat of the thread. I am adding that to the next CF. Thoughts and comments are welcome. Thanks, -- Michael Hi Michael. I am working mostly on the same problem - persisting pgstat state in Neon (because of separation of storage and compute it has no local files). I have two questions concerning this PR and the whole strategy for saving pgstat state between sessions. 1. Right now pgstat.stat is discarded after abnormal Postgres termination. And in your PR we are storing LSN in pgstat.stat to check that it matches checkpoint redo LSN. I wonder if having outdated version of pgstat.stat is worse than not having it at all? Comment in xlog.c says: "When starting with crash recovery, reset pgstat data - it might not be valid." But why might it be invalid? We are writing it first to temp file and then rename. Maybe fsync() should be added here and durable_rename() should be used instead of rename(). But it seems to be better than losing statistics. And should not add any significant overhead (because it happens only at shutdown). In your case we are checking LSN of pgstat.stat file. But once again - why is it better to discard the file than to load the version from the previous checkpoint?
But it is not actually up-to-date: it doesn't reflect information about recent updates. If it was so critical to keep pgstat.stat up-to-date, then why do we allow restoring state from the most recent checkpoint? Thanks, Konstantin
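The LSN cross-check being discussed can be sketched standalone (hypothetical names and a made-up magic number, not the actual pgstat file format code): write a format id plus the redo LSN as a header, and on read reject — but do not unlink — a file whose stored LSN does not match.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STATS_FORMAT_ID_SKETCH 0x01A5C3E7  /* made-up magic, not PGSTAT_FILE_FORMAT_ID */

typedef uint64_t LsnSketch;     /* stand-in for XLogRecPtr */

/* Write a minimal header: format id, then the checkpoint redo LSN. */
static bool
write_stats_header(const char *path, LsnSketch redo)
{
    FILE   *fp = fopen(path, "wb");
    int32_t format_id = STATS_FORMAT_ID_SKETCH;
    bool    ok;

    if (fp == NULL)
        return false;
    ok = fwrite(&format_id, sizeof(format_id), 1, fp) == 1 &&
         fwrite(&redo, sizeof(redo), 1, fp) == 1;
    return (fclose(fp) == 0) && ok;
}

/* Read the header back and cross-check both fields.  On mismatch the
 * file is rejected but intentionally NOT unlinked, so it remains a
 * source of stats until the next successful write replaces it. */
static bool
read_stats_header(const char *path, LsnSketch expected_redo)
{
    FILE      *fp = fopen(path, "rb");
    int32_t    format_id;
    LsnSketch  file_redo;
    bool       ok;

    if (fp == NULL)
        return false;
    ok = fread(&format_id, sizeof(format_id), 1, fp) == 1 &&
         format_id == (int32_t) STATS_FORMAT_ID_SKETCH &&
         fread(&file_redo, sizeof(file_redo), 1, fp) == 1 &&
         file_redo == expected_redo;
    fclose(fp);
    return ok;
}
```

The mismatch path answers Konstantin's question pragmatically: a file written at a different redo location is treated as stale and skipped, rather than trusted or destroyed.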
Flush pgstats file during checkpoints
+static void pgstat_read_statsfile(XLogRecPtr redo); static void pgstat_reset_after_failure(void); @@ -404,9 +405,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = { * Should only be called by the startup process or in single user mode. */ void -pgstat_restore_stats(void) +pgstat_restore_stats(XLogRecPtr redo) { - pgstat_read_statsfile(); + pgstat_read_statsfile(redo); } /* @@ -482,7 +483,7 @@ pgstat_before_server_shutdown(int code, Datum arg) if (code == 0) { pgStatLocal.shmem->is_shutdown = true; - pgstat_write_statsfile(); + pgstat_write_statsfile(GetRedoRecPtr()); } } @@ -1305,7 +1306,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_write_statsfile(void) +pgstat_write_statsfile(XLogRecPtr redo) { FILE *fpout; int32 format_id; @@ -1340,6 +1341,9 @@ pgstat_write_statsfile(void) format_id = PGSTAT_FILE_FORMAT_ID; write_chunk_s(fpout, &format_id); + /* Write the redo LSN, used to cross check the file loaded */ + write_chunk_s(fpout, &redo); + /* * XXX: The following could now be generalized to just iterate over * pgstat_kind_infos instead of knowing about the different kinds of @@ -1480,13 +1484,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len) * stats so locking is not required. */ static void -pgstat_read_statsfile(void) +pgstat_read_statsfile(XLogRecPtr redo) { FILE *fpin; int32 format_id; bool found; const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME; PgStat_ShmemControl *shmem = pgStatLocal.shmem; + XLogRecPtr file_redo; /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); @@ -1520,6 +1525,13 @@ pgstat_read_statsfile(void) format_id != PGSTAT_FILE_FORMAT_ID) goto error; + /* + * Read the redo LSN stored in the file. 
+ */ + if (!read_chunk_s(fpin, &file_redo) || + file_redo != redo) + goto error; + /* * XXX: The following could now be generalized to just iterate over * pgstat_kind_infos instead of knowing about the different kinds of -- 2.45.1 From 9b252f56d1cad3e12001829bbb61c51cc742e9e0 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 10:59:31 +0900 Subject: [PATCH 2/4] Add assertion in pgstat_write_statsfile() This routine can currently only be called from the postmaster in single-user mode or the checkpointer, so make sure that this is always the case. --- src/backend/utils/activity/pgstat.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 824f742bde..9cdd986582 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -1317,6 +1317,9 @@ pgstat_write_statsfile(XLogRecPtr redo) pgstat_assert_is_up(); + /* should be called only by the checkpointer or single user mode */ + Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER); + /* we're shutting down, so it's ok to just override this */ pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE; -- 2.45.1 From ed7fe49cf147b09def2b78554a989d592e02fe07 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 11:10:19 +0900 Subject: [PATCH 3/4] Add some DEBUG2 information about the redo LSN of the stats file This is useful for.. Debugging. How surprising. 
--- src/backend/utils/activity/pgstat.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c index 9cdd986582..855dec9e52 100644 --- a/src/backend/utils/activity/pgstat.c +++ b/src/backend/utils/activity/pgstat.c @@ -1323,7 +1323,8 @@ pgstat_write_statsfile(XLogRecPtr redo) /* we're shutting down, so it's ok to just override this */ pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE; - elog(DEBUG2, "writing stats file \"%s\"", statfile); + elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Open the statistics temp file to write out the current values. @@ -1499,7 +1500,8 @@ pgstat_read_statsfile(XLogRecPtr redo) /* shouldn't be called from postmaster */ Assert(IsUnderPostmaster || !IsPostmasterEnvironment); - elog(DEBUG2, "reading stats file \"%s\"", statfile); + elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile, + LSN_FORMAT_ARGS(redo)); /* * Try to open the stats file. If it doesn't exist, the backends simply -- 2.45.1 From d206073d3cc6a0a5ed9228cdd3089c24d88f51f6 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 18 Jun 2024 14:49:33 +0900 Subject: [PATCH 4/4] Flush pgstats file during checkpoints The flushes are done for non-shutdown checkpoints and restart points, so as it is possible to keep some statistics around in the event of a crash. Keeping the