Re: Changing the state of data checksums in a running cluster

Heikki Linnakangas Tue, 17 Mar 2026 04:45:47 -0700

Here's a quick, non-exhaustive review of v20260316 of the patch. Ihaven't followed the thread, so forgive me if some of these things havealready been discussed.


I don't see any major issues, just a bunch of small things:

+/*
+ * Checksum version 0 is used for when data checksums are disabled (OFF).
+ * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
+ * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
+ * checksums are either currently being enabled or disabled.
+ */
+typedef enum ChecksumType
+{
+       PG_DATA_CHECKSUM_OFF = 0,
+       PG_DATA_CHECKSUM_VERSION,
+       PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
+       PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION
+} ChecksumType;

Naming: This is called "VERSION", but also "Type", and I'm a littleconfused. I think the field was called "version" because we envisionedthat we might want to use a different checksum algorithm in the future,which could then be indicated by a different checksum version number. Ifwe do that in the future, how will these values look like?

@@ -831,9 +852,10 @@ XLogInsertRecord(XLogRecData *rdata,
                 * only happen just after a checkpoint, so it's better to be 
slow in
                 * this case and fast otherwise.
                 *
-                * Also check to see if fullPageWrites was just turned on or 
there's a
-                * running backup (which forces full-page writes); if we weren't
-                * already doing full-page writes then go back and recompute.
+                * Also check to see if fullPageWrites was just turned on, 
there's a
+                * running backup or if checksums are enabled (all of which 
forces
+                * full-page writes); if we weren't already doing full-page 
writes
+                * then go back and recompute.
                 *
                 * If we aren't doing full-page writes then RedoRecPtr doesn't
                 * actually affect the contents of the XLOG record, so we'll 
update

It's true that if checksums are enabled, we must do full-page writes.But that's already included in the "fullpageWrites" flag. We're notspecifically checking whether checksums are enabled here, so I find thiscomment change more confusing than helpful.

void
SetDataChecksumsOnInProgress(void)
{
        uint64          barrier;

        Assert(ControlFile != NULL);

        /*
         * The state transition is performed in a critical section with
         * checkpoints held off to provide crash safety.
         */
        START_CRIT_SECTION();
        MyProc->delayChkptFlags |= DELAY_CHKPT_START;

        XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);

        SpinLockAcquire(&XLogCtl->info_lck);
        XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
        SpinLockRelease(&XLogCtl->info_lck);

        barrier = 
EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);

        MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
        END_CRIT_SECTION();

        /*
         * Update the controlfile before waiting since if we have an immediate
         * shutdown while waiting we want to come back up with checksums 
enabled.
         */
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->data_checksum_version = 
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
        UpdateControlFile();
        LWLockRelease(ControlFileLock);

        /*
         * Await state change in all backends to ensure that all backends are in
         * "inprogress-on". Once done we know that all backends are writing data
         * checksums.
         */
        WaitForProcSignalBarrier(barrier);
}

If a checkpoint starts, finishes, and you then crash, all between theEND_CRIT_SECTION() and LWLockAcquire here, are you left with with wrong'data_checksum_version' in the control file? (Highly theoretical, thewindow is minuscule, but still)

+/*
+ * InitLocalControlData
+ *
+ * Set up backend local caches of controldata variables which may change at
+ * any point during runtime and thus require special cased locking. So far
+ * this only applies to data_checksum_version, but it's intended to be general
+ * purpose enough to handle future cases.
+ */
+void
+InitLocalDataChecksumVersion(void)
+{
+       SpinLockAcquire(&XLogCtl->info_lck);
+       SetLocalDataChecksumVersion(XLogCtl->data_checksum_version);
+       SpinLockRelease(&XLogCtl->info_lck);
+}


Function name in the comment doesn't match.

+               ereport(WARNING,
+                               errmsg("data checksums state has been set to 
off"),
+                               errhint("If checksums were being enabled during 
shutdown then processing must be manually restarted."));

This is a little unclear. If we reach this, we know that the checkumcalculation was interrupted, no need to caveat it with "if checksumswere being enabled", right? Maybe something like:


WARNING: enabling data checksums was interrupted
HINT: The checksum processing must be manually restarted

I feel we're missing a term for the process of enabling checkums."checksum enablement"? "initial checksum calculation"? In some commentsit's called "checksumming".

 * The DataChecksumsWorker will compile a list of databases which exist at the
 * start of checksumming, and once all are processed will regenerate the list
 * and start over processing any new entries. Once there are no new entries on
 * the list, processing will end.  All databases MUST BE successfully processed
 * in order for data checksums to be enabled, the only exception are databases
 * which are dropped before having been processed.

Why does the list need to be regenerated? Any newly created databaseswould already have checksums enabled, right? Is this because you mightuse database that hadn't been fully processed yet as the template? Wouldbe good to mention if that's the reason. Could CREATE DATABASE calculatethe checksums when it copies the files?

 * means that connected backends will change state at different times.  If
 * waiting for a barrier is done during startup, for example during replay, it
 * is important to realize that any locks held by the startup process might
 * cause deadlocks if backends end up waiting for those locks while startup
 * is waiting for a procsignalbarrier.

Don't we absorb procsignalbarriers while waiting on a lock? Or does thismean LWlocks?

 * Backends transition Bd -> Bi via a procsignalbarrier which is emitted by the
 * DataChecksumsLauncher.  When all backends have acknowledged the barrier then
 * Bd will be empty and the next phase can begin: calculating and writing data
 * checksums with DataChecksumsWorkers.  When the DataChecksumsWorker processes
 * have finished writing checksums on all pages, data checksums are enabled
 * cluster-wide via another procsignalbarrier. There are four sets of backends
 * where Bd shall be an empty set:
 *
 * Bg: Backend updating the global state and emitting the procsignalbarrier
 * Bd: Backends in "off" state
 * Be: Backends in "on" state
 * Bi: Backends in "inprogress-on" state
 *
 * Backends in Bi and Be will write checksums when modifying a page, but only
 * backends in Be will verify the checksum during reading. The Bg backend is
 * blocked waiting for all backends in Bi to process interrupts and move to
 * Be. Any backend starting while Bg is waiting on the procsignalbarrier will
 * observe the global state being "on" and will thus automatically belong to
 * Be.  Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
 * are compatible sets while still operating based on their local state as
 * both write data checksums.

Isn't "Bg" always the data checksums launcher process? In the previousparagraph before this, you didn't list the data checksums launcher as aseparate "set".

/*
 * Configuration of conditions which must match when absorbing a procsignal
 * barrier during data checksum enable/disable operations.  A single function
 * is used for absorbing all barriers, and the set of conditions to use is
 * looked up in the checksum_barriers struct.  The struct member for the target
 * state defines which state the backend must currently be in, and which it
 * must not be in.
 *
 * The reason for this explicit checking is to ensure that processing cannot
 * be started such that it breaks the assumptions of the state machine.  See
 * datachecksumsworker.c for a lengthy discussion on these states.

This is "datachecksumworker.c", aka "datachecksum_state.c", so this "Seedatachecksumworker.c" refers to itself.

 *
 * MAX_BARRIER_CONDITIONS must match largest number of sets in barrier_eq and
 * barrier_ne in the below checksum_barriers definition.
 */
#define MAX_BARRIER_CONDITIONS 2
typedef struct ChecksumBarrierCondition
{
        /* The target state of the barrier */
        int                     target;
        /* A set of states in which at least one MUST match the current state */
        int                     barrier_eq[MAX_BARRIER_CONDITIONS];
        /* The number of elements in the barrier_eq set */
        int                     barrier_eq_sz;
        /* A set of states which all MUST NOT match the current state */
        int                     barrier_ne[MAX_BARRIER_CONDITIONS];
        /* The number of elements in the barrier_ne set */
        int                     barrier_ne_sz;
} ChecksumBarrierCondition;

I don't understand the logic here. Why do you need both 'barrier_eq' and'barrier_ne'? Wouldn't it be enough to just list the allowed states, andeverything else is not allowed?

static const ChecksumBarrierCondition checksum_barriers[4] =
{
        {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, 
PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
        {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, 
{0}, 0},
        {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {0}, 0, 
{PG_DATA_CHECKSUM_VERSION}, 1},
        {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 
1, {0}, 0},
};

This is pretty jarring to read. Maybe use C99 designated initializers,or at least split over multiple lines. Or a plain array of of allowedtransitions, (current, target) -> bool. Or a function with switch-casestatement.

        /*
         * If the postmaster crashed we cannot end up with a processed database 
so
         * we have no alternative other than exiting. When enabling checksums we
         * won't at this time have changed the pg_control version to enabled so
         * when the cluster comes back up processing will have to be restarted.
         * When disabling, the pg_control version will be set to off before this
         * so when the cluster comes up checksums will be off as expected.
         */
        if (status == BGWH_POSTMASTER_DIED)
                ereport(FATAL,
                                errcode(ERRCODE_ADMIN_SHUTDOWN),
                                errmsg("cannot enable data checksums without the 
postmaster process"),
                                errhint("Restart the database and restart data 
checksum processing by calling pg_enable_data_checksums()."));

Do we reach here when disabling checksums? The comment suggests so, butin that case the hint would be wrong.

/*
 * launcher_exit
 *
 * Internal routine for cleaning up state when the launcher process exits. We
 * need to clean up the abort flag to ensure that processing started again if
 * it was previously aborted (note: started again, *not* restarted from where
 * it left off).
 */
static void
launcher_exit(int code, Datum arg)
{
        if (launcher_running)
        {
                LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
                launcher_running = false;
                DataChecksumsWorkerShmem->launcher_running = false;

                /*
                 * TODO: how to really handle the worker still running when the
                 * launcher exits?
                 */
                if (DataChecksumsWorkerShmem->worker_running)
                        ereport(LOG,
                                        errmsg("data checksums launcher exiting 
while worker is still running"));
                LWLockRelease(DataChecksumsWorkerLock);
        }
}


What "abort flag"?

        /*
         * Get a list of all databases to process. This may include databases 
that
         * were created during our runtime.  Since a database can be created as 
a
         * copy of any other database (which may not have existed in our last
         * run), we have to repeat this loop until no new databases show up in 
the
         * list.  Here the initial list for the loop processing is generated 
after
         * waiting for all existing transactions to finish to ensure that we can
         * see any database which was created even if the transaction in which 
it
         * was created started before checksums were being processed.
         */
        WaitForAllTransactionsToFinish();
        DatabaseList = BuildDatabaseList();

What if a database is dropped and another one is created with the samedatabase OID?

                /*
                 * This is the only place where we check if we are asked to 
abort, the
                 * abortion will bubble up from here. It's safe to check this 
without
                 * a lock, because if we miss it being set, we will try again 
soon.
                 */
                Assert(operation == ENABLE_DATACHECKSUMS);
                LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
                if (DataChecksumsWorkerShmem->launch_operation == 
DISABLE_DATACHECKSUMS)
                        abort_requested = true;
                LWLockRelease(DataChecksumsWorkerLock);

The comment justifies that it's safe to do this without a lock, but itgrabs the lock anyway.

@@ -107,7 +107,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int 
flags, bool *checksum_fail
         */
        if (!PageIsNew(page))
        {
-               if (DataChecksumsEnabled())
+               HOLD_INTERRUPTS();
+               if (DataChecksumsNeedVerify())
                {
                        checksum = pg_checksum_page(page, blkno);

@@ -118,6 +119,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail

                                        *checksum_failure_p = true;
                        }
                }
+               RESUME_INTERRUPTS();

/*

                 * The following checks don't prove the header is correct, only 
that

This is to ensure that we don't absorb a barrier while calculating thechecksum, which could transition us from "on" to "off" ? This isprobably not really needed because there are no CHECK_FOR_INTERRUPTS()calls in pg_checksum_page(), but better safe than sorry I guess. Anotherapproach would be to check DataChecksumsNeedVerify() again if thechecksum didn't match. In any case, a comment would be in order.


s/datachecksumsworker.c/datachecksum_state.c/

DataChecksumsOnInProgress() and DataChecksumsOffInProgress() are unused.

Attached are a few more trivial fixes, in patch format.

- Heikki

commit 2c2ecde0e10577b7ae3d671aa5b2331345887253
Author: Heikki Linnakangas <[email protected]>
Date:   Tue Mar 17 13:23:45 2026 +0200

    fix typos

diff --git a/src/backend/postmaster/datachecksum_state.c 
b/src/backend/postmaster/datachecksum_state.c
index 2384be7f657..3f3b11bb721 100644
--- a/src/backend/postmaster/datachecksum_state.c
+++ b/src/backend/postmaster/datachecksum_state.c
@@ -4,7 +4,7 @@
  *       Background worker for enabling or disabling data checksums online as
  *       well as functionality for manipulating data checksum state
  *
- * When enabling data checksums on a database at initdb time or when shut down
+ * When enabling data checksums on a cluster at initdb time or when shut down
  * with pg_checksums, no extra process is required as each page is checksummed,
  * and verified, when accessed.  When enabling checksums on an already running
  * cluster, this worker will ensure that all pages are checksummed before
@@ -21,7 +21,7 @@
  * "inprogress-on" which signals that write operations MUST compute and write
  * the checksum on the data page, but during reading the checksum SHALL NOT be
  * verified. This ensures that all objects created during when checksums are
- * being enabled will have checksums set, but reads wont fail due to missing or
+ * being enabled will have checksums set, but reads won't fail due to missing 
or
  * invalid checksums. Invalid checksums can be present in case the cluster had
  * checksums enabled, then disabled them and updated the page while they were
  * disabled.
@@ -32,7 +32,7 @@
  * the list, processing will end.  All databases MUST BE successfully processed
  * in order for data checksums to be enabled, the only exception are databases
  * which are dropped before having been processed.
-
+ *
  * Any new relation in a processed database, created during processing, will
  * see the in-progress state and will automatically be checksummed.
  *
@@ -46,13 +46,13 @@
  * 2. Disabling checksums
  * ----------------------
  * When disabling checksums, data_checksums will be set to "inprogress-off"
- * which signals that checksums are written but no longer verified. This ensure
- * that backends which have yet to move from the "on" state will still be able
- * to process data checksum validation.
+ * which signals that checksums are written but no longer need to be verified.
+ * This ensures that backends which have not yet transitioned to the
+ * "inprogress-off" state will still see valid checksums on pages.
  *
  * 3. Synchronization and Correctness
  * ----------------------------------
- * The processes involved in enabling, or disabling, data checksums in an
+ * The processes involved in enabling or disabling data checksums in an
  * online cluster must be properly synchronized with the normal backends
  * serving concurrent queries to ensure correctness. Correctness is defined
  * as the following:
@@ -74,8 +74,8 @@
  * latter with ensuring that any concurrent activity cannot break the data
  * checksum contract during processing.
  *
- * Synchronizing the state change is done with procsignal barriers, before
- * updating the controlfile with the state all other backends must absorb the
+ * Synchronizing the state change is done with procsignal barriers. Before
+ * updating the data_checksums state in the control file, all other backends 
must absorb the
  * barrier.  Barrier absorption will happen during interrupt processing, which
  * means that connected backends will change state at different times.  If
  * waiting for a barrier is done during startup, for example during replay, it
@@ -644,7 +644,7 @@ ProcessSingleRelationFork(Relation reln, ForkNumber 
forkNum, BufferAccessStrateg
                return false;
 
        /* Report the current relation to pgstat_activity */
-       snprintf(activity, sizeof(activity) - 1, "processing: %s.%s (%s, 
%dblocks)",
+       snprintf(activity, sizeof(activity) - 1, "processing: %s.%s (%s, %u 
blocks)",
                         relns, RelationGetRelationName(reln), 
forkNames[forkNum], numblocks);
        pgstat_report_activity(STATE_RUNNING, activity);
        pgstat_progress_update_param(PROGRESS_DATACHECKSUMS_BLOCKS_TOTAL, 
numblocks);
@@ -658,7 +658,7 @@ ProcessSingleRelationFork(Relation reln, ForkNumber 
forkNum, BufferAccessStrateg
        {
                Buffer          buf = ReadBufferExtended(reln, forkNum, blknum, 
RBM_NORMAL, strategy);
 
-               /* Need to get an exclusive lock before we can flag as dirty */
+               /* Need to get an exclusive lock to mark the buffer as dirty */
                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
                /*
@@ -666,7 +666,7 @@ ProcessSingleRelationFork(Relation reln, ForkNumber 
forkNum, BufferAccessStrateg
                 * re-write the page to WAL even if the checksum hasn't changed,
                 * because if there is a replica it might have a slightly 
different
                 * version of the page with an invalid checksum, caused by 
unlogged
-                * changes (e.g. hintbits) on the master happening while 
checksums
+                * changes (e.g. hintbits) on the primary happening while 
checksums
                 * were off. This can happen if there was a valid checksum on 
the page
                 * at one point in the past, so only when checksums are first 
on, then
                 * off, and then turned on again.  TODO: investigate if this 
could be

Re: Changing the state of data checksums in a running cluster

Reply via email to