Re: Changing the state of data checksums in a running cluster

SATYANARAYANA NARLAPURAM Tue, 28 Apr 2026 13:26:00 -0700

Hi,

On Thu, Apr 23, 2026 at 3:16 AM Daniel Gustafsson <[email protected]> wrote:


> While performing extensive post-commit testing using the built-in TAP
> tests in
> checksum_extended mode, Tomas observed a couple issues.  The testing was
> done
> on multiple machines, ranging from a fast x86 machine to a much slower
> rpi4,
> but they all performed a very large number of iterations.  The combination
> of
> hardware and the large number of tests executed is likely why some of the
> issues went unnoticed.
>
> Below are the issues described in more detail:
>
> 1) race condition between checkpoint and checksum state changes
>
> The main issue is a race condition causing invalid state transitions.
> When analyzing the failure (which we couldn't synthetically reproduce, only
> observe by running tests over and over for a long time) we realized that
> the
> original coding was performing checksum state transitions while replaying
> online checkpoints as well as redo checkpoints.
>
> Updating checksum state while replaying online checkpoints is incorrect.
> After
> a crash we'll start with the REDO record, which initializes checksums to
> the
> right state.  And then later the checksum state is updated by the regular
> XLOG_CHECKSUMS entries.
>
> Moreover, the checksum state in the XLOG_CHECKPOINT_ONLINE record can be
> stale,
> because the value is determined at the checkpoint start, but the WAL entry
> is added at the end.  If there is a concurrent checksum change, the value
> written to the WAL record will be stale.  Replaying it will cause an
> "invalid
> transition" error later, during the next checksum state update. The fix is
> to
> remove the checksum update from the online checkpoint altogether.
>
> In fact, there's no need for CreateCheckPoint to update checksum state
> in the ControlFile. If the value is stale, it could make it permanent.
> But it's unnecessary - the ControlFile is updated by the process
> performing the checksum state change. So remove that.
>
> This, along with an re-ordering updating the controlfile and procsignal
> barrier
> enission fixes the issue.  The re-ordering makes sure that the controlfile
> is
> always updated *before* a procsignalbarrier is emitted to avoid a race
> like the
> one described below:
>
> 1. A barrier for off to inprogress-on is emitted
> 2. All active backends absorbs the barrier
>   - All processes in the cluster are in state inprogress-on
> 3. A new backend b' forks, reads controlfile and sets state of off
> 4. The controlfile is updated
> 5. A new backend b'' forks, reads controlfile and sets state to
> inprogress-on
>
> b' and b'' have different states, and b' has an incompatible state with the
> rest of the cluster.  Re-ordering as done in the attached makes this go
> away.
>
>
> 2) race condition in launcher exit
>
> Another timing related issue was that reverting to the "off" state then
> launcher errors out had synchronization logic which was racy as it was
> relying
> on the cached checksum state and not the value from XLogCtl.  The logic for
> determining if a launcher/worker was already active was also fragile as it
> started another launcher which would overwrite certain data in shared
> memory.
> The attached patch inspects shared memory instead and use that to signal
> the
> running launcher to either abort (disable), or change cost parameters on a
> running enable process.  These fixes makes erroring out and going to off
> state
> stable.
>
>
> 3) Concurrency issue with ProcSignalInit / InitLocalDataChecksumState
>
> The checksum barriers must not be consumed before the initial value gets
> properly set.  On very slow systems, there could potentially be multiple
> checksum state transitions between a fork and InitLocalDataChecksumState.
> In
> such cases we might get failures due to incorrect transitions.
>
> With the current code this is not a live issue, as there is no place
> checking
> interrupts in between the two functions.  But it's fragile, as it's
> trivial to
> break this by adding an elog() somewhere.  Which is what happened to us
> while
> debugging the other issues.  So better to explicitly hold interrupts for a
> brief moment.
>
> To find the issues, and to validate their fix, Tomas developed a new
> testsuite
> which is attached as a .txt.  This is not proposed for adding to v19, it is
> included to showcase what was done, and what will be further hacked on for
> a
> new suite during the v20 cycle.  It is gated behind PG_TEST_EXTRA and is
> intended to be executed by select members of the buildfarm.
>
> As part of this postcommit review we also found a few more cleanups and
> smaller
> fixes which are included.  The patchset also includes the patch submitted
> upthread by Satyanarayana Narlapuram.
>
> More details can be found in the individual commit messages.
>
> Unless there are objections I would like to go ahead with this fixup
> fairly soon.
>

All the patches applied cleanly and the tests passing. A few minor comments:


In SetDataChecksumsOff, stale comment, only INPROGRESS_OFF can reach the
else section
/*
* Ending up here implies that the checksums state is "inprogress-on"
* or "inprogress-off" and we can transition directly to "off" from
* there.
*/
SpinLockRelease(&XLogCtl->info_lck);


Should we update ControlFile->data_checksum_version at the end-of-recovery?
If not, the disk
is stale compared to in memory until the next checkpoint. The code two
lines below updates
the control file anyways to set the DB_IN_PRODUCTION. Maybe combine the
update with that?
It's no big deal if we don't do it because it will be self correct but
tools like pg_controldata
give stale value for some time.


Thanks,
Satya

Re: Changing the state of data checksums in a running cluster

Reply via email to