Re: Changing the state of data checksums in a running cluster

Palak Chaturvedi Wed, 01 Jul 2026 22:06:50 -0700

Hi Tomas,

Thanks for these test patches. They are quite useful. I've been going
through them because we think they'll also help with the TAP tests for the
online shared_buffers resize work , where I'm ending up reusing several of
the same helpers (attach_injection_points, background_rw_pgbench, and a
couple of others).
Would you be open to lifting these helpers out into a shared Perl module
(for example under src/test/perl/PostgreSQL/Test/…) so both test suites can
consume them instead of each copy-pasting? I'm happy to prepare a small
prep patch on top of your latest version if that sounds reasonable to you.


Thanks,
Palak

On Mon, 4 May 2026 at 18:47, Tomas Vondra <[email protected]> wrote:

> Hi,
>
> Thanks for getting this feature pushed, and for resolving the failures
> reported since the feature freeze. I consider this to be an important
> improvement, not just for the feature itself, but also because of all
> the useful infrastructure it added.
>
> Attached is a refined version of the TAP tests already posted by Daniel
> some time ago [1]. Unfortunately, that .txt did not apply cleanly for
> some reason, so here's a better version.
>
> I found these tests quite useful when reasoning about how the patch
> behaves in concurrent environment (e.g. with multiple sessions
> triggering checksum enable/disable, or with a checkpoint, crashes, etc).
>
> At this point all the tests pass, but there are a couple cases with
> correct but slightly surprising behavior, worth discussing. Which is
> what this e-mail is going to be about.
>
> I'll explain what the TAP tests aim to do first, and then discuss the
> slightly surprising behavior.
>
> It's not meant for inclusion into PG19, at least not in this shape - I
> wrote those TAP tests while investigating some of the earlier failures
> and/or when wondering about behavior in various situations (sequence of
> concurrent steps, race conditions, ...). So it's more of an exhaustive,
> and the tests are somewhat redundant (N+1 is often just (N + some small
> tweak)).
>
> I can imagine distilling it into a tiny subset, and adding that. But
> that's up to discussion. But that's for later.
>
>
> Let me briefly explain what the various TAP tests aim to do. From the
> very beginning, my main concern regarding this patch was race conditions
> when updating the shared state about effective data_checksum_version.
> Because the state is effectively split into about three or four places:
>
> * LocalDataChecksumVersion (local cache)
> * XLogCtl->data_checksum_version (XLogCtl->info_lck)
> * ControlFile->data_checksum_version (ControlFileLock)
> * state in control file on disk
>
> These pieces are protected by different locks, the protocol for updating
> and/or reading the various flags is not trivial (and some of the fixed
> issues were due to ControlFile->data_checksum_version being updated from
> a place that shouldn't have).
>
> So the primary goal of the TAP tests was to check for race conditions by
> leveraging injection points to step through concurrent processes in a
> deterministic way. The first couple patches (0001-0004) add debug
> logging and injection points into a lot of places. And by "a lot" I mean
> ~80 new injection points, which is about the number of injection points
> we have in master now. Anyway, this allows stepping through concurrent
> checksum changes, and also checksum change vs. checkpointer.
>
> Then come the actual TAP tests:
>
> 1) 0005-TAP-10-concurrent-checksum-changes.patch
>
> Two concurrent checksum changes. The first one gets paused at an
> injection point, then the second one gets initiated.
>
> 2) 0006-TAP-11-concurrency-with-checkpoints.patch
>
> A checksum change + checkpoint. The change gets paused at an injection
> point, a synchronous checkpoint is performed.
>
> 3) 0007-TAP-12-crashes-at-injection-points.patch
>
> Similar to 0006, but with a crash + recovery. A checksum change gets
> paused at an injection point, a synchronous checkpoint is performed. The
> changes gets wpken up and either completes, or pauses on a different
> injection point. A restart/crash happens.
>
> 4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch
>
> Similar to 0007, but the checkpoint is not synchronous - happens in the
> background, so that the TAP can step through both sides and interleave
> them in an arbitrary way. This matters because the checksum change
> updates the different state pieces (XlogCtl/ControlFile), while the
> checkpointer reads them to record initial state for REDO etc.
>
> 5) 0009-TAP-14-checkpoints-with-crashes.patch
>
> Similar to 0008, except that the steps are more fine grained, and
> focused on two particular cases with surprisingly different final state.
>
>
> AFAIK everything works as expected, except for two cases in the "TAP
> 012" test. One for the "enabling" direction, one for the "disabling"
> direction. I'm going to discuss the "enabling" direction, I believe the
> other case is just a mirror with the same root cause.
>
> The TAP 012 tests checksum change with a concurrent checkpoint, followed
> by a crash, and tests the final state. It pauses the change at an
> injection point, does a checkpoint, proceeds to the next injection
> point, crashes and does recovery. The expectation is that the final
> state "flips" at some injection point, once it gets further enough, and
> stays there. But what actually happens is this:
>
> a) test_checksum_transition(
>     'disabled', 'enable', undef,
>     'datachecksums-enable-inprogress-checksums-end',
>     'datachecksums-enable-checksums-start',
>     'off');
>
> b) test_checksum_transition(
>     'disabled', 'enable', undef,
>     'datachecksums-enable-checksums-start',
>     'datachecksums-enable-checksums-after-xlog',
>     'on');
>
> c) test_checksum_transition(
>     'disabled', 'enable', 'datachecksums-enable-checksums-start',
>     'datachecksums-enable-checksums-after-xlogs',
>     'datachecksums-enable-checksums-after-xlogctl',
>     'off');
>
> This says that if the checkpoint happens after
> 'datachecksums-enable-inprogress-checksums-end' or after
> 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
> enabling checksums fails).
>
> But if the checkpoint happens after
> 'datachecksums-enable-checksums-start', we end up with "on" (after
> recovery).
>
> This is a bit surprising, because that injection point is before
> 'datachecksums-enable-checksums-after-xlog'. So the enabling process
> gets further and further, but the final state flips off -> on -> off,
> contradicting the expectation that it changes once.
>
> I haven't quite wrapped my head around it yet, but my understanding is
> this is due to a race condition between the checksum launcher (writing
> XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
> (reading the shmem state and generating REDO).
>
> The launcher does this sequence of steps:
>
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> while the checkpointer (CreateCheckPoint) does this:
>
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
>
> The outcome depends on how exactly these two sequences interleave. For
> example, this can happen:
>
> 1) write XLOG2_CHECKSUMS with new state
>   A) read XLogCtl->data_checksum_version (while holding insert locks)
>   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
>   C) UpdateControlFile()
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
> so redo won't see it), but the checkpoint will still get the old
> checksum state from XLogCtl. And so the outcome is "off", per case (c).
>
> But it can also happen what case (b) does:
>
>   A) read XLogCtl->data_checksum_version (while holding insert locks)
>   B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
>   C) UpdateControlFile()
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> In which case the REDO will have the old state, but the recovery will
> read the XLOG2_CHECKSUMS, and so end up with "on".
>
> This is the root cause of the surprising behavior in TAP 012, I think.
>
> I attempted to trigger these race conditions in TAP 013, but without
> much success. In the end I realized it probably needs more control,
> waiting for the other process to hit the next injection point before
> unpausing the current one. TAP 014 does that, and it shows that with the
> right interleaving of steps the (c) case can end up with both "on" and
> "off" final state.
>
> As I said, I don't claim I fully understand this yet. But I wouldn't
> call this "bug" - AFAICS it won't produce an incorrect final state (I
> haven't seen any such cases).
>
> Still, I wonder if there's a potential issue I failed to notice.
>
>
> The other question I had when looking at this (concurrency with
> checkpoints) is what we get by doing
>
>      MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>
> whenever updating the state in SetDataChecksums... functions. Because
> the only thing that guarantees is the updates happen on one side of the
> checkpoint record. What does that give us, actually?
>
> It does not seem to prevent this surprising behavior, and it does not
> say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
>
>
> regards
>
> [1]
>
> https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se
>
> --
> Tomas Vondra
>

Re: Changing the state of data checksums in a running cluster

Reply via email to