Hi,

I spent a bit more time looking at this today, and I figured out a simpler way to "cause trouble" by using PITR. I don't know if this has the same root cause as the failures in the 006 TAP test, but I find it interesting and I think it highlights some issues with the new patch.
The 006 test enables/disables checksums while running pgbench in the background, and randomly kills the instance in different ways (restart with fast/immediate mode). It then checks if the recovery hits any checksum errors during redo. And every now and then it sees failures, but it's tedious to investigate them, because a lot can happen between the checksum state change and the crash, and it's not clear at which point it actually got broken.

I realized I can do a simpler thing. I can enable WAL archiving, run pgbench, enable/disable checksums, etc. And then I can do PITR to different places in the WAL, possibly going record by record, and verify checksums at each of those places. That should make it much more deterministic than the "random" 006 test. And it really does.

So I did that, mostly like this:

1) setup an instance with WAL archiving enabled
2) create a basebackup
3) initialize pgbench
4) run read-write pgbench in the background
5) disable checksums
6) wait for data_checksums to change to "off"
7) stop the instance

Then I look for CHECKSUMS records in the WAL using pg_waldump, which looks something like this:

...
lsn: 0/66707368, prev 0/66707318, desc: COMMIT 2026-02-06 ...
lsn: 0/66707390, prev 0/66707368, desc: CHECKSUMS inprogress-off
lsn: 0/667073B0, prev 0/66707390, desc: LOCK xmax: 48107, off: ...
...
lsn: 0/66715238, prev 0/66715200, desc: HOT_UPDATE old_xmax: ...
lsn: 0/66715288, prev 0/66715238, desc: CHECKSUMS off
lsn: 0/667152A8, prev 0/66715288, desc: HOT_UPDATE old_xmax: ...
...

And then I do PITR to each of those LSNs (or any other interesting LSN) using this:

recovery_target_lsn = '$LSN'
recovery_target_action = 'shutdown'

Once the instance shuts down, I can verify checksums on the data directory using pg_checksums. For the LSNs listed above I get:

0/66707368 - OK
0/66707390 - OK
0/667073B0 - OK
0/66715238 - OK
0/66715288 - 16155 failures
0/667152A8 - 15948 failures
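To make this easier to repeat, the whole loop looks roughly like this. It's just a sketch: the paths, the log file name and the LSN list are placeholders, and there's no error handling.

#!/bin/bash
# Rough sketch of the per-LSN verification loop. BACKUP is a base backup
# taken before the pgbench run, ARCHIVE is the WAL archive, and the LSNs
# are the interesting ones from pg_waldump (all of these are placeholders).

BACKUP=/path/to/basebackup
ARCHIVE=/path/to/wal_archive
PGDATA=/tmp/pitr

for LSN in 0/66707368 0/66707390 0/667073B0 \
           0/66715238 0/66715288 0/667152A8; do

    # start each run from a fresh copy of the base backup
    rm -rf $PGDATA
    cp -r $BACKUP $PGDATA

    # configure recovery up to the target LSN, then shut down
    {
        echo "restore_command = 'cp $ARCHIVE/%f %p'"
        echo "recovery_target_lsn = '$LSN'"
        echo "recovery_target_action = 'shutdown'"
    } >> $PGDATA/postgresql.conf
    touch $PGDATA/recovery.signal

    # start the instance without waiting, and let it recover; it shuts
    # down on its own once it reaches the target LSN
    pg_ctl -W -D $PGDATA -l $PGDATA/recovery.log start
    while [ -f $PGDATA/postmaster.pid ]; do
        sleep 1
    done

    # verify checksums on the recovered data directory
    echo "=== $LSN ==="
    pg_checksums --check -D $PGDATA
done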
There are a couple of interesting details/questions here:

1) It seems a bit surprising that we can run pg_checksums even after the checksums flip to "off" at LSN 0/66715288. AFAICS this is a direct consequence of separating the state change from checkpoints, while checkpoints are still responsible for writing the state into the control file. But during redo we don't generate new checkpoints, so we get into a state where the control file still says "checksums on", while the data files may already contain pages without correct checksums.

FWIW the other direction (enabling checksums) can end up in a similar disagreement. The control file will still say "off" (or maybe "inprogress-on") while the in-memory state says "on". But I guess that's harmless, as it won't cause checksum failures. Or maybe it can cause some other issues, I'm not sure.

I'm not sure what to do about this. The control file is updated only lazily, but e.g. pg_checksums relies on it not being stale, or at least not being stale "too much". The last patch ensured we have a checkpoint for each state change, i.e. we can't go through both (on -> inprogress-off) and (inprogress-off -> off) within a single checkpoint interval. And that would prevent this issue, AFAIK. If we updated the control file to say "inprogress-off" at some point, pg_checksums would know not to try to verify checksums.

Maybe there are other issues, though. Having two places determining the checksum state of an instance, and allowing them to get out of sync, seems a bit tricky.

2) I don't understand how applying a single WAL record can trigger so many checksum failures.

Going from 0/66715238 to 0/66715288, which applies the XLOG_CHECKSUMS record, triggered ~16k failures. How come? That record doesn't even touch any pages, AFAICS. Similarly, applying the single HOT_UPDATE at 0/667152A8 (which per pg_waldump touches only a single block) makes ~200 failures go away.

I'm sure there is a simple explanation for this, but it's puzzling.
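In case it makes it easier for someone to look into this, the relevant records and the affected relations can be narrowed down like this (the WAL directory and the filenode are just placeholders):

# list the WAL records (and the blocks they reference) between the two LSNs
pg_waldump -p /tmp/pitr/pg_wal -s 0/66715238 -e 0/667152A8

# verify checksums for a single relation, identified by its filenode
pg_checksums --check --filenode=16396 -D /tmp/pitr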

regards

-- 
Tomas Vondra