Hi,

I spent a bit more time looking at this today, and I figured out a simpler way to "cause trouble" by using PITR. I don't know if this has the same root cause as the failures in the 006 TAP test, but I find it interesting and I think it highlights some issues with the new patch.
The 006 test enables/disables checksums while running pgbench in the background, and randomly kills the instance in different ways (restart with fast/immediate mode). It then checks if the recovery hits any checksum errors during redo. And every now and then it sees failures, but it's tedious to investigate them, because a lot can happen between the checksum state change and the crash, and it's not clear at which point it actually got broken.

I realized I can do a simpler thing. I can enable WAL archiving, run pgbench, enable/disable checksums, etc. And then I can do PITR to different places in the WAL, possibly going record by record, and verify checksums at each of those places. That should make it much more deterministic than the "random" 006 test. And it really does.

So I did that, mostly like this:

1) setup an instance with WAL archiving enabled
2) create a basebackup
3) initialize pgbench
4) run read-write pgbench in the background
5) disable checksums
6) wait for data_checksums to change to "off"
7) stop the instance

Then I look for CHECKSUMS records in the WAL using pg_waldump, which looks something like this:

...
lsn: 0/66707368, prev 0/66707318, desc: COMMIT 2026-02-06 ...
lsn: 0/66707390, prev 0/66707368, desc: CHECKSUMS inprogress-off
lsn: 0/667073B0, prev 0/66707390, desc: LOCK xmax: 48107, off: ...
...
lsn: 0/66715238, prev 0/66715200, desc: HOT_UPDATE old_xmax: ...
lsn: 0/66715288, prev 0/66715238, desc: CHECKSUMS off
lsn: 0/667152A8, prev 0/66715288, desc: HOT_UPDATE old_xmax: ...
...

And then I do PITR to each of those LSNs (or any other interesting LSN) using this:

recovery_target_lsn = '$LSN'
recovery_target_action = 'shutdown'

Once the instance shuts down, I can verify checksums on the data directory using pg_checksums. For the LSNs listed above I get:

0/66707368 - OK
0/66707390 - OK
0/667073B0 - OK
0/66715238 - OK
0/66715288 - 16155 failures
0/667152A8 - 15948 failures
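To make this easier to repeat, the whole loop looks roughly like this. It's just a sketch: the paths, the log file name and the LSN list are placeholders, and there's no error handling.

#!/bin/bash
# Rough sketch of the per-LSN verification loop. BACKUP is a base backup
# taken before the pgbench run, ARCHIVE is the WAL archive, and the LSNs
# are the interesting ones from pg_waldump (all of these are placeholders).

BACKUP=/path/to/basebackup
ARCHIVE=/path/to/wal_archive
PGDATA=/tmp/pitr

for LSN in 0/66707368 0/66707390 0/667073B0 \
           0/66715238 0/66715288 0/667152A8; do

    # start each run from a fresh copy of the base backup
    rm -rf $PGDATA
    cp -r $BACKUP $PGDATA

    # configure recovery up to the target LSN, then shut down
    {
        echo "restore_command = 'cp $ARCHIVE/%f %p'"
        echo "recovery_target_lsn = '$LSN'"
        echo "recovery_target_action = 'shutdown'"
    } >> $PGDATA/postgresql.conf
    touch $PGDATA/recovery.signal

    # start the instance without waiting, and let it recover; it shuts
    # down on its own once it reaches the target LSN
    pg_ctl -W -D $PGDATA -l $PGDATA/recovery.log start
    while [ -f $PGDATA/postmaster.pid ]; do
        sleep 1
    done

    # verify checksums on the recovered data directory
    echo "=== $LSN ==="
    pg_checksums --check -D $PGDATA
done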
There are a couple of interesting details/questions here:

1) It seems a bit surprising that we can run pg_checksums even after the checksums flip to "off" at LSN 0/66715288. AFAICS this is a direct consequence of separating the state change from checkpoints, while checkpoints are still responsible for writing the state into the control file. But during redo we don't generate new checkpoints, so we get into a state where the control file still says "checksums on", while the data files may already contain pages without correct checksums.

FWIW the other direction (enabling checksums) can end up in a similar disagreement. The control file will still say "off" (or maybe "inprogress-on") while the in-memory state says "on". But I guess that's harmless, as it won't cause checksum failures. Or maybe it can cause some other issues, I'm not sure.

I'm not sure what to do about this. The control file is updated only lazily, but e.g. pg_checksums relies on it not being stale, or at least not being stale "too much". The last patch ensured we have a checkpoint for each state change, i.e. we can't go through both (on -> inprogress-off) and (inprogress-off -> off) within a single checkpoint interval. And that would prevent this issue, AFAIK. If we updated the control file to say "inprogress-off" at some point, pg_checksums would know not to try to verify checksums.

Maybe there are other issues, though. Having two places determining the checksum state of an instance, and allowing them to get out of sync, seems a bit tricky.

2) I don't understand how applying a single WAL record can trigger so many checksum failures.

Going from 0/66715238 to 0/66715288, which applies the XLOG_CHECKSUMS record, triggered ~16k failures. How come? That record doesn't even touch any pages, AFAICS. Similarly, applying the single HOT_UPDATE at 0/667152A8 (which per pg_waldump touches only a single block) makes ~200 failures go away.

I'm sure there is a simple explanation for this, but it's puzzling.
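In case it makes it easier for someone to look into this, the relevant records and the affected relations can be narrowed down like this (the WAL directory and the filenode are just placeholders):

# list the WAL records (and the blocks they reference) between the two LSNs
pg_waldump -p /tmp/pitr/pg_wal -s 0/66715238 -e 0/667152A8

# verify checksums for a single relation, identified by its filenode
pg_checksums --check --filenode=16396 -D /tmp/pitr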

regards

-- 
Tomas Vondra