Hi Tomas, Thanks for these test patches. They are quite useful. I've been going through them because we think they'll also help with the TAP tests for the online shared_buffers resize work , where I'm ending up reusing several of the same helpers (attach_injection_points, background_rw_pgbench, and a couple of others). Would you be open to lifting these helpers out into a shared Perl module (for example under src/test/perl/PostgreSQL/Test/…) so both test suites can consume them instead of each copy-pasting? I'm happy to prepare a small prep patch on top of your latest version if that sounds reasonable to you.
Thanks, Palak On Mon, 4 May 2026 at 18:47, Tomas Vondra <[email protected]> wrote: > Hi, > > Thanks for getting this feature pushed, and for resolving the failures > reported since the feature freeze. I consider this to be an important > improvement, not just for the feature itself, but also because of all > the useful infrastructure it added. > > Attached is a refined version of the TAP tests already posted by Daniel > some time ago [1]. Unfortunately, that .txt did not apply cleanly for > some reason, so here's a better version. > > I found these tests quite useful when reasoning about how the patch > behaves in concurrent environment (e.g. with multiple sessions > triggering checksum enable/disable, or with a checkpoint, crashes, etc). > > At this point all the tests pass, but there are a couple cases with > correct but slightly surprising behavior, worth discussing. Which is > what this e-mail is going to be about. > > I'll explain what the TAP tests aim to do first, and then discuss the > slightly surprising behavior. > > It's not meant for inclusion into PG19, at least not in this shape - I > wrote those TAP tests while investigating some of the earlier failures > and/or when wondering about behavior in various situations (sequence of > concurrent steps, race conditions, ...). So it's more of an exhaustive, > and the tests are somewhat redundant (N+1 is often just (N + some small > tweak)). > > I can imagine distilling it into a tiny subset, and adding that. But > that's up to discussion. But that's for later. > > > Let me briefly explain what the various TAP tests aim to do. From the > very beginning, my main concern regarding this patch was race conditions > when updating the shared state about effective data_checksum_version. > Because the state is effectively split into about three or four places: > > * LocalDataChecksumVersion (local cache) > * XLogCtl->data_checksum_version (XLogCtl->info_lck) > * ControlFile->data_checksum_version (ControlFileLock) > * state in control file on disk > > These pieces are protected by different locks, the protocol for updating > and/or reading the various flags is not trivial (and some of the fixed > issues were due to ControlFile->data_checksum_version being updated from > a place that shouldn't have). > > So the primary goal of the TAP tests was to check for race conditions by > leveraging injection points to step through concurrent processes in a > deterministic way. The first couple patches (0001-0004) add debug > logging and injection points into a lot of places. And by "a lot" I mean > ~80 new injection points, which is about the number of injection points > we have in master now. Anyway, this allows stepping through concurrent > checksum changes, and also checksum change vs. checkpointer. > > Then come the actual TAP tests: > > 1) 0005-TAP-10-concurrent-checksum-changes.patch > > Two concurrent checksum changes. The first one gets paused at an > injection point, then the second one gets initiated. > > 2) 0006-TAP-11-concurrency-with-checkpoints.patch > > A checksum change + checkpoint. The change gets paused at an injection > point, a synchronous checkpoint is performed. > > 3) 0007-TAP-12-crashes-at-injection-points.patch > > Similar to 0006, but with a crash + recovery. A checksum change gets > paused at an injection point, a synchronous checkpoint is performed. The > changes gets wpken up and either completes, or pauses on a different > injection point. A restart/crash happens. > > 4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch > > Similar to 0007, but the checkpoint is not synchronous - happens in the > background, so that the TAP can step through both sides and interleave > them in an arbitrary way. This matters because the checksum change > updates the different state pieces (XlogCtl/ControlFile), while the > checkpointer reads them to record initial state for REDO etc. > > 5) 0009-TAP-14-checkpoints-with-crashes.patch > > Similar to 0008, except that the steps are more fine grained, and > focused on two particular cases with surprisingly different final state. > > > AFAIK everything works as expected, except for two cases in the "TAP > 012" test. One for the "enabling" direction, one for the "disabling" > direction. I'm going to discuss the "enabling" direction, I believe the > other case is just a mirror with the same root cause. > > The TAP 012 tests checksum change with a concurrent checkpoint, followed > by a crash, and tests the final state. It pauses the change at an > injection point, does a checkpoint, proceeds to the next injection > point, crashes and does recovery. The expectation is that the final > state "flips" at some injection point, once it gets further enough, and > stays there. But what actually happens is this: > > a) test_checksum_transition( > 'disabled', 'enable', undef, > 'datachecksums-enable-inprogress-checksums-end', > 'datachecksums-enable-checksums-start', > 'off'); > > b) test_checksum_transition( > 'disabled', 'enable', undef, > 'datachecksums-enable-checksums-start', > 'datachecksums-enable-checksums-after-xlog', > 'on'); > > c) test_checksum_transition( > 'disabled', 'enable', 'datachecksums-enable-checksums-start', > 'datachecksums-enable-checksums-after-xlogs', > 'datachecksums-enable-checksums-after-xlogctl', > 'off'); > > This says that if the checkpoint happens after > 'datachecksums-enable-inprogress-checksums-end' or after > 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e. > enabling checksums fails). > > But if the checkpoint happens after > 'datachecksums-enable-checksums-start', we end up with "on" (after > recovery). > > This is a bit surprising, because that injection point is before > 'datachecksums-enable-checksums-after-xlog'. So the enabling process > gets further and further, but the final state flips off -> on -> off, > contradicting the expectation that it changes once. > > I haven't quite wrapped my head around it yet, but my understanding is > this is due to a race condition between the checksum launcher (writing > XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer > (reading the shmem state and generating REDO). > > The launcher does this sequence of steps: > > 1) write XLOG2_CHECKSUMS with new state > 2) update XLogCtl->data_checksum_version > 3) update ControlFile->data_checksum_version > 4) UpdateControlFile() > 5) emits barrier > > while the checkpointer (CreateCheckPoint) does this: > > A) read XLogCtl->data_checksum_version (while holding insert locks) > B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version) > C) UpdateControlFile() > > The outcome depends on how exactly these two sequences interleave. For > example, this can happen: > > 1) write XLOG2_CHECKSUMS with new state > A) read XLogCtl->data_checksum_version (while holding insert locks) > B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version) > C) UpdateControlFile() > 2) update XLogCtl->data_checksum_version > 3) update ControlFile->data_checksum_version > 4) UpdateControlFile() > 5) emits barrier > > Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and > so redo won't see it), but the checkpoint will still get the old > checksum state from XLogCtl. And so the outcome is "off", per case (c). > > But it can also happen what case (b) does: > > A) read XLogCtl->data_checksum_version (while holding insert locks) > B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version) > C) UpdateControlFile() > 1) write XLOG2_CHECKSUMS with new state > 2) update XLogCtl->data_checksum_version > 3) update ControlFile->data_checksum_version > 4) UpdateControlFile() > 5) emits barrier > > In which case the REDO will have the old state, but the recovery will > read the XLOG2_CHECKSUMS, and so end up with "on". > > This is the root cause of the surprising behavior in TAP 012, I think. > > I attempted to trigger these race conditions in TAP 013, but without > much success. In the end I realized it probably needs more control, > waiting for the other process to hit the next injection point before > unpausing the current one. TAP 014 does that, and it shows that with the > right interleaving of steps the (c) case can end up with both "on" and > "off" final state. > > As I said, I don't claim I fully understand this yet. But I wouldn't > call this "bug" - AFAICS it won't produce an incorrect final state (I > haven't seen any such cases). > > Still, I wonder if there's a potential issue I failed to notice. > > > The other question I had when looking at this (concurrency with > checkpoints) is what we get by doing > > MyProc->delayChkptFlags |= DELAY_CHKPT_START; > > whenever updating the state in SetDataChecksums... functions. Because > the only thing that guarantees is the updates happen on one side of the > checkpoint record. What does that give us, actually? > > It does not seem to prevent this surprising behavior, and it does not > say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO. > > > regards > > [1] > > https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se > > -- > Tomas Vondra >
