Hi, On Thu, Apr 23, 2026 at 3:16 AM Daniel Gustafsson <[email protected]> wrote:
> While performing extensive post-commit testing using the built-in TAP > tests in > checksum_extended mode, Tomas observed a couple issues. The testing was > done > on multiple machines, ranging from a fast x86 machine to a much slower > rpi4, > but they all performed a very large number of iterations. The combination > of > hardware and the large number of tests executed is likely why some of the > issues went unnoticed. > > Below are the issues described in more detail: > > 1) race condition between checkpoint and checksum state changes > > The main issue is a race condition causing invalid state transitions. > When analyzing the failure (which we couldn't synthetically reproduce, only > observe by running tests over and over for a long time) we realized that > the > original coding was performing checksum state transitions while replaying > online checkpoints as well as redo checkpoints. > > Updating checksum state while replaying online checkpoints is incorrect. > After > a crash we'll start with the REDO record, which initializes checksums to > the > right state. And then later the checksum state is updated by the regular > XLOG_CHECKSUMS entries. > > Moreover, the checksum state in the XLOG_CHECKPOINT_ONLINE record can be > stale, > because the value is determined at the checkpoint start, but the WAL entry > is added at the end. If there is a concurrent checksum change, the value > written to the WAL record will be stale. Replaying it will cause an > "invalid > transition" error later, during the next checksum state update. The fix is > to > remove the checksum update from the online checkpoint altogether. > > In fact, there's no need for CreateCheckPoint to update checksum state > in the ControlFile. If the value is stale, it could make it permanent. > But it's unnecessary - the ControlFile is updated by the process > performing the checksum state change. So remove that. > > This, along with an re-ordering updating the controlfile and procsignal > barrier > enission fixes the issue. The re-ordering makes sure that the controlfile > is > always updated *before* a procsignalbarrier is emitted to avoid a race > like the > one described below: > > 1. A barrier for off to inprogress-on is emitted > 2. All active backends absorbs the barrier > - All processes in the cluster are in state inprogress-on > 3. A new backend b' forks, reads controlfile and sets state of off > 4. The controlfile is updated > 5. A new backend b'' forks, reads controlfile and sets state to > inprogress-on > > b' and b'' have different states, and b' has an incompatible state with the > rest of the cluster. Re-ordering as done in the attached makes this go > away. > > > 2) race condition in launcher exit > > Another timing related issue was that reverting to the "off" state then > launcher errors out had synchronization logic which was racy as it was > relying > on the cached checksum state and not the value from XLogCtl. The logic for > determining if a launcher/worker was already active was also fragile as it > started another launcher which would overwrite certain data in shared > memory. > The attached patch inspects shared memory instead and use that to signal > the > running launcher to either abort (disable), or change cost parameters on a > running enable process. These fixes makes erroring out and going to off > state > stable. > > > 3) Concurrency issue with ProcSignalInit / InitLocalDataChecksumState > > The checksum barriers must not be consumed before the initial value gets > properly set. On very slow systems, there could potentially be multiple > checksum state transitions between a fork and InitLocalDataChecksumState. > In > such cases we might get failures due to incorrect transitions. > > With the current code this is not a live issue, as there is no place > checking > interrupts in between the two functions. But it's fragile, as it's > trivial to > break this by adding an elog() somewhere. Which is what happened to us > while > debugging the other issues. So better to explicitly hold interrupts for a > brief moment. > > To find the issues, and to validate their fix, Tomas developed a new > testsuite > which is attached as a .txt. This is not proposed for adding to v19, it is > included to showcase what was done, and what will be further hacked on for > a > new suite during the v20 cycle. It is gated behind PG_TEST_EXTRA and is > intended to be executed by select members of the buildfarm. > > As part of this postcommit review we also found a few more cleanups and > smaller > fixes which are included. The patchset also includes the patch submitted > upthread by Satyanarayana Narlapuram. > > More details can be found in the individual commit messages. > > Unless there are objections I would like to go ahead with this fixup > fairly soon. > All the patches applied cleanly and the tests passing. A few minor comments: In SetDataChecksumsOff, stale comment, only INPROGRESS_OFF can reach the else section /* * Ending up here implies that the checksums state is "inprogress-on" * or "inprogress-off" and we can transition directly to "off" from * there. */ SpinLockRelease(&XLogCtl->info_lck); Should we update ControlFile->data_checksum_version at the end-of-recovery? If not, the disk is stale compared to in memory until the next checkpoint. The code two lines below updates the control file anyways to set the DB_IN_PRODUCTION. Maybe combine the update with that? It's no big deal if we don't do it because it will be self correct but tools like pg_controldata give stale value for some time. Thanks, Satya
