On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <and...@anarazel.de> wrote:
> > The failure has to be happening in wait_for_postmaster_promote(), because
> > the standby2 is actually successfully promoted.
>
> I assume this is ext4.  Presumably anything that reads the
> controlfile, like pg_ctl, pg_checksums, pg_resetwal,
> pg_control_system(), ... by reading without interlocking against
> writes could see garbage.  I have lost track of the versions and the
> thread, but I worked out at some point by experimentation that this
> only started relatively recently for concurrent read() and write(),
> but always happened with concurrent pread() and pwrite().  The control
> file uses the non-p variants which didn't mash old/new data like
> grated cheese under concurrency due to some implementation detail, but
> now does.

As for what to do about it, some ideas:

1.  Use advisory range locking.  (This would be an advisory version of
what many other filesystems do automatically, AFAIK.  Does Windows
have a thing like POSIX file locking, or need it here?)
2.  Retry after a short time on checksum failure.  The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
3.  Some scheme that involves renaming the file into place.  (That
might be a pain on Windows; it only works for the relmap thing because
all readers and writers are in the backend and use an LWLock to avoid
silly handle semantics.)
4.  ???

First thought is that 2 is the appropriate level of complexity for this
rare and stupid problem.
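
To make idea 2 concrete, here is a rough sketch of the retry loop.  It
is not the real control file code: DemoControl, demo_crc32c(),
demo_set_checksum(), demo_read_once() and demo_read_with_retry() are
all made-up names, the struct is a toy layout, and the bitwise CRC-32C
is only a stand-in for the real checksum routine.  The attempt count
and delay are parameters, so "thrice 100ms apart" would be (3, 100).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() under strict -std modes */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical toy layout: payload fields followed by a checksum. */
typedef struct DemoControl
{
    uint64_t system_identifier;
    uint32_t state;
    uint32_t checksum;          /* covers every byte before this field */
} DemoControl;

/* Plain bitwise CRC-32C, standing in for the real checksum code. */
static uint32_t
demo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Stamp the checksum over everything that precedes the checksum field. */
static void
demo_set_checksum(DemoControl *c)
{
    c->checksum = demo_crc32c(c, offsetof(DemoControl, checksum));
}

/* One unlocked read; returns true only if the checksum matches. */
static bool
demo_read_once(const char *path, DemoControl *out)
{
    int     fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return false;
    n = read(fd, out, sizeof(*out));
    close(fd);
    if (n != (ssize_t) sizeof(*out))
        return false;
    return out->checksum == demo_crc32c(out, offsetof(DemoControl, checksum));
}

/*
 * Try up to max_attempts reads, sleeping delay_ms between attempts, on
 * the theory that a torn read caused by a concurrent write is transient.
 * For brevity every failure (including open() errors) is retried; real
 * code would presumably retry only on a checksum mismatch.
 */
static bool
demo_read_with_retry(const char *path, DemoControl *out,
                     int max_attempts, long delay_ms)
{
    assert(max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (attempt > 0)
        {
            struct timespec ts = {delay_ms / 1000,
                                  (delay_ms % 1000) * 1000000L};

            nanosleep(&ts, NULL);
        }
        if (demo_read_once(path, out))
            return true;
    }
    return false;
}
```

A caller would do something like demo_read_with_retry(path, &control,
3, 100) and only complain about a corrupt file once every attempt has
failed.  A file that is really corrupt still fails, just 200ms later.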