Hi,

On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
> > I've been working on a fix for this (still pretty raw though).  What a mess.
> >
> >
> > As part of validating that I added error checks that disallow reads in the
> > startup process for anything other than the FSM fork.  Which triggered, due 
> > to
> > visibilitymap_prepare_truncate() reading a VM page.
> >
> > Which in turn made me wonder: Is it actually OK for
> > visibilitymap_prepare_truncate() to do separate WAL logging from
> > XLOG_SMGR_TRUNCATE?
> >
> > I suspect not, consider this scenario:
> >
> > 1) primary does XLOG_FPI for the VM page
> > 2) checkpoint
> > 3) primary does XLOG_SMGR_TRUNCATE
> > 4) standby replays XLOG_FPI
> > 5) primary performs a restartpoint
> > 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
> >    visibilitymap_prepare_truncate, which dirties the VM page
> > 7) standby crashes while writing out the VM page
> > 8) standby does recovery and now finds a torn VM page, triggering a checksum
> >    failure
> >
> > Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd 
> > just
> > wipe out that region of the VM?
> >
>
> Shouldn't it be possible to test this using injection points?

Not trivially, I think?  For one, you're not normally going to get a torn
page, even if you kill the server in the right moment, so you're rarely going
to be able to reproduce checksum errors without further infrastructure.  But
also, since the failure behaviour is going to be to clear the VM, you'd not
really see bad consequences, I think?


I think we really need some logic during WAL replay that does more
verification of pages that are read in (i.e. do not have an FPI applied) by
the startup process.

- Direct uses XLogReadBufferExtended() should only be allowed with
  ZERO_ON_ERROR, as it has no provisions for applying FPIs.

- Unless FPWs are disabled or ZERO_ON_ERROR is used:

   - the read in page must have an LSN newer than the last REDO LSN, otherwise
     there should have been an FPI

   - the page LSN may not have an LSN from the future (i.e BLK_DONE
     should never happen with FPWs)

- all FPIs in the record have been applied



A slightly different, but related, thing that we are missing is infrastructure
to detect pages that have been dirtied without associated WAL logging.  I
wonder if we could use a bit in the buffer descriptor to track if a page has
been dirtied by anything other than a hint bit write.  Or maybe we could
detect it when unlocking the page.


> > Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> > only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and 
> > forth
> > in my head about danger scenarios involving PITR or crashes inbetween
> > prepare_truncate() and the actual truncation.
> >
> >
> > ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> > and always WAL log the prepare_truncate() on the primary.  Does that sound
> > sane?
> >
>
> I'm considering writing a stress test to try to shake out issues in
> this.

We could certainly use more of that.


> That strategy was pretty successful for the online checksums patch, where we
> had a hunch it might not be quite correct- but it was hard to come up with a
> reproducer. The stress test failures were very useful in that they gave us
> proof of issues and some examples.


> But the crucial part is an ability to verify correctness.

Unfortunately I don't think we have a particularly good infrastructure for
detecting problems right now :(


The verification ideas I posted above would, I think, help to detect some
issues, but I don't think they'd catch more complicated things.


We have wal_consistency_checking, but as it runs just after the redo routine,
it doesn't catch problems like not including things in the WAL record or
covering related actions in two WAL records that could be separated by a
checkpoint.  We really should have a tool that compares a primary and a
replica after doing recovery using the masking infrastructure from
wal_consistency_checking.


> With the checksums it's easy enough - just verify checksums / look for
> checksum failures in the server log. But what would that be here?

Unfortunately that's not even something you can really can rely on, it's
entirely possible to see checksum errors for the FSM without it being a bug,
as it's not WAL logged.

Greetings,

Andres Freund


Reply via email to