Hi,

On 2026-03-12 22:54:25 +0100, Tomas Vondra wrote:
> >> But the crucial part is an ability to verify correctness.
> > 
> > Unfortunately I don't think we have a particularly good infrastructure for
> > detecting problems right now :(
> > 
> > 
> > The verification ideas I posted above would, I think, help to detect some
> > issues, but I don't think they'd catch more complicated things.
> > 
> > 
> > We have wal_consistency_checking, but as it runs just after the redo routine,
> > it doesn't catch problems like not including things in the WAL record or
> > covering related actions in two WAL records that could be separated by a
> > checkpoint.  We really should have a tool that compares a primary and a
> > replica after doing recovery using the masking infrastructure from
> > wal_consistency_checking.
> > 
> 
> Silly idea - wouldn't it be possible to detect this particular issue
> solely based on WAL? We could even have an off-line tool that reads WAL
> and checks that we have all FPIs for all the changes.

Afaict the problem is that you can't easily see that from the WAL in case of
bugs. E.g. for the case at hand, we didn't include block references for the
cleared VM pages in the WAL record. So such a tool would need to know about
the XLH_INSERT_ALL_VISIBLE_CLEARED flag etc.
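
To make that a bit more concrete, here's a rough sketch of what such an
off-line checker would have to look like. This is a toy Python model, not
PostgreSQL code: the record layout, implied_blocks() and check_fpi_coverage()
are made up for illustration, only the flag name comes from the real heap
insert record. The rule "first touch after a checkpoint needs an FPI" is also
simplified.

```python
from dataclasses import dataclass, field

# The flag name is from the heap insert WAL record; its value here and
# everything else in this sketch is a simplified, hypothetical model.
XLH_INSERT_ALL_VISIBLE_CLEARED = 1 << 0


@dataclass
class WalRecord:
    lsn: int
    kind: str                      # e.g. "heap_insert" or "checkpoint"
    flags: int = 0
    # Block references registered in the record: {(fork, blkno): has_fpi}
    blocks: dict = field(default_factory=dict)


def implied_blocks(rec):
    """Blocks the record modifies without necessarily registering them.

    This is the per-record-type knowledge that cannot be derived from the
    generic WAL format. The heap-block to VM-block mapping is faked.
    """
    if rec.kind == "heap_insert" and rec.flags & XLH_INSERT_ALL_VISIBLE_CLEARED:
        return [("vm", 0)]
    return []


def check_fpi_coverage(records):
    """Return (lsn, block) pairs first touched after a checkpoint without an FPI."""
    problems = []
    seen = set()  # blocks that already got an FPI since the last checkpoint
    for rec in records:
        if rec.kind == "checkpoint":
            seen.clear()
            continue
        touched = dict(rec.blocks)
        for blk in implied_blocks(rec):
            touched.setdefault(blk, False)  # modified, but not in the record
        for blk, has_fpi in touched.items():
            if has_fpi:
                seen.add(blk)
            elif blk not in seen:
                problems.append((rec.lsn, blk))
    return problems
```

Without the implied_blocks() step the buggy record looks perfectly fine, as the
heap block it does reference has its FPI. So the tool would only be as good as
its per-record knowledge, which has to be maintained separately from the redo
routines.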


> >> With the checksums it's easy enough - just verify checksums / look for
> >> checksum failures in the server log. But what would that be here?
> > 
> > Unfortunately that's not even something you can really rely on, it's
> > entirely possible to see checksum errors for the FSM without it being a bug,
> > as it's not WAL logged.
> > 
> 
> I did not mean to imply that any checksum failure is necessarily an
> outright bug. But I think this kind of issue being "normal" is a hint that
> not WAL-logging the FSM may no longer be such a good design choice.

I agree we'll eventually have to fix this, I just don't entirely know how.
Perhaps it'd be good enough to essentially treat FSM updates like hint bits,
i.e. WAL-log the first modification of an FSM page after each checkpoint, if
data checksums or wal_log_hints are enabled.
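
Roughly, the rule I have in mind would look like the following. Again just a
toy Python model under my own assumptions (ToyWal, Page and update_fsm_page()
are hypothetical names, LSNs are plain integers), not a proposal for the actual
implementation:

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0  # LSN of the last WAL record covering this page


class ToyWal:
    def __init__(self, data_checksums=False, wal_log_hints=False):
        # Hint-style changes only need WAL if one of the two is enabled.
        self.hints_need_wal = data_checksums or wal_log_hints
        self.insert_lsn = 1
        self.redo_ptr = 0  # start of the current checkpoint cycle
        self.fpis = []     # LSNs at which a full page image was emitted

    def checkpoint(self):
        # A new checkpoint cycle starts at the current insert position.
        self.redo_ptr = self.insert_lsn
        self.insert_lsn += 1

    def update_fsm_page(self, page):
        """Apply an otherwise unlogged FSM change; return True if an FPI was emitted."""
        if self.hints_need_wal and page.lsn <= self.redo_ptr:
            # First modification since the last checkpoint: log a full page
            # image so a torn write of this page can be repaired by replay.
            page.lsn = self.insert_lsn
            self.fpis.append(page.lsn)
            self.insert_lsn += 1
            return True
        return False
```

The cost would be at most one FPI per FSM page per checkpoint cycle, and none
at all when neither setting is enabled.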

Greetings,

Andres Freund

