On 3/12/26 22:26, Andres Freund wrote:
> Hi,
> 
> On 2026-03-12 20:27:32 +0100, Tomas Vondra wrote:
>>> I've been working on a fix for this (still pretty raw though).  What a mess.
>>>
>>>
>>> As part of validating that, I added error checks that disallow reads in the
>>> startup process for anything other than the FSM fork.  Which triggered, due
>>> to visibilitymap_prepare_truncate() reading a VM page.
>>>
>>> Which in turn made me wonder: Is it actually OK for
>>> visibilitymap_prepare_truncate() to do separate WAL logging from
>>> XLOG_SMGR_TRUNCATE?
>>>
>>> I suspect not, consider this scenario:
>>>
>>> 1) primary does XLOG_FPI for the VM page
>>> 2) checkpoint
>>> 3) primary does XLOG_SMGR_TRUNCATE
>>> 4) standby replays XLOG_FPI
>>> 5) standby performs a restartpoint
>>> 6) standby replays XLOG_SMGR_TRUNCATE, as part of that it again does
>>>    visibilitymap_prepare_truncate(), which dirties the VM page
>>> 7) standby crashes while writing out the VM page
>>> 8) standby does recovery and now finds a torn VM page, triggering a checksum
>>>    failure
>>>
>>> Perhaps that's kinda OK, because vm_readbuf() uses ZERO_ON_ERROR, so we'd
>>> just wipe out that region of the VM?
>>>
>>
>> Shouldn't it be possible to test this using injection points?
> 
> Not trivially, I think?  For one, you're not normally going to get a torn
> page, even if you kill the server at the right moment, so you're rarely going
> to be able to reproduce checksum errors without further infrastructure.  But
> also, since the failure behaviour is going to be to clear the VM, you'd not
> really see bad consequences, I think?
> 
> 
> I think we really need some logic during WAL replay that does more
> verification of pages that are read in (i.e. do not have an FPI applied) by
> the startup process.
> 
> - Direct uses of XLogReadBufferExtended() should only be allowed with
>   ZERO_ON_ERROR, as it has no provisions for applying FPIs.
> 
> - Unless FPWs are disabled or ZERO_ON_ERROR is used:
> 
>    - the read-in page must have an LSN newer than the last REDO LSN,
>      otherwise there should have been an FPI
> 
>    - the page may not have an LSN from the future (i.e. BLK_DONE
>      should never happen with FPWs)
> 
> - All FPIs in the record must have been applied.
> 
> 
> 
> A slightly different, but related, thing that we are missing is infrastructure
> to detect pages that have been dirtied without associated WAL logging.  I
> wonder if we could use a bit in the buffer descriptor to track if a page has
> been dirtied by anything other than a hint bit write.  Or maybe we could
> detect it when unlocking the page.
> 
> 
>>> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
>>> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and
>>> forth in my head about danger scenarios involving PITR or crashes in
>>> between prepare_truncate() and the actual truncation.
>>>
>>>
>>> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
>>> and always WAL log the prepare_truncate() on the primary.  Does that sound
>>> sane?
>>>
>>
>> I'm considering writing a stress test to try to shake out issues in
>> this.
> 
> We could certainly use more of that.
> 
> 
>> That strategy was pretty successful for the online checksums patch, where we
>> had a hunch it might not be quite correct, but it was hard to come up with a
>> reproducer. The stress test failures were very useful in that they gave us
>> proof of issues and some examples.
> 
> 
>> But the crucial part is an ability to verify correctness.
> 
> Unfortunately I don't think we have a particularly good infrastructure for
> detecting problems right now :(
> 
> 
> The verification ideas I posted above would, I think, help to detect some
> issues, but I don't think they'd catch more complicated things.
> 
> 
> We have wal_consistency_checking, but as it runs just after the redo routine,
> it doesn't catch problems like not including things in the WAL record or
> covering related actions in two WAL records that could be separated by a
> checkpoint.  We really should have a tool that compares a primary and a
> replica after doing recovery using the masking infrastructure from
> wal_consistency_checking.
> 

Silly idea - wouldn't it be possible to detect this particular issue
solely based on the WAL? We could even have an offline tool that reads WAL
and checks that we have all FPIs for all the changes.

> 
>> With the checksums it's easy enough - just verify checksums / look for
>> checksum failures in the server log. But what would that be here?
> 
> Unfortunately that's not even something you can really rely on; it's
> entirely possible to see checksum errors for the FSM without it being a bug,
> as it's not WAL logged.
> 

I did not mean to imply that any checksum failure is necessarily an
outright bug. But I think this kind of issue being "normal" is a hint
that maybe not WAL-logging the FSM is no longer such a good design choice.

regards

-- 
Tomas Vondra


