Hi,

On 2026-02-07 14:38:53 +0200, Heikki Linnakangas wrote:
> On 03/02/2026 00:33, Andres Freund wrote:
> >    - Now that we use the normal order of WAL logging, we don't need to delay
> >      checkpoint starts anymore.
> >
> >      I think the explanation for why that is ok is correct [1], but it needs to
> >      be looked at by somebody with experience around this. Maybe Heikki?
>
> So that's patch 0004 "bufmgr: Switch to standard order in
> MarkBufferDirtyHint()". Yes, looks correct to me.

Thanks for checking!  Somehow I went back and forth about it being right
multiple times...


> >     /*
> >      * Update RedoRecPtr so that we can make the right decision. It's possible
> >      * that a new checkpoint will start just after GetRedoRecPtr(), but that
> >      * is ok, as the buffer is already dirty, ensuring that any BufferSync()
> >      * started after the buffer was marked dirty cannot complete without
> >      * flushing this buffer.  If a checkpoint started between marking the
> >      * buffer dirty and this check, we will emit an unnecessary WAL record (as
> >      * the buffer will be written out as part of the checkpoint), but the
> >      * window for that is small.
> >      */
> >     RedoRecPtr = GetRedoRecPtr();
>
> That "small window" is actually pretty big if you think of it a little more
> loosely. Our rule is that we write the full page image if a checkpoint has
> started since the page LSN, but that's very conservative already. It would
> be sufficient to write the full page image only if the checkpoint has
> already flushed the page. This small window is just a special case of that
> conservatism.

I mainly want to mention that window because I have to think about it when
analyzing the correctness of the approach. If the window is not mentioned, I,
at least, have to work out whether the window is dangerous in some form.


> It would be sufficient to write the full page image only if the checkpoint
> has already flushed the page.

Today that would probably not quite be sufficient, due to issues around
re-dirtying the page during checkpointer's flush (the page then needs to be
written out again, with the chance of a torn write that has no FPI to repair
it). But re-dirtying a page during the flush will soon be impossible.


I think the actual rule would need to be more complicated: we would need to
generate an FPI for the first modification after the checkpoint flush, even
though the page LSN is newer than the redo LSN, because we didn't generate one
earlier.  Otherwise we could get into a situation where there is no non-torn
on-disk page version after a later crash, I think?

Consider:

1) modify page w/ FPI
2) redo pointer determined at X
3) modify page at X+1, w/o FPI, as the page hasn't yet been flushed
4) checkpointer flushes page
5) checkpoint completes, at X+2
6) page is dirtied at X+3, w/o FPI, as X+1 > X
7) in the middle of writing out the page, we crash, the page is torn

For recovery we will replay starting from position X. We will then replay the
record from 3), which will be skipped due to the LSN. Then we will replay X+3,
which either will be skipped due to the LSN condition (if the page header
survived the torn page), leading to the changes to the "old portion" of the
torn page not being replayed, or we will replay the WAL record, applying it to
a torn page (or failing to read in the page due to checksum errors).
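
To make the two failure modes concrete, here is a toy model of the timeline
above. Everything in it is made up for illustration (ToyPage, ToyRecord and the
toy_* functions are not PostgreSQL code): a page has a "header half" (LSN plus
one counter) and a "tail half" (another counter), WAL records are non-idempotent
increments, and a torn write persists only one of the halves. Checksums are not
modelled, so the "fails to read the page" outcome does not show up here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Hypothetical two-part page: lsn and a share a sector, b lives in another. */
typedef struct ToyPage
{
    Lsn lsn;    /* header half */
    int a;      /* header half */
    int b;      /* tail half */
} ToyPage;

/* Hypothetical WAL record: an increment, optionally carrying a full page image. */
typedef struct ToyRecord
{
    Lsn     lsn;
    int     delta;
    bool    has_fpi;
    ToyPage fpi;
} ToyRecord;

static void
toy_apply(ToyPage *page, Lsn lsn, int delta)
{
    page->a += delta;
    page->b += delta;
    page->lsn = lsn;
}

static void
toy_redo(ToyPage *page, const ToyRecord *rec)
{
    if (rec->has_fpi)
    {
        *page = rec->fpi;       /* an FPI overwrites whatever is on disk */
        return;
    }
    if (page->lsn >= rec->lsn)
        return;                 /* page looks new enough: skip the record */
    toy_apply(page, rec->lsn, rec->delta);
}

/* Only one half of new_mem reaches disk, the other half stays old. */
static ToyPage
toy_torn_write(ToyPage old_disk, ToyPage new_mem, bool header_half_new)
{
    ToyPage torn = old_disk;

    if (header_half_new)
    {
        torn.lsn = new_mem.lsn;
        torn.a = new_mem.a;
    }
    else
        torn.b = new_mem.b;
    return torn;
}

/* Runs steps 1) to 7) plus recovery and returns the recovered page. */
ToyPage
toy_crash_and_recover(bool fpi_at_step6, bool header_half_new)
{
    const Lsn X = 100;                  /* step 2: redo pointer */
    ToyPage   mem = {X - 1, 0, 0};      /* step 1: state after the FPI */
    ToyPage   disk;
    ToyRecord wal[2];

    toy_apply(&mem, X + 1, 1);          /* step 3: no FPI */
    wal[0] = (ToyRecord) {X + 1, 1, false, mem};
    disk = mem;                         /* step 4: checkpointer flush */
                                        /* step 5: checkpoint completes */
    toy_apply(&mem, X + 3, 2);          /* step 6 */
    wal[1] = (ToyRecord) {X + 3, 2, fpi_at_step6, mem};
    assert(mem.a == 3 && mem.b == 3);   /* what recovery should reproduce */

    disk = toy_torn_write(disk, mem, header_half_new);  /* step 7 */

    for (size_t i = 0; i < 2; i++)      /* recovery starts at X */
        toy_redo(&disk, &wal[i]);
    return disk;
}
```

Without an FPI at step 6, toy_crash_and_recover(false, true) ends up with a = 3,
b = 1 (both records skipped, tail half stale), and
toy_crash_and_recover(false, false) with a = 3, b = 5 (X+3 applied on top of a
torn page). With fpi_at_step6 = true both variants recover a = 3, b = 3.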

If we only needed to think about buffers that stay in memory, we could "just"
tackle this by remembering in the BufferDesc that the page will need an FPI
during the next modification, but that doesn't help us if the page is
evicted and reread...



> I've been thinking of trying to track that more accurately for a long time,
> because it would smoothen the WAL spike when a checkpoint begins.

It'd indeed be nice to improve that. Another thing it'd be helpful for is
widening when we can write out hint bits on standbys.

If the rule were just that we can skip an FPI if the page still needs to be
written out by the checkpoint, it'd be fairly simple - we could utilize
BM_CHECKPOINT_NEEDED. But as hinted at above, I think it's a bit more
complicated.
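
Roughly, and only as a sketch: the "simple" rule could look like the second
function below. ToyBuf, TOY_BM_CHECKPOINT_NEEDED and both toy_needs_fpi_*
functions are hypothetical stand-ins, not the real bufmgr/xlog code; the first
function is my paraphrase of today's "page LSN not newer than the redo pointer"
check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the real BM_CHECKPOINT_NEEDED flag bit. */
#define TOY_BM_CHECKPOINT_NEEDED (1u << 0)

/* Hypothetical, heavily reduced buffer descriptor. */
typedef struct ToyBuf
{
    uint64_t page_lsn;
    uint32_t flags;
} ToyBuf;

/* Today's conservative rule: FPI if a checkpoint started since the page LSN. */
bool
toy_needs_fpi_current(const ToyBuf *buf, uint64_t redo_ptr)
{
    return buf->page_lsn <= redo_ptr;
}

/*
 * Hypothetical relaxed rule: skip the FPI while the running checkpoint still
 * has to write this buffer, otherwise fall back to the LSN comparison.
 */
bool
toy_needs_fpi_simple(const ToyBuf *buf, uint64_t redo_ptr)
{
    if (buf->flags & TOY_BM_CHECKPOINT_NEEDED)
        return false;
    return buf->page_lsn <= redo_ptr;
}
```

With the redo pointer at X = 100, the buffer at step 3) (page_lsn 99, flag set)
gets no FPI under the relaxed rule, which is the intended saving. But at step
6) (page_lsn 101, flag cleared) both functions say "no FPI", which is exactly
the hole from the example: nothing in the page LSN or the flag records that the
FPI was skipped earlier.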


Greetings,

Andres Freund

