Here's an idea I tried to explain to Andres and Simon at the pub last night, on how to reduce the spikes in the amount of WAL written at the beginning of a checkpoint that full-page writes cause. I'm just writing this down for the sake of the archives; I'm not planning to work on this myself.

When you are replaying a WAL record that lies between the Redo-pointer of a checkpoint and the checkpoint record itself, there are two possibilities:

a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in a consistent state.

In case b), you wouldn't need to replay any full-page images; normal differential WAL records would be enough. In case a), you do, and you won't be consistent until you have replayed all the WAL up to the checkpoint record.

We can exploit those properties to spread out the spike. When you modify a page and you're about to write a WAL record, check if the page has the BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page against the *previous* checkpoint's redo-pointer, instead of that of the checkpoint that's currently in progress. If no full-page image is required based on that comparison, IOW if the page was modified and a full-page image was already written after the earlier checkpoint, write a normal WAL record without a full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.
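To make the write-side rule concrete, here is a minimal sketch in C. It is not PostgreSQL code: the types, the flag bit values and the function name decide_fpi are made up for illustration, and only the flag names come from the description above. It also assumes the usual rule that a page needs a full-page image when its LSN is at or before the redo-pointer it is compared against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* simplified stand-in for a WAL position */

/* Bit values are hypothetical; only the flag names come from the proposal. */
#define BM_CHECKPOINT_NEEDED  (1u << 0)
#define BM_NEEDS_FPW          (1u << 1)
#define XLR_FPW_SKIPPED       (1u << 0)

typedef struct
{
    XLogRecPtr  page_lsn;               /* LSN of the last change to the page */
    uint32_t    buf_flags;              /* buffer header flags */
} SketchBuffer;

typedef struct
{
    bool        include_fpi;            /* attach a full-page image to this record? */
    uint32_t    rec_flags;              /* flags to set on the WAL record */
} SketchWalDecision;

/* Decide how to log a modification of 'buf' (hypothetical helper). */
static SketchWalDecision
decide_fpi(SketchBuffer *buf, XLogRecPtr prev_redo, XLogRecPtr cur_redo)
{
    SketchWalDecision d = {false, 0};

    assert(prev_redo <= cur_redo);

    if (buf->buf_flags & BM_CHECKPOINT_NEEDED)
    {
        /* Compare against the *previous* checkpoint's redo-pointer. */
        if (buf->page_lsn <= prev_redo)
            d.include_fpi = true;       /* no image since the earlier checkpoint */
        else
        {
            /* An image exists after the earlier checkpoint: defer the new one. */
            buf->buf_flags |= BM_NEEDS_FPW;
            d.rec_flags |= XLR_FPW_SKIPPED;
        }
    }
    else
    {
        /* Assumed normal rule: compare against the current redo-pointer. */
        d.include_fpi = (buf->page_lsn <= cur_redo);
    }
    return d;
}
```

For example, with prev_redo = 100 and cur_redo = 200, a page with LSN 150 and BM_CHECKPOINT_NEEDED set gets a plain record marked XLR_FPW_SKIPPED, whereas the same page without the flag would get a full-page image right away.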

When the checkpointer (or any other backend that needs to evict a buffer) is about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag set, write a new WAL record, containing a full-page image of the page, before flushing the page.
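The flush-side rule could look roughly like this. Again a toy sketch with made-up names (SketchLog, flush_buffer); the "WAL" is just a pair of counters. Clearing BM_NEEDS_FPW once the image has been logged is my assumption, the description above doesn't say when the flag is reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BM_NEEDS_FPW  (1u << 1)         /* hypothetical bit value */

typedef struct
{
    uint32_t    buf_flags;              /* buffer header flags */
    int         page_no;                /* which page this buffer holds */
} SketchBuffer;

/* Toy stand-in for the WAL and the data file: it only counts events. */
typedef struct
{
    int         fpi_records;            /* standalone full-page-image records */
    int         pages_written;          /* pages flushed to disk */
} SketchLog;

/* Flush one buffer, logging a deferred full-page image first if needed. */
static void
flush_buffer(SketchBuffer *buf, SketchLog *log)
{
    assert(buf != NULL && log != NULL);

    if (buf->buf_flags & BM_NEEDS_FPW)
    {
        /*
         * Stand-in for writing the separate WAL record that carries the
         * full-page image. A real implementation would also have to make
         * sure that record is on disk before the page itself is written.
         */
        log->fpi_records++;
        buf->buf_flags &= ~BM_NEEDS_FPW;    /* assumption: reset once logged */
    }

    log->pages_written++;               /* stand-in for the actual page write */
}
```

Because the image is only logged when the page is actually flushed, the extra WAL is paced by the checkpointer's own write rate rather than arriving all at once right after the redo-pointer.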

Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't replay that record at all. It's OK because we know that there will be a separate record containing the full-page image of the page later in the stream.

b) You are continuing WAL replay that started from an earlier checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it normally. It's OK, because the flag means that the page was already modified after the earlier checkpoint, and hence we must have seen a full-page image of it. When you see one of the separate WAL records containing a full-page image, ignore it.
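Put together, the two replay cases boil down to a small decision table. The sketch below uses invented names (SketchRecord, replay_action, REC_DEFERRED_FPI) and assumes the startup process knows whether it began replay at the latest checkpoint's redo-pointer; it only covers records between that redo-pointer and the checkpoint record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLR_FPW_SKIPPED  (1u << 0)      /* hypothetical bit value */

typedef enum
{
    REC_NORMAL,                         /* ordinary WAL record */
    REC_DEFERRED_FPI                    /* separate record holding a full-page image */
} SketchRecKind;

typedef enum
{
    ACT_REPLAY,
    ACT_SKIP
} SketchAction;

typedef struct
{
    SketchRecKind kind;
    uint32_t      flags;
} SketchRecord;

/*
 * started_at_latest_redo is true in case a) and false in case b).
 */
static SketchAction
replay_action(const SketchRecord *rec, bool started_at_latest_redo)
{
    assert(rec != NULL);

    if (rec->kind == REC_DEFERRED_FPI)
        /* a) restore the image; b) the page is already good, ignore it */
        return started_at_latest_redo ? ACT_REPLAY : ACT_SKIP;

    if (rec->flags & XLR_FPW_SKIPPED)
        /* a) the page may be torn, wait for the image; b) replay normally */
        return started_at_latest_redo ? ACT_SKIP : ACT_REPLAY;

    return ACT_REPLAY;                  /* unmarked records behave as today */
}
```

Note that the two special record types are treated in exactly opposite ways in the two cases, which is what lets the same WAL stream serve both.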

This scheme makes the b-case behave just as if the new checkpoint was never started. The regular WAL records in the stream are identical to what they would've been if the redo-pointer pointed to the earlier checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual torn-page hazards that come with that. However, the separate FPW records that come later in the stream will fix up those pages.


Now, I'm sure there are issues with this scheme I haven't thought about, but I wanted to get this written down. Note this does not reduce the overall WAL volume - on the contrary - but it ought to reduce the spike.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
