[HACKERS] Spreading full-page writes

Heikki Linnakangas Sun, 25 May 2014 14:53:26 -0700

Here's an idea I tried to explain to Andres and Simon at the pub last night, on how to reduce the spikes in the amount of WAL written at the beginning of a checkpoint that full-page writes cause. I'm just writing this down for the sake of the archives; I'm not planning to work on this myself.

When you are replaying a WAL record that lies between the Redo-pointer of a checkpoint and the checkpoint record itself, there are two possibilities:


a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in a consistent state.

In case b), you wouldn't need to replay any full-page images, normal differential WAL records would be enough. In case a), you do, and you won't be consistent until replaying all the WAL up to the checkpoint record.

We can exploit those properties to spread out the spike. When you modify a page and you're about to write a WAL record, check if the page has the BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page against the *previous* checkpoint's redo-pointer, instead of that of the checkpoint currently in progress. If no full-page image is required based on that comparison, IOW if the page was modified and a full-page image was already written after the earlier checkpoint, write a normal WAL record without a full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.
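
As a rough illustration of that decision, here is a toy Python model. It is only a sketch: the flag and record names come from the description above, but the bit values, the `Buffer` and `WalRecord` classes, the `log_page_change` function and the "page LSN <= redo pointer means a full-page image is needed" rule are all assumptions made for the example, not PostgreSQL code.

```python
from dataclasses import dataclass

# Flag names follow the mail; the bit values are arbitrary for this sketch.
BM_CHECKPOINT_NEEDED = 0x1
BM_NEEDS_FPW = 0x2  # proposed new buffer-header flag


@dataclass
class Buffer:
    page_lsn: int
    flags: int = 0


@dataclass
class WalRecord:
    lsn: int
    has_full_page_image: bool
    fpw_skipped: bool = False  # stands in for the proposed XLR_FPW_SKIPPED


def log_page_change(buf, new_lsn, current_redo, previous_redo):
    """Decide whether the WAL record for this change carries a full-page image."""
    awaiting_checkpoint = bool(buf.flags & BM_CHECKPOINT_NEEDED)
    # Pages still waiting for the in-progress checkpoint are compared
    # against the *previous* checkpoint's redo pointer.
    threshold = previous_redo if awaiting_checkpoint else current_redo
    # Assumed rule: an image is needed if the page has not been modified
    # since the redo pointer we compare against.
    needs_image = buf.page_lsn <= threshold
    skipped = awaiting_checkpoint and not needs_image
    if skipped:
        # Remember that an image must be logged before this page is flushed.
        buf.flags |= BM_NEEDS_FPW
    buf.page_lsn = new_lsn
    return WalRecord(new_lsn, needs_image, skipped)
```

With a previous redo pointer at 100 and the current one at 200, a page with LSN 150 that still has BM_CHECKPOINT_NEEDED set gets a normal record marked as skipped, while a page with LSN 50 still gets a full-page image.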

When the checkpointer (or any other backend that needs to evict a buffer) is about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag set, write a new WAL record, containing a full-page image of the page, before flushing the page.
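
The flush side could look roughly like this in the same kind of toy model. The `Buffer` class, the `flush_buffer` function, the list standing in for the WAL and the dict standing in for the data files are hypothetical; clearing BM_NEEDS_FPW once the image has been logged is also an assumption of the sketch, the description above does not spell that out.

```python
from dataclasses import dataclass

BM_NEEDS_FPW = 0x2  # proposed flag; the bit value is arbitrary here


@dataclass
class Buffer:
    page_id: int
    contents: bytes
    flags: int = 0


def flush_buffer(buf, wal, disk):
    """Write a buffer out, logging a full-page image first if one is owed."""
    if buf.flags & BM_NEEDS_FPW:
        # The image goes into the WAL before the page itself is written.
        wal.append(("FPW", buf.page_id, buf.contents))
        # Assumption: the flag is cleared once the image has been logged.
        buf.flags &= ~BM_NEEDS_FPW
    disk[buf.page_id] = buf.contents
```

Flushing the same buffer a second time without further skipped records adds nothing to the WAL, so each deferred image is paid for when the page is actually written out rather than at the first modification after the redo pointer.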


Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't replay that record at all. It's OK because we know that there will be a separate record containing the full-page image of the page later in the stream.

b) You are continuing WAL replay that started from an earlier checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it normally. It's OK, because the flag means that the page was modified after the earlier checkpoint already, and hence we must have seen a full-page image of it already. When you see one of the WAL records containing a separate full-page image, ignore it.

This scheme makes the b-case behave just as if the new checkpoint was never started. The regular WAL records in the stream are identical to what they would've been if the redo-pointer pointed to the earlier checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual torn-page hazards that come with that. However, the separate FPW records that come later in the stream will fix up those pages.
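
To make the two replay cases concrete, here is a small Python sketch. Pages are modelled as strings, a normal record appends its payload and a separate FPW record replaces the whole page; the `Rec` class, the `replay` function and the `started_at_this_redo` parameter are invented for the example and say nothing about how real redo routines work.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    page: int
    kind: str                  # "normal" or "separate_fpw"
    payload: str
    fpw_skipped: bool = False  # stands in for XLR_FPW_SKIPPED


def replay(records, pages, started_at_this_redo):
    """Apply records to pages; the flag selects case a) (True) or b) (False)."""
    for r in records:
        if r.kind == "separate_fpw":
            if started_at_this_redo:
                # Case a): the image overwrites whatever (possibly torn)
                # state the page was in.
                pages[r.page] = r.payload
            # Case b): the separate image is ignored.
        elif r.fpw_skipped and started_at_this_redo:
            # Case a): not safe to apply; a later image will cover it.
            continue
        else:
            pages[r.page] = pages.get(r.page, "") + r.payload
    return pages
```

For a stream consisting of a skipped record appending "B", a separate image "AB" and an ordinary record appending "C", replay from a consistent page "A" (case b) and replay from an arbitrary torn page (case a) both end with the page reading "ABC".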

Now, I'm sure there are issues with this scheme I haven't thought about, but I wanted to get this written down. Note this does not reduce the overall WAL volume - on the contrary - but it ought to reduce the spike.


- Heikki


