We have the rm_safe_restartpoint mechanism to ensure that we don't use a checkpoint that splits a multi-level B-tree insertion as a restart point. But to my surprise, we don't have anything to protect against the analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint.
3. Crash, before writing the WAL record of inserting the child pointer in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete split, so it stays incomplete.
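
To make the sequence above concrete, here is a toy model of it. It is only a sketch: the record kinds, the Record class and the helper functions are made up for illustration and do not correspond to actual PostgreSQL structures or WAL record types.

```python
from dataclasses import dataclass


@dataclass
class Record:
    kind: str      # "split", "parent_insert" or "checkpoint" (illustrative names)
    page: int = 0  # child page the record refers to
    redo: int = 0  # checkpoints only: WAL position where replay would start


def recovery_start(wal):
    """Return the REDO pointer of the last checkpoint record, or 0 if none."""
    for rec in reversed(wal):
        if rec.kind == "checkpoint":
            return rec.redo
    return 0


def incomplete_splits_seen(wal, start):
    """Replay wal[start:] and return the splits still missing a parent insert."""
    pending = set()
    for rec in wal[start:]:
        if rec.kind == "split":
            pending.add(rec.page)
        elif rec.kind == "parent_insert":
            pending.discard(rec.page)
    return pending


wal = [Record("split", page=7)]              # step 1: child split is logged
redo = len(wal)                              # step 2: checkpoint begins after the split
wal.append(Record("checkpoint", redo=redo))  # ... and finishes
# step 3: crash here, before the parent_insert record for page 7 is written

seen = incomplete_splits_seen(wal, recovery_start(wal))  # step 4: empty set
```

Replaying from the start of this toy WAL would notice the pending split of page 7, but replaying from the new checkpoint's REDO pointer does not, which is the hole described above.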

In practice that's pretty hard to hit, because a checkpoint takes some time, while locking the parent page and writing the child pointer is usually very quick. But it's possible.

It surprises me that we thought of this when we introduced restartpoints, but this more obvious case during normal operation seems to have been there forever. Nothing very bad happens if you lose the parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can happen at archive recovery too if the checkpoint is caused by pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions that are in progress when a checkpoint starts (at the REDO pointer) must finish before the checkpoint record is written. That will close the issue with restartpoints, archive recovery etc. as well, so we no longer need to worry about this anywhere other than while performing an online checkpoint.

I'm thinking of using the inCommit flag for this, to delay writing the checkpoint record until all incomplete splits are finished. inCommit already protects against a similar race condition, between writing the commit record and flushing the clog page. It will obviously need to be renamed, and we need to double-check that it's safe: b-tree splits take longer, and there's no critical section there like there is in the commit codepath.
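
Roughly, the idea could look like the following single-threaded toy model. This is a sketch under assumptions, not the real code: ToyWal, begin_split, finish_split and the run_backends callback are hypothetical names, and the in_progress set merely stands in for the per-backend flag mentioned above.

```python
class ToyWal:
    """Toy WAL in which a checkpoint waits for in-progress multi-record actions."""

    def __init__(self):
        self.records = []
        self.in_progress = set()  # hypothetical stand-in for the per-backend flag

    def begin_split(self, page):
        # Set the flag before logging the first record of the action.
        self.in_progress.add(page)
        self.records.append(("split", page))

    def finish_split(self, page):
        # Log the parent insertion, then clear the flag.
        self.records.append(("parent_insert", page))
        self.in_progress.discard(page)

    def checkpoint(self, run_backends):
        redo = len(self.records)             # REDO pointer is fixed first
        must_finish = set(self.in_progress)  # actions in progress at that point
        # Delay the checkpoint record until every such action has finished.
        # Actions that start after the REDO pointer are not waited for.
        while must_finish & self.in_progress:
            run_backends()                   # stand-in for sleeping and rechecking
        self.records.append(("checkpoint", redo))
        return redo


wal = ToyWal()
wal.begin_split(7)
# The checkpoint starts while the split of page 7 is incomplete; the callback
# plays the role of the backend completing the split in the meantime.
wal.checkpoint(lambda: wal.finish_split(7))
kinds = [kind for kind, _ in wal.records]
```

In this model the parent insertion always lands in the WAL before the checkpoint record, so a checkpoint can never be used as a recovery starting point while a split that began before its REDO pointer is still incomplete. The loop never terminates if the action never finishes, which is one reason the longer duration of b-tree splits needs a closer look.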

Comments?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
