PITR promote bug: Checkpointer writes to older timeline

Soumyadeep Chakraborty Tue, 02 Mar 2021 17:56:58 -0800

Hello hackers,

We came across an issue where the checkpointer writes to the older
timeline while a promotion is ongoing after reaching the recovery point
in a PITR, when there are prepared transactions before the recovery
point. We came across this issue first in REL_12_STABLE and saw that it
also exists in devel.


Setup:
PFA a minimal reproduction script repro.txt.

After running the script, we notice that the checkpointer has written
the end-of-recovery shutdown checkpoint in the previous timeline (tli =
1), i.e. it wrote into the WAL segment 000000010000000000000003 instead
of writing to the WAL segment 000000020000000000000003, causing it to
overwrite WAL records past the recovery point (please see attached diff
output file waldiff.diff) in 000000010000000000000003.

Also, performing a subsequent shutdown on the recovered server may cause
the next shutdown checkpoint record to be written, again, to the
previous timeline, i.e. to 000000010000000000000003. A subsequent server
start will fail as the starup process will be unable to find the
checkpoint in the latest timeline (000000020000000000000003) and we will
get:

...
LOG:  invalid record length at 0/3016FB8: wante
d 24, got 0
LOG:  invalid primary checkpoint record
PANIC:  could not locate a valid checkpoint record
...

RCA:

When there are prepared transactions in an older timeline, in the
checkpointer, a call to CheckPointTwoPhase() and subsequently to
XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads
to the following line:

read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);

GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read
the two phase WAL records in the older timeline. This variable will
remain unchanged and the checkpointer ends up writing the checkpoint
record into the older WAL segment (when XLogBeginInsert() is called
within CreateCheckPoint(), the value is still 1). The value is not
synchronized as even if RecoveryInProgress() is called,
xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE
(SharedRecoveryInProgress = true in older versions) as the startup
process waits for the checkpointer inside RequestCheckpoint() (since
recovery_target_action='promote' involves a non-fast promotion). Thus,
InitXLOGAccess() is not called and the value of ThisTimeLineID is not
updated before the checkpoint record write.

Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID
instead of a local variable, within read_local_xlog_page().

PFA a small patch that fixes the problem by explicitly calling
InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data
is read, in order to update ThisTimeLineID to the latest timeline. It is
okay to call InitXLOGAccess() as it is lightweight and would mostly be
a no-op.

Regards,
Soumyadeep, Kevin and Jimmy
VMWare

#! /bin/sh

# Primary setup

/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
echo "archive_mode = on
archive_command = 'cp %p /tmp/archive/%f'
max_prepared_transactions = 10" >> /usr/local/pgsql/data/postgresql.conf

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/logfile_p start

# Secondary setup

/usr/local/pgsql/bin/pg_basebackup -D /usr/local/pgsql/data_s -X stream && 
touch /usr/local/pgsql/data_s/recovery.signal
echo "port = 6432
restore_command = 'cp /tmp/archive/%f %p'
recovery_target_name = 'rp'
recovery_target_action = 'promote'
max_prepared_transactions = 20" >> /usr/local/pgsql/data_s/postgresql.conf

# Workload
echo "CREATE TABLE foo(i int);
BEGIN;
INSERT INTO foo VALUES(1);
PREPARE TRANSACTION 'fooinsert';
SELECT pg_create_restore_point('rp');
INSERT INTO foo VALUES(2);
SELECT pg_switch_wal();
" > /tmp/workload.sql
/usr/local/pgsql/bin/psql postgres -f /tmp/workload.sql

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data_s -l /tmp/logfile_s start

sleep 5

/usr/local/pgsql/bin/pg_waldump /tmp/archive/000000010000000000000003 > 
/tmp/archive_000000010000000000000003
/usr/local/pgsql/bin/pg_waldump 
/usr/local/pgsql/data_s/pg_wal/000000010000000000000003 > 
/tmp/standby_000000010000000000000003

diff -u /tmp/archive_000000010000000000000003 
/tmp/standby_000000010000000000000003 > /tmp/waldiff.diff
cat /tmp/waldiff.diff

From dbf5a9f6899bedf28b482fc03a4a2b0571e92e9b Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2...@gmail.com>
Date: Tue, 2 Mar 2021 17:41:20 -0800
Subject: [PATCH 1/1] Prevent checkpointer from writing to older timeline

Co-authored-by: Kevin Yeap <ky...@vmware.com>
---
 src/backend/access/transam/twophase.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 80d2d20d6cc..081aee6217e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1740,6 +1740,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
 	}
 	LWLockRelease(TwoPhaseStateLock);
 
+	if (serialized_xacts > 0)
+	{
+		/*
+		 * In order to ensure that we write the checkpoint record to the latest
+		 * timeline, refresh ThisTimeLineID. This is necessary if we had read
+		 * two phase data from an earlier timeline, above.
+		 */
+		InitXLOGAccess();
+	}
+
 	/*
 	 * Flush unconditionally the parent directory to make any information
 	 * durable on disk.  Two-phase files could have been removed and those
-- 
2.25.1

--- /tmp/archive_000000010000000000000003	2021-03-02 16:53:24.112110058 -0800
+++ /tmp/standby_000000010000000000000003	2021-03-02 16:53:54.296842452 -0800
@@ -40,6 +40,4 @@
 rmgr: Heap        len (rec/tot):     59/    59, tx:        540, lsn: 0/03016C28, prev 0/03016A90, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/13003/16384 blk 0
 rmgr: Transaction len (rec/tot):    242/   242, tx:        540, lsn: 0/03016C68, prev 0/03016C28, desc: PREPARE gid fooinsert: 2021-03-02 16:51:52.272477 PST
 rmgr: XLOG        len (rec/tot):     98/    98, tx:          0, lsn: 0/03016D60, prev 0/03016C68, desc: RESTORE_POINT rp
-rmgr: Heap        len (rec/tot):     59/    59, tx:        541, lsn: 0/03016DC8, prev 0/03016D60, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/13003/16384 blk 0
-rmgr: Transaction len (rec/tot):     34/    34, tx:        541, lsn: 0/03016E08, prev 0/03016DC8, desc: COMMIT 2021-03-02 16:51:52.273870 PST
-rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/03016E30, prev 0/03016E08, desc: SWITCH 
+rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/03016DC8, prev 0/03016D60, desc: CHECKPOINT_SHUTDOWN redo 0/3016DC8; tli 2; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

PITR promote bug: Checkpointer writes to older timeline

Reply via email to