Hello hackers, We came across an issue where the checkpointer writes to the older timeline while a promotion is ongoing after reaching the recovery point in a PITR, when there are prepared transactions before the recovery point. We came across this issue first in REL_12_STABLE and saw that it also exists in devel.
Setup: PFA a minimal reproduction script repro.txt. After running the script, we notice that the checkpointer has written the end-of-recovery shutdown checkpoint in the previous timeline (tli = 1), i.e. it wrote into the WAL segment 000000010000000000000003 instead of writing to the WAL segment 000000020000000000000003, causing it to overwrite WAL records past the recovery point (please see attached diff output file waldiff.diff) in 000000010000000000000003. Also, performing a subsequent shutdown on the recovered server may cause the next shutdown checkpoint record to be written, again, to the previous timeline, i.e. to 000000010000000000000003. A subsequent server start will fail as the starup process will be unable to find the checkpoint in the latest timeline (000000020000000000000003) and we will get: ... LOG: invalid record length at 0/3016FB8: wante d 24, got 0 LOG: invalid primary checkpoint record PANIC: could not locate a valid checkpoint record ... RCA: When there are prepared transactions in an older timeline, in the checkpointer, a call to CheckPointTwoPhase() and subsequently to XlogReadTwoPhaseData() and subsequently to read_local_xlog_page() leads to the following line: read_upto = GetXLogReplayRecPtr(&ThisTimeLineID); GetXLogReplayRecPtr() will change ThisTimeLineID to 1, in order to read the two phase WAL records in the older timeline. This variable will remain unchanged and the checkpointer ends up writing the checkpoint record into the older WAL segment (when XLogBeginInsert() is called within CreateCheckPoint(), the value is still 1). The value is not synchronized as even if RecoveryInProgress() is called, xlogctl->SharedRecoveryState is not RECOVERY_STATE_DONE (SharedRecoveryInProgress = true in older versions) as the startup process waits for the checkpointer inside RequestCheckpoint() (since recovery_target_action='promote' involves a non-fast promotion). Thus, InitXLOGAccess() is not called and the value of ThisTimeLineID is not updated before the checkpoint record write. Since 1148e22a82e, GetXLogReplayRecPtr() is called with ThisTimeLineID instead of a local variable, within read_local_xlog_page(). PFA a small patch that fixes the problem by explicitly calling InitXLOGAccess() in CheckPointTwoPhase(), after the two phase state data is read, in order to update ThisTimeLineID to the latest timeline. It is okay to call InitXLOGAccess() as it is lightweight and would mostly be a no-op. Regards, Soumyadeep, Kevin and Jimmy VMWare
#! /bin/sh # Primary setup /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data echo "archive_mode = on archive_command = 'cp %p /tmp/archive/%f' max_prepared_transactions = 10" >> /usr/local/pgsql/data/postgresql.conf /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/logfile_p start # Secondary setup /usr/local/pgsql/bin/pg_basebackup -D /usr/local/pgsql/data_s -X stream && touch /usr/local/pgsql/data_s/recovery.signal echo "port = 6432 restore_command = 'cp /tmp/archive/%f %p' recovery_target_name = 'rp' recovery_target_action = 'promote' max_prepared_transactions = 20" >> /usr/local/pgsql/data_s/postgresql.conf # Workload echo "CREATE TABLE foo(i int); BEGIN; INSERT INTO foo VALUES(1); PREPARE TRANSACTION 'fooinsert'; SELECT pg_create_restore_point('rp'); INSERT INTO foo VALUES(2); SELECT pg_switch_wal(); " > /tmp/workload.sql /usr/local/pgsql/bin/psql postgres -f /tmp/workload.sql /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data_s -l /tmp/logfile_s start sleep 5 /usr/local/pgsql/bin/pg_waldump /tmp/archive/000000010000000000000003 > /tmp/archive_000000010000000000000003 /usr/local/pgsql/bin/pg_waldump /usr/local/pgsql/data_s/pg_wal/000000010000000000000003 > /tmp/standby_000000010000000000000003 diff -u /tmp/archive_000000010000000000000003 /tmp/standby_000000010000000000000003 > /tmp/waldiff.diff cat /tmp/waldiff.diff
From dbf5a9f6899bedf28b482fc03a4a2b0571e92e9b Mon Sep 17 00:00:00 2001 From: Soumyadeep Chakraborty <soumyadeep2...@gmail.com> Date: Tue, 2 Mar 2021 17:41:20 -0800 Subject: [PATCH 1/1] Prevent checkpointer from writing to older timeline Co-authored-by: Kevin Yeap <ky...@vmware.com> --- src/backend/access/transam/twophase.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 80d2d20d6cc..081aee6217e 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1740,6 +1740,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon) } LWLockRelease(TwoPhaseStateLock); + if (serialized_xacts > 0) + { + /* + * In order to ensure that we write the checkpoint record to the latest + * timeline, refresh ThisTimeLineID. This is necessary if we had read + * two phase data from an earlier timeline, above. + */ + InitXLOGAccess(); + } + /* * Flush unconditionally the parent directory to make any information * durable on disk. Two-phase files could have been removed and those -- 2.25.1
--- /tmp/archive_000000010000000000000003 2021-03-02 16:53:24.112110058 -0800 +++ /tmp/standby_000000010000000000000003 2021-03-02 16:53:54.296842452 -0800 @@ -40,6 +40,4 @@ rmgr: Heap len (rec/tot): 59/ 59, tx: 540, lsn: 0/03016C28, prev 0/03016A90, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/13003/16384 blk 0 rmgr: Transaction len (rec/tot): 242/ 242, tx: 540, lsn: 0/03016C68, prev 0/03016C28, desc: PREPARE gid fooinsert: 2021-03-02 16:51:52.272477 PST rmgr: XLOG len (rec/tot): 98/ 98, tx: 0, lsn: 0/03016D60, prev 0/03016C68, desc: RESTORE_POINT rp -rmgr: Heap len (rec/tot): 59/ 59, tx: 541, lsn: 0/03016DC8, prev 0/03016D60, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/13003/16384 blk 0 -rmgr: Transaction len (rec/tot): 34/ 34, tx: 541, lsn: 0/03016E08, prev 0/03016DC8, desc: COMMIT 2021-03-02 16:51:52.273870 PST -rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 0/03016E30, prev 0/03016E08, desc: SWITCH +rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 0/03016DC8, prev 0/03016D60, desc: CHECKPOINT_SHUTDOWN redo 0/3016DC8; tli 2; prev tli 1; fpw true; xid 0:541; oid 24576; multi 1; offset 0; oldest xid 532 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown