On 13.5, a WAL flush PANIC occurs after a standby is promoted.

Debugging showed that when a standby skips a missing continuation record 
during recovery, missingContrecPtr is not invalidated after the record is 
skipped. As a result, when the standby is promoted to primary, it writes an 
overwrite_contrecord with the LSN of missingContrecPtr, which is now in the 
past. At flush time, this causes a PANIC. From what I can see, this failure 
scenario can only occur after a standby is promoted.

The overwrite_contrecord was introduced in 13.5 with 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.

Attached are a patch and a TAP test to handle this condition. The patch ensures 
that an overwrite_contrecord is only created if missingContrecPtr is ahead 
of the last WAL record.
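
As a rough illustration of that check (this is not the attached patch itself; 
the helper name and both parameter names are hypothetical, and XLogRecPtr is 
redefined here only so the snippet stands alone):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr: a 64-bit WAL position. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: decide whether an overwrite_contrecord should be
 * written. It returns true only when a missing continuation record is
 * still pending (the pointer is valid) and that pointer lies ahead of
 * the last WAL record, so a stale pointer left over from recovery is
 * ignored.
 */
static bool
should_write_overwrite_contrecord(XLogRecPtr missingContrecPtr,
                                  XLogRecPtr lastRecPtr)
{
    if (missingContrecPtr == InvalidXLogRecPtr)
        return false;           /* nothing was skipped */

    return missingContrecPtr > lastRecPtr;
}
```

The idea is simply that a pointer already behind the last record is treated 
the same as an invalid one.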

To reproduce:
Run the new TAP test recovery/t/029_overwrite_contrecord_promotion.pl without 
the attached patch.

2022-02-22 18:38:15.526 UTC [31138] LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
2022-02-22 18:38:15.535 UTC [31105] LOG:  successfully skipped missing contrecord at 0/1FFC620, overwritten at 2022-02-22 18:38:15.136482+00
2022-02-22 18:38:15.535 UTC [31105] CONTEXT:  WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC620; time 2022-02-22 18:38:15.136482+00
…
…..
2022-02-22 18:38:15.575 UTC [31103] PANIC:  xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088
2022-02-22 18:38:15.575 UTC [31101] LOG:  checkpointer process (PID 31103) was terminated by signal 6: Aborted
….
…..
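
For reference, the LSNs in the log above can be compared numerically. A 
minimal sketch, assuming the usual textual LSN form of two hexadecimal halves 
separated by a slash (parse_lsn is a hypothetical helper, not a PostgreSQL 
function):

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr: a 64-bit WAL position. */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper: parse an LSN written as "hi/lo" (two hex numbers)
 * into a single 64-bit position. Returns 0 on success, -1 if the text
 * does not start with that form. Trailing characters are not rejected.
 */
static int
parse_lsn(const char *text, XLogRecPtr *out)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;

    /* The high half occupies the upper 32 bits of the position. */
    *out = ((XLogRecPtr) hi << 32) | lo;
    return 0;
}
```

Parsed this way, 0/1FFC620 (the skipped contrecord) is lower than 0/2000028 
(the redo position of the OVERWRITE_CONTRECORD record), and the flush request 
0/201EC70 is beyond the flushed position 0/2000088.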

With the patch, the same TAP test succeeds and no PANIC is observed.

Thanks

Sami Imseih
Amazon Web Services



Attachment: 0001-Fix-missing-continuation-record-after-standby-promot.patch
