Unable to start replica after failover

Lahnov, Igor Mon, 01 Aug 2022 02:15:50 -0700

Hi,
We're using asynchronous streaming replication together with WAL archiving and 
recovery (we've set these parameters: archive_command, restore_command, 
recovery_target_timeline=latest).
To manage backup and recovery, we're using pg_probackup.
We have a 3-node cluster managed by Patroni on Raft.
While running our cluster, we encounter a recurring problem: after a failover, 
the former leader can be recovered neither with the tools provided by 
PostgreSQL nor with the tools Patroni relies on (pg_rewind).
We've seen a number of different WAL reading errors. For example:
* "could not find previous WAL record at E6F/C2436F50: invalid resource 
manager ID 139 at E6F/C2436F50"; or
* "could not find previous WAL record at 54E/FB348118: unexpected 
pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".
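
For anyone cross-checking these messages, here is a minimal sketch of how an 
LSN maps to a WAL segment file name and a byte offset. It assumes the default 
16 MiB segment size (our assumption; a cluster initialized with a different 
--wal-segsize would need another value), and the helper names parse_lsn and 
wal_location are ours, not PostgreSQL's.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MiB WAL segments


def parse_lsn(text):
    """Turn an LSN such as '54E/FB34A000' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_location(timeline, lsn_text, segment_size=WAL_SEGMENT_SIZE):
    """Return (segment file name, byte offset inside that segment)."""
    lsn = parse_lsn(lsn_text)
    segno = lsn // segment_size
    # Number of segments covered by one value of the high 32 bits of the LSN.
    segs_per_xlogid = 0x100000000 // segment_size
    name = "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )
    return name, lsn % segment_size


# Timeline 5, page starting at 54E/FB34A000
print(wal_location(5, "54E/FB34A000"))
```

With these inputs the sketch yields segment 000000050000054E000000FB and 
offset 3448832, which are the segment and offset named in the second message. 
The page address reported there is 54E/7B34A000, not 54E/FB34A000.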
There have been some other variations but all such errors have the same cause:
1. At startup, the former leader does not immediately get access to the 
backup. As a result, the recovery process uses the local WAL, which inevitably 
leads to a split brain since in most cases there is a lag in replication after 
a failover.
2. Next, the former leader finally manages to read the latest log from 
the backup, and that log has a higher priority than the local log, even though 
the recovery with the local log has already occurred.
As a result, we see the following:
"new timeline 3 forked off current database system timeline 2 before current 
recovery point 54E/FB348150".
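
As we understand it, this message comes from comparing the LSN at which the 
new timeline branched off (recorded in the timeline history file) with the LSN 
the node has already replayed locally. Below is a rough sketch of that 
comparison, not PostgreSQL's actual code. The history line and its switch 
point 54E/FB000000 are hypothetical values chosen for illustration; only the 
recovery point 54E/FB348150 is taken from our log.

```python
def parse_lsn(text):
    """Turn an LSN such as '54E/FB348150' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history_line(line):
    """Split one timeline history line: parent timeline, switch point, reason."""
    parent_tli, switch_lsn, reason = line.split(None, 2)
    return int(parent_tli), switch_lsn, reason.strip()


def forked_before_recovery_point(history_line, recovery_point):
    """True if the new timeline left its parent before what was already replayed."""
    _, switch_lsn, _ = parse_history_line(history_line)
    return parse_lsn(switch_lsn) < parse_lsn(recovery_point)


# Hypothetical line from the history file of timeline 3 (switch point invented).
history_line = "2\t54E/FB000000\tno recovery target specified"

# Recovery point reported in our log.
print(forked_before_recovery_point(history_line, "54E/FB348150"))
```

When the function returns True, the node has already replayed WAL on the old 
timeline past the point where the new one begins, so it cannot simply follow 
the new timeline.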
When pg_rewind is run, it also uses the log from the backup (the lagging log 
from the new leader) instead of the partial log with which the former leader 
has already been started.
Next, pg_rewind returns errors while reading the log from the backup backwards 
in search of the last checkpoint. This is to be expected: once a new leader 
starts, the point of divergence normally ends up in the next timeline, and the 
previous timeline's backup log does not contain a block with the LSN of the 
divergence.
Also, in this case, it is impossible to build a map of the pages that have been 
changed to roll back the changes using pg_rewind.


To avoid the problem, we decided to stop using restore_command. Could you 
please let us know if there is a better solution to the problem we've described?
