Hi hackers!

We hit an interesting bug(?) in archive recovery during a planned HA
switchover: one of the standbys followed the wrong timeline.
I think I have seen this problem discussed on pgsql-hackers before, but
searching did not turn it up.

Here is the switchover algorithm we employ:
0. The Old Primary's synchronous_standby_names is set to only the New
Primary.
1. The primary_conninfo of every standby (except the New Primary) is
pointed at the New Primary, so they are now cascading.
2. The Old Primary is told to shut down, without waiting for completion.
3. After 5 seconds the New Primary is promoted.
4. Once the Old Primary has shut down, it is rewound if the shutdown
took more than 5 seconds.

Here are the logs from the standby that stayed on the Old Primary's
timeline (sorry for the verbosity):


2026-02-18 11:50:01.173 MSK,,,2534,,69957d39.9e6,1,,2026-02-18 11:50:01 
MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A3000000 on 
timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.017 MSK,,,2318,,69957d35.90e,7,,2026-02-18 11:49:57 
MSK,,0,LOG,00000,"received SIGHUP, reloading configuration 
files",,,,,,,,,"","postmaster",,0
2026-02-18 11:52:01.018 MSK,,,2318,,69957d35.90e,8,,2026-02-18 11:49:57 
MSK,,0,LOG,00000,"parameter ""primary_conninfo"" changed to 
""host=rc1a-skip.mdb.yandexcloud.net port=5432 user=repl keepalives_idle=15 
keepalives_interval=3 keepalives_count=5 
application_name=rc1d_skip_mdb_yandexcloud_n>
2026-02-18 11:52:01.020 MSK,,,2320,,69957d36.910,26,,2026-02-18 11:49:58 
MSK,1/0,0,LOG,00000,"WAL receiver process shutdown 
requested",,,,,,,,,"","startup",,0
2026-02-18 11:52:01.020 MSK,,,2534,,69957d39.9e6,2,,2026-02-18 11:50:01 
MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator 
command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.186 MSK,,,4227,,69957db1.1083,1,,2026-02-18 11:52:01 
MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A7000000 on 
timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.193 MSK,,,4227,,69957db1.1083,2,,2026-02-18 11:52:01 
MSK,,0,LOG,00000,"replication terminated by primary server","End of WAL reached 
on timeline 9 at 2EC2/A71F0820.",,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.198 MSK,,,4227,,69957db1.1083,3,,2026-02-18 11:52:01 
MSK,,0,LOG,00000,"fetching timeline history file for timeline 10 from primary 
server",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.207 MSK,,,4227,,69957db1.1083,4,,2026-02-18 11:52:01 
MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator 
command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.585 MSK,,,2320,,69957d36.910,27,,2026-02-18 11:49:58 
MSK,1/0,0,LOG,00000,"restored log file ""0000000A.history"" from 
archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.588 MSK,,,2320,,69957d36.910,28,,2026-02-18 11:49:58 
MSK,1/0,0,LOG,00000,"new target timeline is 10",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.861 MSK,,,2320,,69957d36.910,29,,2026-02-18 11:49:58 
MSK,1/0,0,LOG,00000,"restored log file ""0000000900002EC2000000A7"" from 
archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.122 MSK,,,2320,,69957d36.910,30,,2026-02-18 11:49:58 
MSK,1/0,0,LOG,00000,"unexpected pageaddr 2EC2/8000000 in log segment 
0000000900002EC2000000A8, offset 0",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,1,,2026-02-18 11:52:13 
MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A8000000 on 
timeline 10",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,2,,2026-02-18 11:52:13 
MSK,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR:  requested 
starting point 2EC2/A8000000 is ahead of the WAL flush position of this server 
2EC2/A71F6AE0",,,,,,,,,"","walreceiver",,0

One interesting part: the walreceiver that was "fetching timeline
history file for timeline 10 from primary" was killed, probably by the
startup process reaching the end of the timeline. (That is a separate
issue worth looking at, but a survivable one: the archive is there as
a fallback.)

We restored the history file successfully from the archive and decided
to follow timeline 10. But the TL10 segment 0000000A00002EC2000000A7
was not archived yet, while the TL9 segment 0000000900002EC2000000A7
already was. So we took 0000000900002EC2000000A7 and applied it. Now
we are beyond the timeline switch point!

The archive contained the new timeline's history file (so recovery
correctly identified TL10 as the target) but the WAL segment at the
switch-point position had not yet been archived for TL10.  The same
segment number existed on TL9 (the old primary continued writing after
promotion).  Recovery applied it:

  restored log file "0000000900002EC2000000A7" from archive

That segment carries post-divergence data from the old primary.
Applying it is wrong, and the damage is silent: we cannot get to
timeline 10 now.


Root cause: XLogFileReadAnyTLI iterates expectedTLEs newest-first and
falls back to older timelines when a segment is not found.  The
existing beginseg check is meant to prevent this, but it only guards
the final timeline's switch point (targetBeginSeg = target TL's
begin_seg).  In a three-timeline chain (TL1 -> TL2 -> TL3), an absent
TL2 segment at TL2's own switch point is not guarded because
segno < TL3.begin_seg, so TL1's divergent segment is used.

Fix: replace the targetBeginSeg pre-check with a found_eligible flag.
The invariant is simple: for any segment, exactly one timeline owns it
-- the newest one whose begin_seg <= segno.  If that segment is absent,
recovery must stop, not fall back.  Once the loop identifies the first
eligible timeline and fails to open its segment, it breaks rather than
continuing to older timelines.

Two TAP tests cover this:
 - t/052: two-timeline case (the original scenario)
 - t/053: three-timeline case that the old targetBeginSeg guard missed

I think the same fix is correct even if there are multiple switch
points in one segment.

I recognise this changes existing behaviour.  Someone who intentionally
archives only a parent timeline's segments and relies on fallback will
now see recovery stall instead of silently proceeding.  I think that
outcome is strictly better: the current behaviour opens a window for
applying divergent data with no error, which is worse than a clear
stall waiting for the right segment.

Two related issues I'll address in separate threads later:
 - recovery_target_timeline='latest' picks by timeline ID.  In a
   partition scenario the numerically highest TL may not have the most
   data.  Choosing by end LSN would be safer. Maybe even at some point
   we will have to stop using incremented numbers for timelines...
 - The startup-kills-walreceiver interaction that created the fallback
   to archive in the first place.

Patch attached. WDYT?


Best regards, Andrey Borodin.

Attachment: 0001-Fix-archive-recovery-falling-back-to-wrong-timeline-.patch
