Hi, Thanks ChangAo for v2, and Kyotaro for working through the alternatives. I think v2 is on the right track.
For me the Flush LSN looks correct for the below reasons: 1. The standby can reach it. XLOG_SWITCH (driven by archive_timeout) calls XLogFlush(EndPos) with EndPos = segment boundary. xlogreader treats SWITCH specially and advances NextRecPtr to end-of-segment (XLogDecodeNextRecord()), so lastReplayedEndRecPtr lands on the segment boundary and the consistency check fires. With v2, minRecoveryPoint = segment boundary, which is exactly reachable. This restores the implicit producer-side invariant: pg_basebackup's XLOG_BACKUP_END EndRecPtr, UpdateMinRecoveryPoint() via GetCurrentReplayRecPtr(), and the standby-source path in pg_rewind all yield a value record-driven progress can match. Caveat: XLogBackgroundFlush rounds WriteRqst.Write down to a page boundary, and the asyncXactLSN bump only re-targets when we've already flushed past it. On a busy primary the flush LSN can therefore sit mid-record, but that's fine because live insertions advance the standby past whatever endrec we pin. Not the scenario in this bug, just worth noting that no producer-side or recovery-side fix eliminates every theoretical liveness corner. 2. GetXLogInsertRecPtr() reads CurrBytePos from shared memory not durable. If the source crashes after pg_rewind, it recovers to its flush LSN, leaving the target's minRecoveryPoint pinned to a higher insert LSN which is unreachable forever. flush LSN sits at the segment boundary, which is on disk and recovery cannot regress past it. So this approach is crash-safe 3. perform_rewind() does: copy files -> fetch source control file -> get_current_wal_flush_lsn. Any WAL-logged page on source's disk has page-LSN <= source's flush-LSN at the time it was written (FlushBuffer enforces XLogFlush before eviction; hint-bit pages with checksums use XLogSaveBufferForHint + PageSetLSN). Flush LSN is monotonic, so the value dominates every copied page-LSN. FSM/VM pages are exempt by design (freespace.c uses MarkBufferDirtyHint without WAL) but aren't gated by minRecoveryPoint. Unlogged tables are out of scope. Torn pages are closed by the existing full_page_writes=on precondition. Review comments 1. Commit message understates the fix. It only describes the page-header symptom. The crash-safety property is the stronger argument and the one that resolves the back-and-forth on which LSN to use. Suggest adding: a. "Using the flush LSN is also crash-safe with respect to the source: the insert LSN lives only in shared memory and can be lost on a source crash, leaving the standby's minRecoveryPoint ahead of any LSN the source can subsequently reach." 2. Code comment should explain why flush LSN is sufficient. The current "We must replay to the last WAL flush location" doesn't say why. Suggest: a. "Use the source's flush LSN as the target's minRecoveryPoint: every WAL-logged page we copied has page-LSN <= source's flush LSN at copy time (WAL-before-data), and flush LSN is monotonic. We avoid the insert LSN because it can sit one page-header past a record's end at segment boundaries (where no record will end), and it is not durable, a source crash can leave flush LSN behind an insert LSN we already pinned." 3. Worth a comment in rewind_source.h that the callback must only be invoked against a non-standby source, pg_current_wal_flush_lsn() errors out under recovery. 4. No regression test. We can add a regression test under src/bin/pg_rewind/t/. Otherwise +1 on the v2 direction. Regards, Surya Poondla
