On Thu, Jun 20, 2024 at 7:42 PM Melanie Plageman <melanieplage...@gmail.com> wrote:
> If vacuum fails to remove a tuple with xmax older than
> VacuumCutoffs->OldestXmin and younger than
> GlobalVisState->maybe_needed, it will ERROR out when determining
> whether or not to freeze the tuple with "cannot freeze committed
> xmax".
>
> In back branches starting with 14, failing to remove tuples older than
> OldestXmin during pruning caused vacuum to infinitely loop in
> lazy_scan_prune(), as investigated on this [1] thread.

This is a great summary.

> We can fix this by always removing tuples considered dead before
> VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby
> has a transaction that sees that tuple as alive, because it will
> simply wait to replay the removal until it would be correct to do so
> or recovery conflict handling will cancel the transaction that sees
> the tuple as alive and allow replay to continue.

I think that this is the right general approach.

> The repro forces a round of index vacuuming after the standby
> reconnects and before pruning a dead tuple whose xmax is older than
> OldestXmin.
>
> At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
> calls GetOldestNonRemovableTransactionId(), thereby updating the
> backend's GlobalVisState and moving maybe_needed backwards.

Right. I saw details exactly consistent with this when I used GDB
against a production instance. I'm glad that you were able to come up
with a repro that involves exactly the same basic elements, including
index page deletion.

--
Peter Geoghegan
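
For readers following along, the window described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the actual pruning code: xid_t, xid_precedes(), removable_before_fix(), removable_after_fix() and hits_freeze_error() are hypothetical names, and the comparison ignores special (non-normal) XIDs. It models a tuple whose xmax is older than OldestXmin but not older than a maybe_needed value that has moved backwards.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit transaction ID. */
typedef uint32_t xid_t;

/*
 * Wraparound-aware "a is older than b" using a signed 32-bit difference.
 * Simplification: special (non-normal) XIDs are not handled here.
 */
static bool
xid_precedes(xid_t a, xid_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Problem case (assumed model): removal is decided only against
 * maybe_needed, so oldest_xmin is never consulted.
 */
static bool
removable_before_fix(xid_t xmax, xid_t oldest_xmin, xid_t maybe_needed)
{
    (void) oldest_xmin;
    return xid_precedes(xmax, maybe_needed);
}

/*
 * Proposed rule (assumed model): a dead tuple whose xmax is older than
 * oldest_xmin is always removable, whatever maybe_needed says.
 */
static bool
removable_after_fix(xid_t xmax, xid_t oldest_xmin, xid_t maybe_needed)
{
    return xid_precedes(xmax, oldest_xmin) ||
           xid_precedes(xmax, maybe_needed);
}

/*
 * A tuple that survived pruning but has a committed xmax older than
 * oldest_xmin is the one that would trip "cannot freeze committed xmax".
 */
static bool
hits_freeze_error(xid_t xmax, xid_t oldest_xmin, bool removed)
{
    return !removed && xid_precedes(xmax, oldest_xmin);
}
```

With oldest_xmin = 1000, maybe_needed = 900 and xmax = 950, the first rule leaves the tuple in place and the freeze check fails, while the second rule removes it during pruning.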