Hi, On 2023-06-21 15:12:08 -0700, Andres Freund wrote: > When vac_truncate_clog() returns early, due to one of these paths: > > [...] > > Separately, I think it's quite bad that we *silently* return from > vac_truncate_clog() when finding a bogus xid. That's a quite severe condition, > we should at least tell the user about it.
A related issue is that as far as I can tell the determination of what is bogus is bogus. The relevant cutoffs are determined vac_update_datfrozenxid() using: /* * Identify the latest relfrozenxid and relminmxid values that we could * validly see during the scan. These are conservative values, but it's * not really worth trying to be more exact. */ lastSaneFrozenXid = ReadNextTransactionId(); lastSaneMinMulti = ReadNextMultiXactId(); but doing checks based on thos is bogus, because: a) a concurrent create table / truncate / vacuum can update pg_class.relfrozenxid of some relation in the current database to a newer value, after lastSaneFrozenXid already has been determined. If that happens, we skip updating pg_database.datfrozenxid. b) A concurrent vacuum in another database, ending up in vac_truncate_clog(), can compute a newer datfrozenxid. In that case the vac_truncate_clog() with the outdated lastSaneFrozenXid will not truncate the clog (and also forget to release WrapLimitsVacuumLock currently, as reported upthread) and not call SetTransactionIdLimit(). The latter is particularly bad, because that means we might not come out of "database is not accepting commands" land. I think in both cases a later call might fix the issue, but that could be some way out, if autovacuum doesn't see further writes being necessary, and no further write activity happens, because of ""database is not accepting commands". It's not entirely obvious to me how to best fix these. For a second I thought we just need to acquire a snapshot before determining the sane values, but that doesn't work, since we update the relevant fields with heap_inplace_update(). I guess we could just recompute the boundaries before actually believing the catalog values are bogus? I think we also add warnings to these paths, so we actually have a chance to find problems in the field. Greetings, Andres Freund