I observed a curious bug in autovac just now. Since plain vacuum avoids calling GetTransactionSnapshot, an autovac worker that happens not to analyze any tables will never call GetTransactionSnapshot at all. This means it will arrive at vac_update_datfrozenxid with RecentGlobalXmin never having been changed from its boot value of FirstNormalTransactionId, which means that it will fail to update the database's datfrozenxid ... or, if the current value of datfrozenxid is past 2 billion, that it will improperly advance datfrozenxid to sometime in the future.
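To make the two failure modes concrete, here is a minimal standalone sketch of the wraparound-aware XID comparison. `xid_precedes` is a re-implementation modeled on `TransactionIdPrecedes`. `would_advance_datfrozenxid` is a hypothetical helper, not a function in the tree. It assumes the final step of vac_update_datfrozenxid only overwrites datfrozenxid when the old value precedes the candidate, and that the candidate is still the boot value of RecentGlobalXmin.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Sketch of the modulo-2^32 comparison: true if id1 is logically older
 * than id2.  Special (non-normal) XIDs fall back to a plain unsigned
 * comparison.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff;

    if (id1 < FirstNormalTransactionId || id2 < FirstNormalTransactionId)
        return id1 < id2;

    diff = (int32_t) (id1 - id2);
    return diff < 0;
}

/*
 * Hypothetical helper: would datfrozenxid be overwritten if the candidate
 * value were an unset RecentGlobalXmin?  Assumes the update happens only
 * when the old datfrozenxid precedes the candidate.
 */
static bool
would_advance_datfrozenxid(TransactionId datfrozenxid)
{
    return xid_precedes(datfrozenxid, FirstNormalTransactionId);
}
```

Under these assumptions, a datfrozenxid of 1000 does not precede 3, so the update is skipped. A datfrozenxid of 3000000000 is more than 2^31 away from 3, so it compares as older and would be overwritten.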
Once you get into this state in a reasonably idle database such as template1, autovac is completely dead in the water: if it thinks template1 needs to be vacuumed for wraparound, then every subsequent worker will be launched at template1, every one will fail to advance its datfrozenxid, rinse and repeat. Even before that happens, the DB's datfrozenxid will prevent clog truncation, which might explain some of the recent complaints.

I've only directly tested this in HEAD, but I suspect the problem goes back a ways.

On reflection I'm not even sure that this is strictly an autovacuum bug. It can be cast more generically as "RecentGlobalXmin getting used without ever having been set", and it sure looks to me like the HOT patch may have introduced a few risks of that sort. I'm thinking that maybe an appropriate fix is to insert a GetTransactionSnapshot call at the beginning of InitPostgres' transaction, thus ensuring that every backend has some vaguely sane value for RecentGlobalXmin before it tries to do any database access.

Another thought is that even with that, an autovac worker is likely to reach vac_update_datfrozenxid with a RecentGlobalXmin value that was computed at the start of its run, and is thus rather old. I wonder why vac_update_datfrozenxid is using the variable at all rather than doing GetOldestXmin? It's not like that function is so performance-critical that it needs to avoid calling GetOldestXmin.

Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData, is it actually necessary for lazy vacuum to avoid setting a snapshot? It seems like it might be a good idea for it to do so in order to keep its RecentGlobalXmin reasonably current.

I've only looked at this in HEAD, but I am thinking that we have a real problem here in both HEAD and 8.3. I'm less sure how bad things are in the older branches. Comments?
regards, tom lane