On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <and...@anarazel.de> wrote:
> I'm not sure what I was thinking "back then", but right now I'd argue
> that the best lock against vacuum isn't a SUE, but announcing the
> correct ->xmin, so you can be sure that clog entries won't be yanked out
> from under you. Potentially with the right flag sets to avoid old enough
> tuples being pruned.
I was just thinking about this some more (and talking it over with Mark) and I think this might actually be a really bad idea. One problem with it is that it means that the oldest-xmin value can go backward, which is something that I think has caused us some problems before. There are some other cases where it can happen, and I'm not sure that there's any necessarily fatal problem with doing it in this case, but it would definitely be a shame if this contrib module broke something for core in a way that was hard to fix.

But let's leave that aside and suppose that there is no fatal problem there. Essentially what we're talking about here is advertising the table's relfrozenxid as our xmin. How old is that likely to be? Maybe pretty old. The default value of vacuum_freeze_table_age is 150 million transactions, and that's just the trigger to start vacuuming; the actual value of age(relfrozenxid) could easily be higher than that. But even if it's only a fraction of that, it's still pretty bad. Advertising an xmin half that old (75 million transactions) is equivalent to keeping a snapshot open for an amount of time equal to however long it takes you to burn through 75 million XIDs. For instance, if you burn 10 million XIDs/hour, that's the equivalent of keeping a snapshot open for 7.5 hours. In other words, it's quite likely that doing this is going to make VACUUM (and HOT pruning) drastically less effective throughout the entire database cluster.

To me, this seems a lot worse than just taking ShareUpdateExclusiveLock on the table. After all, ShareUpdateExclusiveLock will prevent VACUUM from running on that table, but it only affects that one table rather than the whole cluster, and it "only" stops VACUUM from running, which is still better than having it do lots of I/O but not clean anything up.

I think I see another problem with this approach, too: it's racy.
If some other process has entered vac_update_datfrozenxid() and has gotten past the calls to GetOldestXmin() and GetOldestMultiXactId(), and we then advertise an older xmin (and I guess also oldestMXact), it can still go on to update datfrozenxid/datminmxid and then truncate the SLRUs. Even holding XactTruncationLock is insufficient to protect against this race condition, and there doesn't seem to be any other obvious approach, either.

So I would like to back up a minute and lay out the possible solutions as I understand them. The specific problem I'm talking about here is: how do we keep from looking up an XID or MXID whose information might have been truncated away from the relevant SLRU?

1. Take a ShareUpdateExclusiveLock on the table. This prevents VACUUM from running concurrently on this table (which sucks), but it does guarantee that the table's relfrozenxid and relminmxid can't advance, which precludes a concurrent CLOG truncation.

2. Advertise an older xmin and minimum MXID. See above.

3. Acquire XactTruncationLock for each lookup, like pg_xact_status(). One downside here is a lot of extra lock acquisitions, but we can mitigate that to some degree by caching the results of lookups, and by not doing it for XIDs that are newer than our advertised xmin (which must be OK) or at least as old as the newest XID we previously discovered to be unsafe to look up (because those must not be OK either). The problem case is a table with lots of different XIDs that are all new enough to look up but older than our xmin, e.g. a table populated using many single-row inserts. But even if we hit this case, how bad is it really? I don't think XactTruncationLock is particularly hot, so maybe it just doesn't matter very much. We could contend against other sessions checking other tables, or against widespread use of pg_xact_status(), but I think that's about it.
Another downside of this approach is that I'm not sure it does anything to help us with the MXID case; fixing that might require building some new infrastructure similar to XactTruncationLock but for MXIDs.

4. Provide entrypoints for looking up XIDs that fail gently instead of throwing errors. I've got my doubts about how practical this is; if it's easy, why didn't we do that instead of inventing XactTruncationLock?

Maybe there are other options here, too? At the moment, I'm thinking that (2) and (4) are just bad, and so we ought to either do (3) if it doesn't suck too much for performance (which I don't quite see why it should, but it might) or else fall back on (1). (1) doesn't feel clever enough, but it might be better to be not clever enough than to be too clever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company