On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > What has happened here is that the second ANALYZE has marked itself > committed in pg_clog and no longer running in the ProcArray, so VACUUM > feels entitled to remove toast tuples that the ANALYZE deleted. However, > the ANALYZE has not yet sent out the sinval messages that would inform > session 2 that its syscache entries are obsolete. In Andrew's report, > presumably the machine was under enough load to slow down ANALYZE at > just this point, and there was a concurrent autovacuum that would have > done the rest of the deed. The problem could only be seen for a short > interval, which squares with his report, and with a similar one from > Tim Uckun back in September. > > Ordinarily, sending out sinval messages post-commit is okay because we > don't release locks until after that, and we suppose that our locks > prevent any other transactions from getting to the point of using > syscache entries that might have been invalidated by our transaction. > However, *we have carefully hacked on ANALYZE until it doesn't take any > locks that would block concurrent queries on the analyzed table.* So > the normal protection against stale syscache entries simply doesn't > work for pg_statistic fetches.
This is very similar to one of the issues that reared its ugly head in regards to Simon's now-reverted patch to lower DDL locking strength. You identified some other issues there as well, but *one* of the issues was that, as in this case, the sinval mechanism fails to provide the necessary synchronization guarantees unless the lock required to reread the updated data conflicts with the lock required to change the data. In that case, "the data" meant "the pg_class entry" or "the pg_attribute" entry whereas here it means "the pg_statistic entry", but I believe the principal is the same. And there as here, (1) there is a fundamental conflict between what the sinval mechanism requires for correctness and what is actually desirable in terms of lock levels from a user experience point of view and (2) it is relatively easy to write code that looks superficially safe but which actually contains subtle race conditions. IIRC, you never thought Simon's patch looked safe, but I'm guessing that this pg_statistic bug has been around for a long time. So I'm wondering if we ought to rethink our position that users of the sinval machinery must provide their own external synchronization through heavyweight locking, and instead build the synchronization into the sinval mechanism itself. One idea I had was to include the XID of the transaction sending the sinval mechanism in every message, and to force clients receiving a message to do XactLockTableWait() for each such XID. That would force the backend reloading its cache to wait until the committing transaction reaches the lock-release phase. If we sent out sinval messages just before removing ourselves from the ProcArray, I think that would more-or-less fix this bug (although maybe I'm missing some reason why it's not practical to send them that early) except that I don't see any way to handle the sinval-reset case, which seems to more or less kill this idea in its tracks. But maybe there's some other mechanism whereby we could combine sending the sinval messages slightly earlier (before ProcArrayEndTransaction) with blocking anyone who processes those messages until after the committing backend finishes ProcArrayEndTransaction. For example, you could add an additional LWLock, which has to be held in exclusive mode by a committing transaction that sends any sinval messages. It must be acquired before sending the sinval messages and can't be released until after ProcArrayEndTransaction() is complete. Anyone processing a sinval message must acquire and release the lock in shared mode before reloading their caches, so that we guarantee that at the time you reread the catalogs, any transactions involved in sending those messages are visible. That's actually a bit coarse-grained; there's probably a better mechanism, but I'm just throwing this out to see if the basic idea has any legs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers