Re: [HACKERS] TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

Robert Haas Tue, 25 Oct 2011 19:33:33 -0700

On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> What has happened here is that the second ANALYZE has marked itself
> committed in pg_clog and no longer running in the ProcArray, so VACUUM
> feels entitled to remove toast tuples that the ANALYZE deleted.  However,
> the ANALYZE has not yet sent out the sinval messages that would inform
> session 2 that its syscache entries are obsolete.  In Andrew's report,
> presumably the machine was under enough load to slow down ANALYZE at
> just this point, and there was a concurrent autovacuum that would have
> done the rest of the deed.  The problem could only be seen for a short
> interval, which squares with his report, and with a similar one from
> Tim Uckun back in September.
>
> Ordinarily, sending out sinval messages post-commit is okay because we
> don't release locks until after that, and we suppose that our locks
> prevent any other transactions from getting to the point of using
> syscache entries that might have been invalidated by our transaction.
> However, *we have carefully hacked on ANALYZE until it doesn't take any
> locks that would block concurrent queries on the analyzed table.*  So
> the normal protection against stale syscache entries simply doesn't
> work for pg_statistic fetches.


This is very similar to one of the issues that reared its ugly head in
regards to Simon's now-reverted patch to lower DDL locking strength.
You identified some other issues there as well, but *one* of the
issues was that, as in this case, the sinval mechanism fails to
provide the necessary synchronization guarantees unless the lock
required to reread the updated data conflicts with the lock required
to change the data.  In that case, "the data" meant "the pg_class
entry" or "the pg_attribute" entry whereas here it means "the
pg_statistic entry", but I believe the principal is the same.  And
there as here, (1) there is a fundamental conflict between what the
sinval mechanism requires for correctness and what is actually
desirable in terms of lock levels from a user experience point of view
and (2) it is relatively easy to write code that looks superficially
safe but which actually contains subtle race conditions.  IIRC, you
never thought Simon's patch looked safe, but I'm guessing that this
pg_statistic bug has been around for a long time.

So I'm wondering if we ought to rethink our position that users of the
sinval machinery must provide their own external synchronization
through heavyweight locking, and instead build the synchronization
into the sinval mechanism itself.  One idea I had was to include the
XID of the transaction sending the sinval mechanism in every message,
and to force clients receiving a message to do XactLockTableWait() for
each such XID.  That would force the backend reloading its cache to
wait until the committing transaction reaches the lock-release phase.
If we sent out sinval messages just before removing ourselves from the
ProcArray, I think that would more-or-less fix this bug (although
maybe I'm missing some reason why it's not practical to send them that
early) except that I don't see any way to handle the sinval-reset
case, which seems to more or less kill this idea in its tracks.

But maybe there's some other mechanism whereby we could combine
sending the sinval messages slightly earlier (before
ProcArrayEndTransaction) with blocking anyone who processes those
messages until after the committing backend finishes
ProcArrayEndTransaction.  For example, you could add an additional
LWLock, which has to be held in exclusive mode by a committing
transaction that sends any sinval messages.  It must be acquired
before sending the sinval messages and can't be released until after
ProcArrayEndTransaction() is complete.  Anyone processing a sinval
message must acquire and release the lock in shared mode before
reloading their caches, so that we guarantee that at the time you
reread the catalogs, any transactions involved in sending those
messages are visible.

That's actually a bit coarse-grained; there's probably a better
mechanism, but I'm just throwing this out to see if the basic idea has
any legs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

Reply via email to