On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Somebody inserted this into vacuum.c's get_rel_oids(): > > tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); > if (!HeapTupleIsValid(tuple)) > elog(ERROR, "cache lookup failed for relation %u", relid); > > apparently without having read the very verbose comment two lines above, > which points out that we're not taking any lock on the target relation. > So, if that relation is concurrently being dropped, you're likely to > get "cache lookup failed for relation NNNN" rather than anything more > user-friendly.
This has been overlooked during the lookups of 3c3bb99, and by multiple people including me. elog() should never be things users can face, as well as cache lookups. > A minimum-change fix would be to replace the elog() with an ereport > that produces the same "relation does not exist" error you'd have > gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed > a few microseconds earlier. But that feels like its's band-aiding > around the problem. Yeah, that's not right. This is a cache lookup error at the end. > What I'm wondering about is changing the RangeVarGetRelid call to take > ShareUpdateExclusiveLock rather than no lock. That would protect the > syscache lookup, and it would also make the find_all_inheritors call > a lot more meaningful. > > If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped > as soon as we close the caller's transaction, and then we'd acquire > the same or stronger lock inside vacuum_rel(). So that seems fine. > If we're doing an ANALYZE, then the lock would continue to be held > and analyze_rel would merely be acquiring it an extra time, so we'd > actually be removing a race-condition failure scenario for ANALYZE. > This would mean a few more cycles in lock management, but since this > only applies to a manual VACUUM or ANALYZE that specifies a table > name, I'm not too concerned about that. I think that I am +1 on that. Testing such a thing I am not seeing anything wrong either. The call to find_all_inheritors should also use ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid() needs to be reworked. Attached is a proposal of patch. > Thoughts? As long as I don't forget... Another thing currently on HEAD and REL_10_STABLE is that OIDs of partitioned tables are used, but the RangeVar of the parent is used for error reports. This leads to incorrect reports if a partition gets away in the middle of autovacuum as only information about the parent is reported to the user. I am not saying that this needs to be fixed in REL_10_STABLE at this stage though as this would require some refactoring similar to what the patch adding support for VACUUM with multiple relations does. But I digress here. -- Michael
vacuum-partition-locks.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers