On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Yeah, I thought about weakening the assertions too, but I couldn't > see a fix of that kind that didn't seem mighty ad-hoc.
I don't really see what's ad-hoc about skipping it in the case where a FATAL is in progress. I mean, skipping a sanity check only in the cases where we know it might pass - and are OK with the fact that it might not pass - seems to me to be an extremely difficult policy to argue against on rational grounds. That's just the definition of writing correct sanity checks. More concretely, the present example seems no different than the ResourceOwner stuff emitting warnings on commit and remaining silent on abort. We could make it complain on both commit and abort, but then it would fail spuriously because there's no other mechanism to release resources in the abort path, so we don't. Similarly here, we have every reason to expect the check to pass during ERROR recovery but there is no reason to expect it to pass during FATAL recovery, yet as coded we will do the test anyway. That's wrong. > Now, there is some room to argue that AtEOXact_CatCache() is just > obsolete and we should remove it altogether; I don't think it's > caught a real bug since we invented resowners in 8.1. Yeah, the same thought crossed my mind. IIUC, we'd crash if a catcache reference were acquired without CurrentResourceOwner being valid, so this is really just a belt-and-suspenders check. > But, again, the larger question to my mind is where else we may have > similar issues. There's certainly never been any methodical attempt > to see whether elog(FATAL) will work everywhere. IIRC, you raised a similar objection back when Bruce added pg_terminate_backend(), which caused that change to be briefly reverted before ultimately being put back. Despite the hardening you did back then, I think it's highly likely that bugs remain in that path, and I am of course not opposed to trying to improve the situation. That having been said, the bugs that remain are probably mostly quite low-probability, because otherwise we'd have found them by now. I think parallel query is likely to flush out a few of the ones that remain by creating a new class of backends that terminate after a single query and may get killed by the leader at any time; that's how we discovered this issue. Fuzz testing could be done, too, e.g. run something like sqlsmith simultaneously in many sessions while killing off backends at random. I'm also not deadly opposed to redesigning the whole mechanism either, but I think that should be approached with a measure of caution: it might end up reducing reliability rather than increasing it. I suggest in any case that we start with a surgical fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers