On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch <n...@leadboat.com> wrote: > A PANIC will reinitialize everything relevant, largely resolving the problems > around ERROR during FATAL. It's a heavy-handed solution, but it may well be > the best solution. Efforts to harden CommitTransaction() and > AbortTransaction() seem well-spent, but the additional effort to make FATAL > exit cope where AbortTransaction() or another exit action could not cope seems > to be slicing ever-smaller portions of additional robustness. > > I pondered a variant of that conclusion that distinguished critical cleanup > needs from the rest. Each shared resource (heavyweight locks, buffer pins, > LWLocks) would have an on_shmem_exit() callback that cleans up the resource > under a critical section. (AtProcExit_Buffers() used to fill such a role, but > resowner.c's work during AbortTransaction() has mostly supplanted it.) The > ShutdownPostgres callback would not use a critical section, so lesser failures > in AbortTransaction() would not upgrade to a PANIC. But I'm leaning against > such a complication on the grounds that it would add seldom-tested code paths > posing as much a chance of eroding robustness as bolstering it.
The current situation is just plain weird: in the ERROR-then-ERROR case, we emit a WARNING and bounce right back into AbortTransaction(), and if it doesn't work any better the second time than the first time, we recurse again, and eventually if it fails enough times in a row, we just give up and PANIC. But in the ERROR-then-FATAL case, we *don't* retry AbortTransaction(); instead, we just continue running the rest of the on_shmem_exit callbacks and then exit. So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL. That's bizarre. Given that that's where we are, promoting an ERROR during FATAL processing to PANIC doesn't seem like it's losing much; we're essentially already doing that in the (probably more likely) case of a persistent ERROR during ERROR processing. But since PANIC sucks, I'd rather go the other direction: let's make an ERROR during ERROR processing promote to FATAL. And then let's do what you write above: make sure that there's a separate on-shmem-exit callback for each critical shared memory resource and that we call of those during FATAL processing. It seems to me that that's how things were originally designed to work, but that we've drifted away from it basically because the low-level callbacks to release heavyweight locks and buffer pins turned out to be kinda, uh, slow, and we thought those code paths couldn't be taken anyway (turns out they can). I think we could either make those routines very fast, or arrange only to run that code at all in the case where AbortTransaction() didn't complete successfully. It's true that such code will be rarely run, but the logic is simple enough that I think we can verify it by hand, and it's sure nice to avoid PANICs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers