On Mon, Aug 22, 2011 at 3:31 AM, daveg <da...@sonic.net> wrote:
> So far I've got:
>
>  - affects system tables
>  - happens very soon after process startup
>  - in 8.4.7 and 9.0.4
>  - not likely to be hardware or OS related
>  - happens in clusters for period of a few second to many minutes
>
> I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> hard to get downtime to pick up new builds. Any other ideas on getting to
> the bottom of this?

I've been thinking this one over, and doing a little testing. I'm still stumped, but I have a few thoughts. What that error message is really saying is that the LOCALLOCK bookkeeping doesn't match the PROCLOCK bookkeeping; it doesn't tell us which one is to blame.

My first thought was that there might be some situation where LockAcquireExtended() gets an interrupt between the time it does the LOCALLOCK lookup and the time it acquires the partition lock. If the interrupt handler were to acquire (but not release) a lock in the meantime, then we'd get confused. However, I can't see how that's possible. I inserted some debugging code to fail an assertion if CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if ImmediateInterruptOK is set on entering the function, and the system still passes regression tests.

My second thought is that perhaps a process is occasionally managing to exit without fully cleaning up the associated PROCLOCK entry. At first glance, it appears that this would explain the observed symptoms. A new backend gets the PGPROC belonging to the guy who didn't clean up after himself, hits the error, and disconnects, sticking himself right back onto the head of the SHM_QUEUE where the next connection will inherit the same PGPROC and hit the same problem. But it's not clear to me what could cause the system to get into this state in the first place, or how it would eventually right itself.

It might be worth kludging up your system to add a test to InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are either NULL or empty, along the lines of the attached patch (which assumes that assertions are enabled; otherwise, put in an elog() of some sort).

Actually, I wonder if we shouldn't move all the SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than doing it over again every time someone calls InitProcess(). Besides being a waste of cycles, it's probably less robust this way.
If there somehow are leftovers in one of those queues, the next successful call to LockReleaseAll() ought to clean up the mess, but of course there's no chance of that working if we've nuked the queue pointers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
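
For anyone reading without the attachment, here is a rough standalone model of the two ideas above: initialize the per-partition queues once at global startup, and have per-backend startup only verify that each queue is NULL or empty. This is not the attached patch and not PostgreSQL source. Every Model*/model_* name is hypothetical, the partition count of 16 is an assumption, and the queue is a plain self-pointing doubly linked list header standing in for SHM_QUEUE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed partition count; a stand-in for NUM_LOCK_PARTITIONS. */
#define MODEL_NUM_LOCK_PARTITIONS 16

/* Hypothetical stand-in for a shared-memory queue header. */
typedef struct ModelQueue
{
    struct ModelQueue *prev;
    struct ModelQueue *next;
} ModelQueue;

/* Hypothetical stand-in for the part of PGPROC that matters here. */
typedef struct ModelProc
{
    ModelQueue myProcLocks[MODEL_NUM_LOCK_PARTITIONS];
} ModelProc;

/* An initialized, empty queue header points back at itself. */
static void
model_queue_init(ModelQueue *queue)
{
    queue->prev = queue->next = queue;
}

/* True if the header was never initialized (NULL links) or is empty. */
static bool
model_queue_is_null_or_empty(const ModelQueue *queue)
{
    if (queue->prev == NULL && queue->next == NULL)
        return true;
    return queue->prev == queue && queue->next == queue;
}

/* True if no partition queue of this proc has leftover entries. */
static bool
model_proc_lock_queues_clean(const ModelProc *proc)
{
    for (int i = 0; i < MODEL_NUM_LOCK_PARTITIONS; i++)
    {
        if (!model_queue_is_null_or_empty(&proc->myProcLocks[i]))
            return false;
    }
    return true;
}

/* One-time setup: initialize every queue exactly once. */
static void
model_init_proc_global(ModelProc *procs, int nprocs)
{
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < MODEL_NUM_LOCK_PARTITIONS; i++)
            model_queue_init(&procs[p].myProcLocks[i]);
}

/*
 * Per-backend startup: check the queues instead of re-initializing
 * them, so leftovers from a previous owner are reported, not erased.
 */
static void
model_init_process(const ModelProc *proc)
{
    assert(model_proc_lock_queues_clean(proc));
}
```

In this model, a slot whose previous owner left an entry behind trips the assertion in model_init_process() instead of having its queue pointers silently overwritten. In a build without assertions, the same predicate could feed an error report instead.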
initprocess-assert.patch
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers