Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
daveg writes: > On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote: >> daveg writes: >>> The first version we saw it on was 8.4.7. >> Yeah, you said that. I was wondering what you'd last run before 8.4.7. > Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7 > since shortly after it was released. Prior to that we have had all the major > and most of the minor releases since 7.1. Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and I can't find anything that looks like it could possibly be related. So at this point I'm inclined to think that the bug is older than that, but your usage patterns changed so that you started to tickle it. Can you think of any changes in your usage that might date to around that time, and would somehow be connected to backend startup/shutdown? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote: > daveg writes: > > On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote: > >> BTW ... what were the last versions you were running on which you had > >> *not* seen the problem? (Just wondering about the possibility that we > >> back-patched some "fix" that broke things. It would be good to have > >> a version range before trawling the commit logs.) > > > The first version we saw it on was 8.4.7. > > Yeah, you said that. I was wondering what you'd last run before 8.4.7. Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7 since shortly after it was released. Prior to that we have had all the major and most of the minor releases since 7.1. -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 7, 2011 at 6:25 PM, Tom Lane wrote: > Robert Haas writes: >> I thought about an error exit from client authentication, and that's a >> somewhat appealing explanation, but I can't quite see why we wouldn't >> clean up there the same as anywhere else. The whole mechanism feels a >> bit rickety to me - we don't actually release locks; we just abort the >> transaction and *assume* that will cause locks to get released. > > Well, transaction abort will call LockReleaseAll, which is carefully > coded to clean up the proclock lists regardless of what is in the > locallocks table, so I'm not sure why you find that any more rickety > than anything else. Well, it's very clear that you CAN orphan locks if a backend holding a session lock ever does CHECK_FOR_INTERRUPTS() outside of a transaction. Try the attached patch. rhaas=# vacuum full foo; FATAL: terminating connection due to administrator command FATAL: terminating connection due to administrator command The connection to the server was lost. Attempting reset: Succeeded. rhaas=# vacuum full foo; ERROR: lock AccessExclusiveLock on object 16384/1431013/0 is already held Now, I don't see any evidence of a live bug here (and on further thought it can't be Dave's bug because he is orphaning AccessShareLocks, not AccessExclusiveLocks), but I find this pretty convincing as a demonstration of ricketiness. It is certainly not obvious on its face that a misplaced CHECK_FOR_INTERRUPTS() can result in a backend exiting without cleaning up its locks, and I'd argue its a bad idea to leave it that way even if there's no user-visible problem there today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company break-vacuum.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
daveg writes: > On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote: >> BTW ... what were the last versions you were running on which you had >> *not* seen the problem? (Just wondering about the possibility that we >> back-patched some "fix" that broke things. It would be good to have >> a version range before trawling the commit logs.) > The first version we saw it on was 8.4.7. Yeah, you said that. I was wondering what you'd last run before 8.4.7. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote: > daveg writes: > > Also, this is very intermittant, we have seen it only in recent months > > on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what > > feels like a few times a month. Possibly some new application behaviour > > is provoking it, but I have no guesses as to what. > > BTW ... what were the last versions you were running on which you had > *not* seen the problem? (Just wondering about the possibility that we > back-patched some "fix" that broke things. It would be good to have > a version range before trawling the commit logs.) The first version we saw it on was 8.4.7. -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
daveg writes: > Also, this is very intermittant, we have seen it only in recent months > on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what > feels like a few times a month. Possibly some new application behaviour > is provoking it, but I have no guesses as to what. BTW ... what were the last versions you were running on which you had *not* seen the problem? (Just wondering about the possibility that we back-patched some "fix" that broke things. It would be good to have a version range before trawling the commit logs.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
daveg writes: > On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote: >> ... But maybe it'd be interesting for Dave to stick a >> LockReleaseAll call into ProcKill() and see if that makes things better. >> (Dave: test that before you put it in production, I'm not totally sure >> it's safe.) > Re safety, what is the worst case here? I think a failure would be pretty obvious --- if it gets through regression tests it's probably fine. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote: > Robert Haas writes: > > I thought about an error exit from client authentication, and that's a > > somewhat appealing explanation, but I can't quite see why we wouldn't > > clean up there the same as anywhere else. The whole mechanism feels a > > bit rickety to me - we don't actually release locks; we just abort the > > transaction and *assume* that will cause locks to get released. > > Well, transaction abort will call LockReleaseAll, which is carefully > coded to clean up the proclock lists regardless of what is in the > locallocks table, so I'm not sure why you find that any more rickety > than anything else. But maybe it'd be interesting for Dave to stick a > LockReleaseAll call into ProcKill() and see if that makes things better. > (Dave: test that before you put it in production, I'm not totally sure > it's safe.) Re safety, what is the worst case here? Also, this is very intermittant, we have seen it only in recent months on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what feels like a few times a month. Possibly some new application behaviour is provoking it, but I have no guesses as to what. -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 06:35:08PM -0400, Tom Lane wrote: > daveg writes: > > It does not seem restricted to pg_authid: > > 2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on > > object 16403/2615/0 > > And I think I've seen it on other tables too. > > Hmm. 2615 = pg_namespace, which most likely is the first catalog > accessed by just about any SQL command that's going to access tables at > all, so I suspect that this is mostly just a "the first access failed" > thing and not something peculiar to pg_namespace. But we still don't > have a clue why the locks are not getting released by the previous > owner of the PGPROC slot. Have you trawled your logs to see if there's > any sign of any distress at all, shortly before the problem starts to > happen? Will do, but its a pretty big haystack. Sure wish I knew what the needle looked like. ;-) -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
daveg writes: > It does not seem restricted to pg_authid: > 2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on > object 16403/2615/0 > And I think I've seen it on other tables too. Hmm. 2615 = pg_namespace, which most likely is the first catalog accessed by just about any SQL command that's going to access tables at all, so I suspect that this is mostly just a "the first access failed" thing and not something peculiar to pg_namespace. But we still don't have a clue why the locks are not getting released by the previous owner of the PGPROC slot. Have you trawled your logs to see if there's any sign of any distress at all, shortly before the problem starts to happen? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
Robert Haas writes: > I thought about an error exit from client authentication, and that's a > somewhat appealing explanation, but I can't quite see why we wouldn't > clean up there the same as anywhere else. The whole mechanism feels a > bit rickety to me - we don't actually release locks; we just abort the > transaction and *assume* that will cause locks to get released. Well, transaction abort will call LockReleaseAll, which is carefully coded to clean up the proclock lists regardless of what is in the locallocks table, so I'm not sure why you find that any more rickety than anything else. But maybe it'd be interesting for Dave to stick a LockReleaseAll call into ProcKill() and see if that makes things better. (Dave: test that before you put it in production, I'm not totally sure it's safe.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 04:55:24PM -0400, Tom Lane wrote: > Robert Haas writes: > > Tom's right to be skeptical of my theory, because it would require a > > CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the > > pathways that use session-level locks, and I can't find one. > > More to the point: session-level locks are released on error. The only > way to get out of a transaction while still holding one is for the > VACUUM-or-whichever-command code to deliberately commit and exit while > still holding it. An error exit path would release the lock. > > > OTOH, I'm skeptical of the theory that this involves userlocks, > > because this whole thread started because of a complaint about lock > > 0/1260/0 already being held. That ain't no userlock. > > Yeah, and for that matter it seems to let VACUUM off the hook too. > If we assume that the reported object ID is non-corrupt (and if it's > always the same, that seems like the way to bet) then this is a lock > on pg_authid. > > Hmmm ... could the pathway involve an error exit from client > authentication? We're still finding bugs in the 9.0 rewrite of > auth-time database access. It does not seem restricted to pg_authid: 2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on object 16403/2615/0 And I think I've seen it on other tables too. -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 7, 2011 at 4:55 PM, Tom Lane wrote: > Yeah, and for that matter it seems to let VACUUM off the hook too. > If we assume that the reported object ID is non-corrupt (and if it's > always the same, that seems like the way to bet) then this is a lock > on pg_authid. > > Hmmm ... could the pathway involve an error exit from client > authentication? We're still finding bugs in the 9.0 rewrite of > auth-time database access. Well, according to Dave's report upthread, it's not only this one relation: DG> The recent errors are: DG> lock AccessShareLock on object 16403/2615/0 is already held DG> which is for pg_namespace in database c23. I thought about an error exit from client authentication, and that's a somewhat appealing explanation, but I can't quite see why we wouldn't clean up there the same as anywhere else. The whole mechanism feels a bit rickety to me - we don't actually release locks; we just abort the transaction and *assume* that will cause locks to get released. And it should. But there's a bit more action-at-a-distance there than I'd ideally like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
Robert Haas writes: > Tom's right to be skeptical of my theory, because it would require a > CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the > pathways that use session-level locks, and I can't find one. More to the point: session-level locks are released on error. The only way to get out of a transaction while still holding one is for the VACUUM-or-whichever-command code to deliberately commit and exit while still holding it. An error exit path would release the lock. > OTOH, I'm skeptical of the theory that this involves userlocks, > because this whole thread started because of a complaint about lock > 0/1260/0 already being held. That ain't no userlock. Yeah, and for that matter it seems to let VACUUM off the hook too. If we assume that the reported object ID is non-corrupt (and if it's always the same, that seems like the way to bet) then this is a lock on pg_authid. Hmmm ... could the pathway involve an error exit from client authentication? We're still finding bugs in the 9.0 rewrite of auth-time database access. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 7, 2011 at 4:22 PM, daveg wrote: > Yes, we make extensive use of advisory locks. That was my thought too when > Robert mentioned session level locks. > > I'm happy to add any additional instrumentation, but my client would be > happier to actually run it if there was a way to recover from this without > an unplanned outage. Is there something I can do when the patch detects the > problem to be able to continue without a restart? Well, basically, you want to do the same thing that LockRelease() would do - remove the PROCLOCK from the SHM_QUEUE and delete it from the hash table, adjusting the counts in the LOCK object as appropriate. If you just ignore the failure, you'll get the "blah blah blah is already held" messages you were having before. Tom's right to be skeptical of my theory, because it would require a CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the pathways that use session-level locks, and I can't find one. OTOH, I'm skeptical of the theory that this involves userlocks, because this whole thread started because of a complaint about lock 0/1260/0 already being held. That ain't no userlock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 07, 2011 at 10:20:24AM -0400, Tom Lane wrote: > Robert Haas writes: > > After spending some time staring at the code, I do have one idea as to > > what might be going on here. When a backend is terminated, > > ShutdownPostgres() calls AbortOutOfAnyTransaction() and then > > LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all > > user locks, and the first one (or so we suppose) releases all normal > > locks as part of aborting the transaction. But what if there's no > > transaction in progress? In that case, AbortOutOfAnyTransaction() > > won't do anything - which is fine as far as transaction-level locks > > go, because we shouldn't be holding any of those anyway if there's no > > transaction in progress. However, if we hold a session-level lock at > > that point, then we'd orphan it. We don't make much use of session > > locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE > > INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on > > standby servers, redo of DROP DATABASE actions. Any chance one of > > those died or was killed off around the time this happened? > > I don't believe this theory at all, because if that were the issue, > we'd have heard about it long since. The correct theory has to involve > a very-little-used triggering condition. At the moment I'm wondering > about advisory (userspace) locks ... Dave, do your apps use any of those? Yes, we make extensive use of advisory locks. That was my thought too when Robert mentioned session level locks. I'm happy to add any additional instrumentation, but my client would be happier to actually run it if there was a way to recover from this without an unplanned outage. Is there something I can do when the patch detects the problem to be able to continue without a restart? Is is save to just reset the proclock queue? I don't think they would mind leaking locks, for instance, and a later planned restart to clear it up as much as they mind unscheduled downtime. Thank -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
Robert Haas writes: > After spending some time staring at the code, I do have one idea as to > what might be going on here. When a backend is terminated, > ShutdownPostgres() calls AbortOutOfAnyTransaction() and then > LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all > user locks, and the first one (or so we suppose) releases all normal > locks as part of aborting the transaction. But what if there's no > transaction in progress? In that case, AbortOutOfAnyTransaction() > won't do anything - which is fine as far as transaction-level locks > go, because we shouldn't be holding any of those anyway if there's no > transaction in progress. However, if we hold a session-level lock at > that point, then we'd orphan it. We don't make much use of session > locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE > INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on > standby servers, redo of DROP DATABASE actions. Any chance one of > those died or was killed off around the time this happened? I don't believe this theory at all, because if that were the issue, we'd have heard about it long since. The correct theory has to involve a very-little-used triggering condition. At the moment I'm wondering about advisory (userspace) locks ... Dave, do your apps use any of those? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Wed, Sep 7, 2011 at 5:16 AM, daveg wrote: > On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote: >> On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote: >> > So far I've got: >> > >> > - affects system tables >> > - happens very soon after process startup >> > - in 8.4.7 and 9.0.4 >> > - not likely to be hardware or OS related >> > - happens in clusters for period of a few second to many minutes >> > >> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's >> > hard to get downtime to pick up new builds. Any other ideas on getting to >> > the bottom of this? >> >> I've been thinking this one over, and doing a little testing. I'm >> still stumped, but I have a few thoughts. What that error message is >> really saying is that the LOCALLOCK bookkeeping doesn't match the >> PROCLOCK bookkeeping; it doesn't tell us which one is to blame. > ... >> My second thought is that perhaps a process is occasionally managing >> to exit without fully cleaning up the associated PROCLOCK entry. At >> first glance, it appears that this would explain the observed >> symptoms. A new backend gets the PGPROC belonging to the guy who >> didn't clean up after himself, hits the error, and disconnects, >> sticking himself right back on to the head of the SHM_QUEUE where the >> next connection will inherit the same PGPROC and hit the same problem. >> But it's not clear to me what could cause the system to get into this >> state in the first place, or how it would eventually right itself. >> >> It might be worth kludging up your system to add a test to >> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are >> either NULL or empty, along the lines of the attached patch (which >> assumes that assertions are enabled; otherwise, put in an elog() of >> some sort). Actually, I wonder if we shouldn't move all the >> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than >> doing it over again every time someone calls InitProcess(). Besides >> being a waste of cycles, it's probably less robust this way. If >> there somehow are leftovers in one of those queues, the next >> successful call to LockReleaseAll() ought to clean up the mess, but of >> course there's no chance of that working if we've nuked the queue >> pointers. > > I did this in the elog flavor as we don't build production images with > asserts. > It has been running on all hosts for a few days. Today it hit the extra > checks in initproc. > > 00:02:32.782 8626 [unknown] [unknown] LOG: connection received: host=bk0 > port=42700 > 00:02:32.783 8627 [unknown] [unknown] LOG: connection received: host=op2 > port=45876 > 00:02:32.783 8627 d61 apps FATAL: Initprocess myProclocks[4] not empty: > queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) > 00:02:32.783 8626 d35 postgres LOG: connection authorized: user=postgres > database=c35 > 00:02:32.783 21535 LOG: server process (PID 8627) exited with exit code 1 > 00:02:32.783 21535 LOG: terminating any other active server processes > 00:02:32.783 8626 c35 postgres WARNING: terminating connection because of > crash of another server process > > The patch that produced this is attached. If you can think of anything I > can add to this to help I'd be happy to do so. Also, can I clean this up > and continue somehow? Maybe clear the queue instead having to have a restart? > Or is there a way to just pause this proc here, maybe mark it not to be used > and exit, or just to sleep forever so I can debug later? I think what we really need to know here is: when the debugging code you injected here fires, what happened to the previous owner of that PGPROC that caused it to not clean up its state properly? The PGPROC that 8627 inherited in the above example is obviously messed up - but what did the last guy do that messed it up? It would be nice to log the address of the PGPROC in every log message here - then you could go back and look to see whether the last guy terminated in some unusual way. In the meantime, could you could look back a couple of minutes from the time of the above occurrence and see if there are any FATAL errors, or usual ERRORs? After spending some time staring at the code, I do have one idea as to what might be going on here. When a backend is terminated, ShutdownPostgres() calls AbortOutOfAnyTransaction() and then LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all user locks, and the first one (or so we suppose) releases all normal locks as part of aborting the transaction. But what if there's no transaction in progress? In that case, AbortOutOfAnyTransaction() won't do anything - which is fine as far as transaction-level locks go, because we shouldn't be holding any of those anyway if there's no transaction in progress. However, if we hold a session-level lock at that point, then we'd orphan it. We don't make much use of session locks. As far as I can see, they are used by (1) VACUUM, (2) CRE
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote: > On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote: > > So far I've got: > > > > - affects system tables > > - happens very soon after process startup > > - in 8.4.7 and 9.0.4 > > - not likely to be hardware or OS related > > - happens in clusters for period of a few second to many minutes > > > > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's > > hard to get downtime to pick up new builds. Any other ideas on getting to > > the bottom of this? > > I've been thinking this one over, and doing a little testing. I'm > still stumped, but I have a few thoughts. What that error message is > really saying is that the LOCALLOCK bookkeeping doesn't match the > PROCLOCK bookkeeping; it doesn't tell us which one is to blame. ... > My second thought is that perhaps a process is occasionally managing > to exit without fully cleaning up the associated PROCLOCK entry. At > first glance, it appears that this would explain the observed > symptoms. A new backend gets the PGPROC belonging to the guy who > didn't clean up after himself, hits the error, and disconnects, > sticking himself right back on to the head of the SHM_QUEUE where the > next connection will inherit the same PGPROC and hit the same problem. > But it's not clear to me what could cause the system to get into this > state in the first place, or how it would eventually right itself. > > It might be worth kludging up your system to add a test to > InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are > either NULL or empty, along the lines of the attached patch (which > assumes that assertions are enabled; otherwise, put in an elog() of > some sort). Actually, I wonder if we shouldn't move all the > SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than > doing it over again every time someone calls InitProcess(). Besides > being a waste of cycles, it's probably less robust this way. If > there somehow are leftovers in one of those queues, the next > successful call to LockReleaseAll() ought to clean up the mess, but of > course there's no chance of that working if we've nuked the queue > pointers. I did this in the elog flavor as we don't build production images with asserts. It has been running on all hosts for a few days. Today it hit the extra checks in initproc. 00:02:32.782 8626 [unknown] [unknown] LOG: connection received: host=bk0 port=42700 00:02:32.783 8627 [unknown] [unknown] LOG: connection received: host=op2 port=45876 00:02:32.783 8627 d61 apps FATAL: Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 00:02:32.783 8626 d35 postgres LOG: connection authorized: user=postgres database=c35 00:02:32.783 21535 LOG: server process (PID 8627) exited with exit code 1 00:02:32.783 21535 LOG: terminating any other active server processes 00:02:32.783 8626 c35 postgres WARNING: terminating connection because of crash of another server process The patch that produced this is attached. If you can think of anything I can add to this to help I'd be happy to do so. Also, can I clean this up and continue somehow? Maybe clear the queue instead having to have a restart? Or is there a way to just pause this proc here, maybe mark it not to be used and exit, or just to sleep forever so I can debug later? Thanks -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. --- postgresql-9.0.4/src/backend/storage/lmgr/proc.c2011-04-14 20:15:53.0 -0700 +++ postgresql-9.0.4.dg/src/backend/storage/lmgr/proc.c 2011-08-23 17:30:03.505176019 -0700 @@ -323,7 +323,15 @@ MyProc->waitLock = NULL; MyProc->waitProcLock = NULL; for (i = 0; i < NUM_LOCK_PARTITIONS; i++) + { + SHM_QUEUE *queue = &(MyProc->myProcLocks[i]); + if (! (!queue->prev || queue->prev == queue || + !queue->next || queue->next == queue) + ) + elog(FATAL, "Initprocess myProclocks[%d] not empty: queue %p (prev %p, next %p) ", + i, queue, queue->prev, queue->next); SHMQueueInit(&(MyProc->myProcLocks[i])); + } MyProc->recoveryConflictPending = false; /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote: > So far I've got: > > - affects system tables > - happens very soon after process startup > - in 8.4.7 and 9.0.4 > - not likely to be hardware or OS related > - happens in clusters for period of a few second to many minutes > > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's > hard to get downtime to pick up new builds. Any other ideas on getting to > the bottom of this? I've been thinking this one over, and doing a little testing. I'm still stumped, but I have a few thoughts. What that error message is really saying is that the LOCALLOCK bookkeeping doesn't match the PROCLOCK bookkeeping; it doesn't tell us which one is to blame. My first thought was that there might be some situation where LockAcquireExtended() gets an interrupt between the time it does the LOCALLOCK lookup and the time it acquires the partition lock. If the interrupt handler were to acquire (but not releases) a lock in the meantime, then we'd get confused. However, I can't see how that's possible. I inserted some debugging code to fail an assertion if CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if ImmediateInterruptOK is set on entering the function, and the system still passes regression tests. My second thought is that perhaps a process is occasionally managing to exit without fully cleaning up the associated PROCLOCK entry. At first glance, it appears that this would explain the observed symptoms. A new backend gets the PGPROC belonging to the guy who didn't clean up after himself, hits the error, and disconnects, sticking himself right back on to the head of the SHM_QUEUE where the next connection will inherit the same PGPROC and hit the same problem. But it's not clear to me what could cause the system to get into this state in the first place, or how it would eventually right itself. It might be worth kludging up your system to add a test to InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are either NULL or empty, along the lines of the attached patch (which assumes that assertions are enabled; otherwise, put in an elog() of some sort). Actually, I wonder if we shouldn't move all the SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than doing it over again every time someone calls InitProcess(). Besides being a waste of cycles, it's probably less robust this way. If there somehow are leftovers in one of those queues, the next successful call to LockReleaseAll() ought to clean up the mess, but of course there's no chance of that working if we've nuked the queue pointers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company initprocess-assert.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held
On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote: > > This seems to be bug month for my client. Now there are seeing periods > where all new connections fail immediately with the error: > >FATAL: lock AccessShareLock on object 0/1260/0 is already held > > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has > been up for months. It started happening sporadicly a few days ago. It will > do this for a period of several minutes to an hour and then go back to > normal for hours or days. > > One complete failing session out of several hundred around that time: > - > 2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection > received: host=op05.xxx port=34067 > 2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: > user=apps database=c77 > 2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on > object 0/1260/0 is already held > -- This is to add additional information to the original report: For a while this was happening on many different databases in one postgresql 8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5. That has been quiet for several days and the newest batches of errors have happened on only on a single database 'c23', in a postgresql 9.0.4 instance on a smaller host ('A', 64GB 8cpu) running SuSE 10.2. No memory errors or other misbehaviour have been seen on either of these hosts in recent months. The original error was: lock AccessShareLock on object 0/1260/0 is already held which is for pg_database. The recent errors are: lock AccessShareLock on object 16403/2615/0 is already held which is for pg_namespace in database c23. All of the orginal and most of the recent batchs of errors were immediately after connecting to a database and being authorized, that is, before any statements were attempted. However, some of the most recent are on the first "query" statement. That is after logging in and doing things like "set transaction ... " the first select would hit this error. It seems to come in clusters, sometimes, which suggests something shared by multiple processes. For example, here are the times for the errors on c23 in the afternoon of August 20: 20 07:14:12.722 20 16:05:07.798 20 16:05:07.808 20 16:05:10.519 20 16:07:07.726 20 16:07:08.722 20 16:07:09.734 20 16:07:10.656 20 16:07:25.436 20 16:22:23.983 20 16:22:24.014 20 16:22:24.335 20 16:22:24.409 20 16:22:24.477 20 16:22:24.499 20 16:22:24.516 20 16:30:58.210 20 16:31:15.261 20 16:31:15.296 20 16:31:15.324 20 16:31:15.348 20 18:06:16.515 20 18:06:49.198 20 18:06:49.204 20 18:06:51.444 20 21:03:05.940 So far I've got: - affects system tables - happens very soon after process startup - in 8.4.7 and 9.0.4 - not likely to be hardware or OS related - happens in clusters for period of a few second to many minutes I'll work on printing the LOCK and LOCALLOCK when it happens, but it's hard to get downtime to pick up new builds. Any other ideas on getting to the bottom of this? Thanks -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers