Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-08 Thread daveg
On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
  On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
  BTW ... what were the last versions you were running on which you had
  *not* seen the problem?  (Just wondering about the possibility that we
  back-patched some fix that broke things.  It would be good to have
  a version range before trawling the commit logs.)
 
  The first version we saw it on was 8.4.7.
 
 Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
since shortly after it was released. Prior to that we have had all the major
and most of the minor releases since 7.1.
 
-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-08 Thread Tom Lane
daveg da...@sonic.net writes:
 On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
 The first version we saw it on was 8.4.7.

 Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

 Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
 since shortly after it was released. Prior to that we have had all the major
 and most of the minor releases since 7.1.

Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
I can't find anything that looks like it could possibly be related.
So at this point I'm inclined to think that the bug is older than
that, but your usage patterns changed so that you started to tickle it.

Can you think of any changes in your usage that might date to around
that time, and would somehow be connected to backend startup/shutdown?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
 On Mon, Aug 22, 2011 at 3:31 AM, daveg da...@sonic.net wrote:
  So far I've got:
 
   - affects system tables
   - happens very soon after process startup
   - in 8.4.7 and 9.0.4
   - not likely to be hardware or OS related
   - happens in clusters for period of a few second to many minutes
 
  I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
  hard to get downtime to pick up new builds. Any other ideas on getting to
  the bottom of this?
 
 I've been thinking this one over, and doing a little testing. I'm
 still stumped, but I have a few thoughts.  What that error message is
 really saying is that the LOCALLOCK bookkeeping doesn't match the
 PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
... 
 My second thought is that perhaps a process is occasionally managing
 to exit without fully cleaning up the associated PROCLOCK entry.  At
 first glance, it appears that this would explain the observed
 symptoms.  A new backend gets the PGPROC belonging to the guy who
 didn't clean up after himself, hits the error, and disconnects,
 sticking himself right back on to the head of the SHM_QUEUE where the
 next connection will inherit the same PGPROC and hit the same problem.
  But it's not clear to me what could cause the system to get into this
 state in the first place, or how it would eventually right itself.
 
 It might be worth kludging up your system to add a test to
 InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
 either NULL or empty, along the lines of the attached patch (which
 assumes that assertions are enabled; otherwise, put in an elog() of
 some sort).  Actually, I wonder if we shouldn't move all the
 SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
 doing it over again every time someone calls InitProcess().  Besides
 being a waste of cycles, it's probably less robust this way.   If
 there somehow are leftovers in one of those queues, the next
 successful call to LockReleaseAll() ought to clean up the mess, but of
 course there's no chance of that working if we've nuked the queue
 pointers.

I did this in the elog flavor as we don't build production images with asserts.
It has been running on all hosts for a few days. Today it hit the extra
checks in initproc.

00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 
port=42700
00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 
port=45876
00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: 
queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 
00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres 
database=c35
00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
00:02:32.783  21535  LOG:  terminating any other active server processes
00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of 
crash of another server process

The patch that produced this is attached. If you can think of anything I
can add to this to help I'd be happy to do so. Also, can I clean this up
and continue somehow? Maybe clear the queue instead having to have a restart?
Or is there a way to just pause this proc here, maybe mark it not to be used
and exit, or just to sleep forever so I can debug later?

Thanks

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.
--- postgresql-9.0.4/src/backend/storage/lmgr/proc.c2011-04-14 
20:15:53.0 -0700
+++ postgresql-9.0.4.dg/src/backend/storage/lmgr/proc.c 2011-08-23 
17:30:03.505176019 -0700
@@ -323,7 +323,15 @@
MyProc-waitLock = NULL;
MyProc-waitProcLock = NULL;
for (i = 0; i  NUM_LOCK_PARTITIONS; i++)
+   {
+   SHM_QUEUE *queue = (MyProc-myProcLocks[i]);
+   if (! (!queue-prev || queue-prev == queue ||
+  !queue-next || queue-next == queue)
+   )
+   elog(FATAL, Initprocess myProclocks[%d] not empty: 
queue %p (prev %p, next %p) ,
+   i, queue, queue-prev, queue-next);
SHMQueueInit((MyProc-myProcLocks[i]));
+   }
MyProc-recoveryConflictPending = false;
 
/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 5:16 AM, daveg da...@sonic.net wrote:
 On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
 On Mon, Aug 22, 2011 at 3:31 AM, daveg da...@sonic.net wrote:
  So far I've got:
 
   - affects system tables
   - happens very soon after process startup
   - in 8.4.7 and 9.0.4
   - not likely to be hardware or OS related
   - happens in clusters for period of a few second to many minutes
 
  I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
  hard to get downtime to pick up new builds. Any other ideas on getting to
  the bottom of this?

 I've been thinking this one over, and doing a little testing. I'm
 still stumped, but I have a few thoughts.  What that error message is
 really saying is that the LOCALLOCK bookkeeping doesn't match the
 PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
 ...
 My second thought is that perhaps a process is occasionally managing
 to exit without fully cleaning up the associated PROCLOCK entry.  At
 first glance, it appears that this would explain the observed
 symptoms.  A new backend gets the PGPROC belonging to the guy who
 didn't clean up after himself, hits the error, and disconnects,
 sticking himself right back on to the head of the SHM_QUEUE where the
 next connection will inherit the same PGPROC and hit the same problem.
  But it's not clear to me what could cause the system to get into this
 state in the first place, or how it would eventually right itself.

 It might be worth kludging up your system to add a test to
 InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
 either NULL or empty, along the lines of the attached patch (which
 assumes that assertions are enabled; otherwise, put in an elog() of
 some sort).  Actually, I wonder if we shouldn't move all the
 SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
 doing it over again every time someone calls InitProcess().  Besides
 being a waste of cycles, it's probably less robust this way.   If
 there somehow are leftovers in one of those queues, the next
 successful call to LockReleaseAll() ought to clean up the mess, but of
 course there's no chance of that working if we've nuked the queue
 pointers.

 I did this in the elog flavor as we don't build production images with 
 asserts.
 It has been running on all hosts for a few days. Today it hit the extra
 checks in initproc.

 00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 
 port=42700
 00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 
 port=45876
 00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: 
 queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
 00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres 
 database=c35
 00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
 00:02:32.783  21535  LOG:  terminating any other active server processes
 00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of 
 crash of another server process

 The patch that produced this is attached. If you can think of anything I
 can add to this to help I'd be happy to do so. Also, can I clean this up
 and continue somehow? Maybe clear the queue instead having to have a restart?
 Or is there a way to just pause this proc here, maybe mark it not to be used
 and exit, or just to sleep forever so I can debug later?

I think what we really need to know here is: when the debugging code
you injected here fires, what happened to the previous owner of that
PGPROC that caused it to not clean up its state properly?  The PGPROC
that 8627 inherited in the above example is obviously messed up - but
what did the last guy do that messed it up?  It would be nice to log
the address of the PGPROC in every log message here - then you could
go back and look to see whether the last guy terminated in some
unusual way.  In the meantime, could you could look back a couple of
minutes from the time of the above occurrence and see if there are any
FATAL errors, or usual ERRORs?

After spending some time staring at the code, I do have one idea as to
what might be going on here.  When a backend is terminated,
ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
user locks, and the first one (or so we suppose) releases all normal
locks as part of aborting the transaction.  But what if there's no
transaction in progress?  In that case, AbortOutOfAnyTransaction()
won't do anything - which is fine as far as transaction-level locks
go, because we shouldn't be holding any of those anyway if there's no
transaction in progress.  However, if we hold a session-level lock at
that point, then we'd orphan it.  We don't make much use of session
locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
standby 

Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 After spending some time staring at the code, I do have one idea as to
 what might be going on here.  When a backend is terminated,
 ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
 LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
 user locks, and the first one (or so we suppose) releases all normal
 locks as part of aborting the transaction.  But what if there's no
 transaction in progress?  In that case, AbortOutOfAnyTransaction()
 won't do anything - which is fine as far as transaction-level locks
 go, because we shouldn't be holding any of those anyway if there's no
 transaction in progress.  However, if we hold a session-level lock at
 that point, then we'd orphan it.  We don't make much use of session
 locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
 INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
 standby servers, redo of DROP DATABASE actions.  Any chance one of
 those died or was killed off around the time this happened?

I don't believe this theory at all, because if that were the issue,
we'd have heard about it long since.  The correct theory has to involve
a very-little-used triggering condition.  At the moment I'm wondering
about advisory (userspace) locks ... Dave, do your apps use any of those?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Wed, Sep 07, 2011 at 10:20:24AM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  After spending some time staring at the code, I do have one idea as to
  what might be going on here.  When a backend is terminated,
  ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
  LockReleaseAll(USER_LOCKMETHOD, true).  The second call releases all
  user locks, and the first one (or so we suppose) releases all normal
  locks as part of aborting the transaction.  But what if there's no
  transaction in progress?  In that case, AbortOutOfAnyTransaction()
  won't do anything - which is fine as far as transaction-level locks
  go, because we shouldn't be holding any of those anyway if there's no
  transaction in progress.  However, if we hold a session-level lock at
  that point, then we'd orphan it.  We don't make much use of session
  locks.  As far as I can see, they are used by (1) VACUUM, (2) CREATE
  INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
  standby servers, redo of DROP DATABASE actions.  Any chance one of
  those died or was killed off around the time this happened?
 
 I don't believe this theory at all, because if that were the issue,
 we'd have heard about it long since.  The correct theory has to involve
 a very-little-used triggering condition.  At the moment I'm wondering
 about advisory (userspace) locks ... Dave, do your apps use any of those?

Yes, we make extensive use of advisory locks. That was my thought too when
Robert mentioned session level locks.

I'm happy to add any additional instrumentation, but my client would be
happier to actually run it if there was a way to recover from this without
an unplanned outage. Is there something I can do when the patch detects the
problem to be able to continue without a restart? Is is save to just reset
the proclock queue? I don't think they would mind leaking locks, for instance,
and a later planned restart to clear it up as much as they mind unscheduled
downtime.

Thank

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 4:22 PM, daveg da...@sonic.net wrote:
 Yes, we make extensive use of advisory locks. That was my thought too when
 Robert mentioned session level locks.

 I'm happy to add any additional instrumentation, but my client would be
 happier to actually run it if there was a way to recover from this without
 an unplanned outage. Is there something I can do when the patch detects the
 problem to be able to continue without a restart?

Well, basically, you want to do the same thing that LockRelease()
would do - remove the PROCLOCK from the SHM_QUEUE and delete it from
the hash table, adjusting the counts in the LOCK object as
appropriate.  If you just ignore the failure, you'll get the blah
blah blah is already held messages you were having before.

Tom's right to be skeptical of my theory, because it would require a
CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
pathways that use session-level locks, and I can't find one.

OTOH, I'm skeptical of the theory that this involves userlocks,
because this whole thread started because of a complaint about lock
0/1260/0 already being held.  That ain't no userlock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Tom's right to be skeptical of my theory, because it would require a
 CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
 pathways that use session-level locks, and I can't find one.

More to the point: session-level locks are released on error.  The only
way to get out of a transaction while still holding one is for the
VACUUM-or-whichever-command code to deliberately commit and exit while
still holding it.  An error exit path would release the lock.

 OTOH, I'm skeptical of the theory that this involves userlocks,
 because this whole thread started because of a complaint about lock
 0/1260/0 already being held.  That ain't no userlock.

Yeah, and for that matter it seems to let VACUUM off the hook too.
If we assume that the reported object ID is non-corrupt (and if it's
always the same, that seems like the way to bet) then this is a lock
on pg_authid.

Hmmm ... could the pathway involve an error exit from client
authentication?  We're still finding bugs in the 9.0 rewrite of
auth-time database access.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 4:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, and for that matter it seems to let VACUUM off the hook too.
 If we assume that the reported object ID is non-corrupt (and if it's
 always the same, that seems like the way to bet) then this is a lock
 on pg_authid.

 Hmmm ... could the pathway involve an error exit from client
 authentication?  We're still finding bugs in the 9.0 rewrite of
 auth-time database access.

Well, according to Dave's report upthread, it's not only this one relation:

DG The recent errors are:
DG lock AccessShareLock on object 16403/2615/0 is already held
DG which is for pg_namespace in database c23.

I thought about an error exit from client authentication, and that's a
somewhat appealing explanation, but I can't quite see why we wouldn't
clean up there the same as anywhere else.  The whole mechanism feels a
bit rickety to me - we don't actually release locks; we just abort the
transaction and *assume* that will cause locks to get released.  And
it should.  But there's a bit more action-at-a-distance there than I'd
ideally like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Wed, Sep 07, 2011 at 04:55:24PM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  Tom's right to be skeptical of my theory, because it would require a
  CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
  pathways that use session-level locks, and I can't find one.
 
 More to the point: session-level locks are released on error.  The only
 way to get out of a transaction while still holding one is for the
 VACUUM-or-whichever-command code to deliberately commit and exit while
 still holding it.  An error exit path would release the lock.
 
  OTOH, I'm skeptical of the theory that this involves userlocks,
  because this whole thread started because of a complaint about lock
  0/1260/0 already being held.  That ain't no userlock.
 
 Yeah, and for that matter it seems to let VACUUM off the hook too.
 If we assume that the reported object ID is non-corrupt (and if it's
 always the same, that seems like the way to bet) then this is a lock
 on pg_authid.
 
 Hmmm ... could the pathway involve an error exit from client
 authentication?  We're still finding bugs in the 9.0 rewrite of
 auth-time database access.

It does not seem restricted to pg_authid:

2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on 
object 16403/2615/0 

And I think I've seen it on other tables too.

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I thought about an error exit from client authentication, and that's a
 somewhat appealing explanation, but I can't quite see why we wouldn't
 clean up there the same as anywhere else.  The whole mechanism feels a
 bit rickety to me - we don't actually release locks; we just abort the
 transaction and *assume* that will cause locks to get released.

Well, transaction abort will call LockReleaseAll, which is carefully
coded to clean up the proclock lists regardless of what is in the
locallocks table, so I'm not sure why you find that any more rickety
than anything else.  But maybe it'd be interesting for Dave to stick a
LockReleaseAll call into ProcKill() and see if that makes things better.
(Dave: test that before you put it in production, I'm not totally sure
it's safe.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
daveg da...@sonic.net writes:
 It does not seem restricted to pg_authid:
 2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on 
 object 16403/2615/0 
 And I think I've seen it on other tables too.

Hmm.  2615 = pg_namespace, which most likely is the first catalog
accessed by just about any SQL command that's going to access tables at
all, so I suspect that this is mostly just a the first access failed
thing and not something peculiar to pg_namespace.  But we still don't
have a clue why the locks are not getting released by the previous
owner of the PGPROC slot.  Have you trawled your logs to see if there's
any sign of any distress at all, shortly before the problem starts to
happen?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Wed, Sep 07, 2011 at 06:35:08PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
  It does not seem restricted to pg_authid:
  2011-08-24 18:35:57.445 24987  c23  apps  ERROR:  lock AccessShareLock on 
  object 16403/2615/0 
  And I think I've seen it on other tables too.
 
 Hmm.  2615 = pg_namespace, which most likely is the first catalog
 accessed by just about any SQL command that's going to access tables at
 all, so I suspect that this is mostly just a the first access failed
 thing and not something peculiar to pg_namespace.  But we still don't
 have a clue why the locks are not getting released by the previous
 owner of the PGPROC slot.  Have you trawled your logs to see if there's
 any sign of any distress at all, shortly before the problem starts to
 happen?

Will do, but its a pretty big haystack. Sure wish I knew what the needle
looked like. ;-)

-dg
 
-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  I thought about an error exit from client authentication, and that's a
  somewhat appealing explanation, but I can't quite see why we wouldn't
  clean up there the same as anywhere else.  The whole mechanism feels a
  bit rickety to me - we don't actually release locks; we just abort the
  transaction and *assume* that will cause locks to get released.
 
 Well, transaction abort will call LockReleaseAll, which is carefully
 coded to clean up the proclock lists regardless of what is in the
 locallocks table, so I'm not sure why you find that any more rickety
 than anything else.  But maybe it'd be interesting for Dave to stick a
 LockReleaseAll call into ProcKill() and see if that makes things better.
 (Dave: test that before you put it in production, I'm not totally sure
 it's safe.)

Re safety, what is the worst case here? 

Also, this is very intermittant, we have seen it only in recent months
on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
feels like a few times a month. Possibly some new application behaviour
is provoking it, but I have no guesses as to what.

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
daveg da...@sonic.net writes:
 On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
 ...  But maybe it'd be interesting for Dave to stick a
 LockReleaseAll call into ProcKill() and see if that makes things better.
 (Dave: test that before you put it in production, I'm not totally sure
 it's safe.)

 Re safety, what is the worst case here? 

I think a failure would be pretty obvious --- if it gets through
regression tests it's probably fine.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
daveg da...@sonic.net writes:
 Also, this is very intermittant, we have seen it only in recent months
 on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
 feels like a few times a month. Possibly some new application behaviour
 is provoking it, but I have no guesses as to what.

BTW ... what were the last versions you were running on which you had
*not* seen the problem?  (Just wondering about the possibility that we
back-patched some fix that broke things.  It would be good to have
a version range before trawling the commit logs.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread daveg
On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
  Also, this is very intermittant, we have seen it only in recent months
  on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
  feels like a few times a month. Possibly some new application behaviour
  is provoking it, but I have no guesses as to what.
 
 BTW ... what were the last versions you were running on which you had
 *not* seen the problem?  (Just wondering about the possibility that we
 back-patched some fix that broke things.  It would be good to have
 a version range before trawling the commit logs.)

The first version we saw it on was 8.4.7.

-dg
 
-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Tom Lane
daveg da...@sonic.net writes:
 On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
 BTW ... what were the last versions you were running on which you had
 *not* seen the problem?  (Just wondering about the possibility that we
 back-patched some fix that broke things.  It would be good to have
 a version range before trawling the commit logs.)

 The first version we saw it on was 8.4.7.

Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-07 Thread Robert Haas
On Wed, Sep 7, 2011 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I thought about an error exit from client authentication, and that's a
 somewhat appealing explanation, but I can't quite see why we wouldn't
 clean up there the same as anywhere else.  The whole mechanism feels a
 bit rickety to me - we don't actually release locks; we just abort the
 transaction and *assume* that will cause locks to get released.

 Well, transaction abort will call LockReleaseAll, which is carefully
 coded to clean up the proclock lists regardless of what is in the
 locallocks table, so I'm not sure why you find that any more rickety
 than anything else.

Well, it's very clear that you CAN orphan locks if a backend holding a
session lock ever does CHECK_FOR_INTERRUPTS() outside of a
transaction.  Try the attached patch.

rhaas=# vacuum full foo;
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
The connection to the server was lost. Attempting reset: Succeeded.
rhaas=# vacuum full foo;
ERROR:  lock AccessExclusiveLock on object 16384/1431013/0 is already held

Now, I don't see any evidence of a live bug here (and on further
thought it can't be Dave's bug because he is orphaning
AccessShareLocks, not AccessExclusiveLocks), but I find this pretty
convincing as a demonstration of ricketiness.  It is certainly not
obvious on its face that a misplaced CHECK_FOR_INTERRUPTS() can result
in a backend exiting without cleaning up its locks, and I'd argue its
a bad idea to leave it that way even if there's no user-visible
problem there today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


break-vacuum.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-08-23 Thread Robert Haas
On Mon, Aug 22, 2011 at 3:31 AM, daveg da...@sonic.net wrote:
 So far I've got:

  - affects system tables
  - happens very soon after process startup
  - in 8.4.7 and 9.0.4
  - not likely to be hardware or OS related
  - happens in clusters for period of a few second to many minutes

 I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
 hard to get downtime to pick up new builds. Any other ideas on getting to
 the bottom of this?

I've been thinking this one over, and doing a little testing. I'm
still stumped, but I have a few thoughts.  What that error message is
really saying is that the LOCALLOCK bookkeeping doesn't match the
PROCLOCK bookkeeping; it doesn't tell us which one is to blame.

My first thought was that there might be some situation where
LockAcquireExtended() gets an interrupt between the time it does the
LOCALLOCK lookup and the time it acquires the partition lock.  If the
interrupt handler were to acquire (but not releases) a lock in the
meantime, then we'd get confused.  However, I can't see how that's
possible.  I inserted some debugging code to fail an assertion if
CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
ImmediateInterruptOK is set on entering the function, and the system
still passes regression tests.

My second thought is that perhaps a process is occasionally managing
to exit without fully cleaning up the associated PROCLOCK entry.  At
first glance, it appears that this would explain the observed
symptoms.  A new backend gets the PGPROC belonging to the guy who
didn't clean up after himself, hits the error, and disconnects,
sticking himself right back on to the head of the SHM_QUEUE where the
next connection will inherit the same PGPROC and hit the same problem.
 But it's not clear to me what could cause the system to get into this
state in the first place, or how it would eventually right itself.

It might be worth kludging up your system to add a test to
InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
either NULL or empty, along the lines of the attached patch (which
assumes that assertions are enabled; otherwise, put in an elog() of
some sort).  Actually, I wonder if we shouldn't move all the
SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
doing it over again every time someone calls InitProcess().  Besides
being a waste of cycles, it's probably less robust this way.   If
there somehow are leftovers in one of those queues, the next
successful call to LockReleaseAll() ought to clean up the mess, but of
course there's no chance of that working if we've nuked the queue
pointers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


initprocess-assert.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-08-22 Thread daveg
On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote:
 
 This seems to be bug month for my client. Now there are seeing periods
 where all new connections fail immediately with the error:
 
FATAL:  lock AccessShareLock on object 0/1260/0 is already held 
 
 This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
 been up for months. It started happening sporadicly a few days ago. It will
 do this for a period of several minutes to an hour and then go back to
 normal for hours or days.
 
 One complete failing session out of several hundred around that time:
 -
 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection 
 received: host=op05.xxx port=34067
 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: 
 user=apps database=c77
 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on 
 object 0/1260/0 is already held
 --

This is to add additional information to the original report:

For a while this was happening on many different databases in one postgresql
8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5.
That has been quiet for several days and the newest batches of errors have
happened on only on a single database 'c23', in a postgresql 9.0.4 instance
on a smaller host ('A', 64GB 8cpu) running SuSE 10.2.

No memory errors or other misbehaviour have been seen on either of these
hosts in recent months.

The original error was:

  lock AccessShareLock on object 0/1260/0 is already held

which is for pg_database. The recent errors are:

  lock AccessShareLock on object 16403/2615/0 is already held

which is for pg_namespace in database c23.

All of the orginal and most of the recent batchs of errors were immediately
after connecting to a database and being authorized, that is, before any
statements were attempted. However, some of the most recent are on the first
query statement. That is after logging in and doing things like set
transaction ...  the first select would hit this error.

It seems to come in clusters, sometimes, which suggests something shared
by multiple processes. For example, here are the times for the errors
on c23 in the afternoon of August 20:

20 07:14:12.722

20 16:05:07.798
20 16:05:07.808

20 16:05:10.519

20 16:07:07.726
20 16:07:08.722
20 16:07:09.734
20 16:07:10.656

20 16:07:25.436

20 16:22:23.983
20 16:22:24.014
20 16:22:24.335
20 16:22:24.409
20 16:22:24.477
20 16:22:24.499
20 16:22:24.516

20 16:30:58.210

20 16:31:15.261
20 16:31:15.296
20 16:31:15.324
20 16:31:15.348

20 18:06:16.515

20 18:06:49.198
20 18:06:49.204

20 18:06:51.444

20 21:03:05.940

So far I've got:

  - affects system tables
  - happens very soon after process startup
  - in 8.4.7 and 9.0.4
  - not likely to be hardware or OS related
  - happens in clusters for period of a few second to many minutes

I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
hard to get downtime to pick up new builds. Any other ideas on getting to
the bottom of this?

Thanks

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers