Re: [HACKERS] stuck spinlock
On 12/12/13, 8:45 PM, Tom Lane wrote:
> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
> most systems dump core files with process IDs embedded in the names.

Which systems are those?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut <pete...@gmx.net> wrote:
> On 12/12/13, 8:45 PM, Tom Lane wrote:
>> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
>> most systems dump core files with process IDs embedded in the names.
>
> Which systems are those?

MacOS X dumps core files into /cores/core.$PID, and at least some Linux
systems seem to dump them into ./core.$PID.  I don't know how universal
this is.

I think a bigger objection to the SIGSTOP stuff is that a lot of bugs
are too real-time to ever be meaningfully caught that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] stuck spinlock
On Thu, Dec 26, 2013 at 03:18:23PM -0800, Robert Haas wrote:
> On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut <pete...@gmx.net> wrote:
>> On 12/12/13, 8:45 PM, Tom Lane wrote:
>>> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
>>> most systems dump core files with process IDs embedded in the names.
>>
>> Which systems are those?
>
> MacOS X dumps core files into /cores/core.$PID, and at least some Linux
> systems seem to dump them into ./core.$PID

On Linux it's configurable, and at least on Ubuntu you get this:

$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c

But yes, it can be configured to include the PID in the filename.

Have a nice day,
--
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
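For anyone reproducing this: the PID-in-filename behaviour discussed above is driven on Linux by the kernel.core_pattern sysctl, where %p expands to the crashing PID. A minimal sketch, assuming root access and a systemd-era /etc/sysctl.d layout (the drop-in file name is illustrative):

```shell
# Show the current pattern; on Ubuntu it is typically a pipe into apport:
cat /proc/sys/kernel/core_pattern

# Have the kernel write core files named core.<PID> in the process cwd:
sysctl -w kernel.core_pattern=core.%p

# Persist the setting across reboots (file name is arbitrary):
echo 'kernel.core_pattern = core.%p' > /etc/sysctl.d/60-core-pattern.conf
```

Note that if the pattern begins with `|`, as in the apport case, the core is piped to a helper program instead of being written to a file at all.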
Re: [HACKERS] stuck spinlock
On 2013-12-12 20:45:17 -0500, Tom Lane wrote:
> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
> most systems dump core files with process IDs embedded in the names.
> What would be more useful today is an option to send SIGABRT, or some
> other signal that would force core dumps.  Thoughts?

Although I didn't know of that option, I had thought before that having
it would be useful.  It allows you to inspect the memory of the
individual backends while they are still alive - which allows gdb to
call functions.  That is surely helpful when debugging some issues.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On Mon, Dec 16, 2013 at 6:46 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Hard to say, the issues fixed in the release are quite important as
>> well.  I'd tend to say they are more important.  I think we just need
>> to release 9.3.3 pretty soon.
>
> Yeah.

Has there been any talk about when a 9.3.3 (and/or 9.2.7?) patch might
be released?
Re: [HACKERS] stuck spinlock
On Sat, Dec 14, 2013 at 6:20 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
>> On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> And while we're on the subject ... isn't bgworker_die() utterly and
>>>> completely broken?  That unconditional elog(FATAL) means that no
>>>> process using that handler can do anything remotely interesting,
>>>> like say touch shared memory.
>>>
>>> Yeah, but for the record (since I see I got cc'd here), that's not my
>>> fault.  I moved it into bgworker.c, but it's been like that since
>>> Alvaro's original commit of the bgworker facility
>>> (da07a1e856511dca59cbb1357616e26baa64428e).
>>
>> Is this an edge case or something that will hit a lot of users?
>> Arbitrary server panics seem pretty serious...
>
> Is your question about the bgworker part you're quoting or about the
> stuck spinlock stuff?  I don't think the bgworker bug is too bad in
> practice, but the one in the handle_sig_alarm() stuff certainly is.  I
> think while it looks possible to hit problems without
> statement/lock_timeout, it's relatively unlikely that those are hit in
> practice.

Well, both -- I was just wondering out loud what the severity level of
this issue was.  In particular, is it advisable for the general public
to avoid this release?  My read on this is 'probably'.

merlin
Re: [HACKERS] stuck spinlock
On 2013-12-16 08:36:51 -0600, Merlin Moncure wrote:
> On Sat, Dec 14, 2013 at 6:20 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
>>> Is this an edge case or something that will hit a lot of users?
>>> Arbitrary server panics seem pretty serious...
>>
>> Is your question about the bgworker part you're quoting or about the
>> stuck spinlock stuff?  I don't think the bgworker bug is too bad in
>> practice, but the one in the handle_sig_alarm() stuff certainly is.
>
> Well, both -- I was just wondering out loud what the severity level of
> this issue was.  In particular, is it advisable for the general public
> to avoid this release?  My read on this is 'probably'.

Hard to say, the issues fixed in the release are quite important as
well.  I'd tend to say they are more important.  I think we just need
to release 9.3.3 pretty soon.

The multixact fixes in 9.3.2 weren't complete either... (see recent push)

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> Hard to say, the issues fixed in the release are quite important as
> well.  I'd tend to say they are more important.  I think we just need
> to release 9.3.3 pretty soon.

Yeah.

> The multixact fixes in 9.3.2 weren't complete either... (see recent push)

Are they complete now?

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-16 09:46:19 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> The multixact fixes in 9.3.2 weren't complete either... (see recent push)
>
> Are they complete now?

Hm.  There's two issues I know of left, both discovered in #8673:

- slru.c:SlruScanDirectory() doesn't support long enough filenames.
  Afaics that should be a fairly easy fix.
- multixact/members isn't protected against wraparounds, only
  multixact/offsets is.  That's a pretty longstanding bug though,
  although more likely to be hit these days.

Furthermore there's some missing optimizations (like the useless
multixact generation you noted upon in "Update with subselect sometimes
returns wrong result"), but those shouldn't hold up a release.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-16 09:46:19 -0500, Tom Lane wrote:
>> Are they complete now?
>
> Hm.  There's two issues I know of left, both discovered in #8673:
> - slru.c:SlruScanDirectory() doesn't support long enough filenames.
>   Afaics that should be a fairly easy fix.
> - multixact/members isn't protected against wraparounds, only
>   multixact/offsets is.  That's a pretty longstanding bug though,
>   although more likely to be hit these days.

Actually, isn't this one a must-fix as well?
http://www.postgresql.org/message-id/CAPweHKe5QQ1747X2c0tA=5zf4yns2xcvgf13opd-1mq24rf...@mail.gmail.com

			regards, tom lane
Re: [HACKERS] stuck spinlock
Tom Lane escribió:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Hm.  There's two issues I know of left, both discovered in #8673:
>> - slru.c:SlruScanDirectory() doesn't support long enough filenames.
>>   Afaics that should be a fairly easy fix.
>> - multixact/members isn't protected against wraparounds, only
>>   multixact/offsets is.  That's a pretty longstanding bug though,
>>   although more likely to be hit these days.
>
> Actually, isn't this one a must-fix as well?
> http://www.postgresql.org/message-id/CAPweHKe5QQ1747X2c0tA=5zf4yns2xcvgf13opd-1mq24rf...@mail.gmail.com

Yep, I'm going through that one now.

--
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
> On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> And while we're on the subject ... isn't bgworker_die() utterly and
>>> completely broken?  That unconditional elog(FATAL) means that no
>>> process using that handler can do anything remotely interesting,
>>> like say touch shared memory.
>>
>> Yeah, but for the record (since I see I got cc'd here), that's not my
>> fault.  I moved it into bgworker.c, but it's been like that since
>> Alvaro's original commit of the bgworker facility
>> (da07a1e856511dca59cbb1357616e26baa64428e).
>
> Is this an edge case or something that will hit a lot of users?
> Arbitrary server panics seem pretty serious...

Is your question about the bgworker part you're quoting or about the
stuck spinlock stuff?  I don't think the bgworker bug is too bad in
practice, but the one in the handle_sig_alarm() stuff certainly is.  I
think while it looks possible to hit problems without
statement/lock_timeout, it's relatively unlikely that those are hit in
practice.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Hi,

On 2013-12-13 15:57:14 -0300, Alvaro Herrera wrote:
> If there was a way of raising an #error at compile time whenever a
> worker relies on the existing signal handler, I would vote for doing
> that.  (But then I have no idea how to do such a thing.)

I don't see a way either, given how disconnected registration of the
signal handler is from the bgworker infrastructure.  I think the best
we can do is to raise an error in BackgroundWorkerUnblockSignals() -
and we should definitely do that.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On 2013-12-13 13:39:42 -0500, Robert Haas wrote:
> On Fri, Dec 13, 2013 at 1:15 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> Agreed on not going forward like now, but I don't really see how they
>> could usefully use die().  I think we should just mandate that every
>> bgworker connected to shared memory registers a sigterm handler - we
>> could put a check into BackgroundWorkerUnblockSignals().  We should
>> leave the current handler in for unconnected ones though...
>> bgworkers are supposed to be written as a loop around procLatch, so
>> adding a !got_sigterm probably isn't too hard.
>
> I think the !got_sigterm thing is complete bunk.  If a background
> worker is running SQL queries, it really ought to honor a query cancel
> or sigterm at the next CHECK_FOR_INTERRUPTS().

I am not convinced by the necessity of that, not in general.  After
all, the code is using a bgworker and not a normal backend for a
reason.  If you e.g. have queueing code, it very well might need to
serialize its state to disk before shutting down.  Checking whether the
bgworker should shut down every iteration of the mainloop sounds
appropriate to me for such cases.

But I think we should provide a default handler that does the necessary
things to interrupt queries, so bgworker authors don't have to do it
themselves and, just as importantly, we can more easily add new stuff
there.

> +static void
> +handle_sigterm(SIGNAL_ARGS)
> +{
> +	int		save_errno = errno;
> +
> +	if (MyProc)
> +		SetLatch(&MyProc->procLatch);
> +
> +	if (!proc_exit_inprogress)
> +	{
> +		InterruptPending = true;
> +		ProcDiePending = true;
> +	}
> +
> +	errno = save_errno;
> +}
>
> ...but I'm not 100% sure that's right, either.

If you want a bgworker to behave as closely as possible to a normal
backend, we should probably really do the full dance die() does.
Specifically, call ProcessInterrupts() immediately if
ImmediateInterruptOK allows it; otherwise we'd just continue waiting
for locks and similar.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Hi, On 2013-12-12 19:35:36 -0800, Christophe Pettus wrote: On Dec 12, 2013, at 6:41 PM, Andres Freund and...@2ndquadrant.com wrote: Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before? Interestingly, each spinlock PANIC is *followed*, about one minute later (+/- five seconds) by a canceling statement due to statement timeout on that exact query. The queries vary enough in text that it is unlikely to be a coincidence. There are a *lot* of canceling statement due to statement timeout messages, which is interesting, because: Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5? Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()? ISTM nothing is preventing us from jumping out of code holding a spinlock? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5?
> Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()?
> ISTM nothing is preventing us from jumping out of code holding a
> spinlock?

Hm ... what should stop it is that ImmediateInterruptOK wouldn't be set
while we're messing with any spinlocks.  Except that ProcessInterrupts
doesn't check that gating condition :-(.  I think you're probably right:
what should be in the interrupt handler is something like

	if (ImmediateInterruptOK)
		CHECK_FOR_INTERRUPTS();

			regards, tom lane
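The gate Tom sketches can be illustrated with a standalone mock (plain C, not server source; ImmediateInterruptOK, InterruptPending, QueryCancelPending, and ProcessInterrupts are simulated stand-ins for the backend globals and function of the same names):

```c
#include <stdbool.h>

/* Simulated backend globals (mocks of PostgreSQL's miscadmin.h state). */
static volatile bool ImmediateInterruptOK = false;
static volatile bool InterruptPending = false;
static volatile bool QueryCancelPending = false;
static int interrupts_processed = 0;

/* Mock of ProcessInterrupts(): service a pending query cancel. */
static void ProcessInterrupts(void)
{
	if (QueryCancelPending)
	{
		QueryCancelPending = false;
		InterruptPending = false;
		interrupts_processed++;	/* the real backend would ereport(ERROR) */
	}
}

/* Mock of the repaired timeout handler: note a pending cancel, but only
 * jump into interrupt processing when the mainline said it is safe. */
static void handle_sig_alarm_mock(void)
{
	QueryCancelPending = true;
	InterruptPending = true;
	if (ImmediateInterruptOK)	/* the gate under discussion */
		ProcessInterrupts();
}

/* Timeout fires while (conceptually) a spinlock is held: nothing happens
 * immediately; the cancel is serviced at the next explicit check.
 * Returns the number of interrupts actually processed. */
static int run_timeout_demo(void)
{
	ImmediateInterruptOK = false;	/* e.g. inside a spinlock section */
	handle_sig_alarm_mock();	/* only the pending flags get set */

	ImmediateInterruptOK = true;	/* back at a safe point */
	if (InterruptPending)
		ProcessInterrupts();	/* mock of CHECK_FOR_INTERRUPTS() */
	return interrupts_processed;
}
```

The point is that the handler still records the cancel; servicing it is merely deferred until the mainline reaches a point where ImmediateInterruptOK is true, rather than longjmp'ing out from under a held spinlock.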
Re: [HACKERS] stuck spinlock
On 2013-12-13 09:52:06 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5?
>> Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()?
>> ISTM nothing is preventing us from jumping out of code holding a
>> spinlock?
>
> Hm ... what should stop it is that ImmediateInterruptOK wouldn't be set
> while we're messing with any spinlocks.  Except that ProcessInterrupts
> doesn't check that gating condition :-(.

It really can't, right?  Otherwise explicit CHECK_FOR_INTERRUPTS()s in
normal code wouldn't do much anymore, since ImmediateInterruptOK is so
seldom set.  The control flow around signal handling always drives me
crazy.

> I think you're probably right: what should be in the interrupt handler
> is something like
>	if (ImmediateInterruptOK)
>		CHECK_FOR_INTERRUPTS();

Yea, that sounds right.  Or just don't process interrupts there; it
doesn't seem to be required for correctness?

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-13 09:52:06 -0500, Tom Lane wrote:
>> I think you're probably right: what should be in the interrupt handler
>> is something like
>>	if (ImmediateInterruptOK)
>>		CHECK_FOR_INTERRUPTS();
>
> Yea, that sounds right.  Or just don't process interrupts there; it
> doesn't seem to be required for correctness?

It is if we need to break out of a wait-for-lock ...

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 10:30:48 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Yea, that sounds right.  Or just don't process interrupts there; it
>> doesn't seem to be required for correctness?
>
> It is if we need to break out of a wait-for-lock ...

Right, that uses MyProc->sem and not MyProc->procLatch...

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On closer inspection, I'm thinking that actually it'd be a good idea if
handle_sig_alarm did what we do in, for example, HandleCatchupInterrupt:
it should save, clear, and restore ImmediateInterruptOK, so as to make
the world safe for timeout handlers to do things that might include a
CHECK_FOR_INTERRUPTS.

And while we're on the subject ... isn't bgworker_die() utterly and
completely broken?  That unconditional elog(FATAL) means that no process
using that handler can do anything remotely interesting, like say touch
shared memory.

I didn't find any other similar hazards in a quick look through all our
signal handlers.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 11:26:44 -0500, Tom Lane wrote:
> On closer inspection, I'm thinking that actually it'd be a good idea if
> handle_sig_alarm did what we do in, for example, HandleCatchupInterrupt:
> it should save, clear, and restore ImmediateInterruptOK, so as to make
> the world safe for timeout handlers to do things that might include a
> CHECK_FOR_INTERRUPTS.

Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
eventual ProcessInterrupts() in the timeout handlers from doing
anything harmful?  Even if so, making sure ImmediateInterruptOK is
preserved seems worthwhile anyway.

> And while we're on the subject ... isn't bgworker_die() utterly and
> completely broken?  That unconditional elog(FATAL) means that no
> process using that handler can do anything remotely interesting, like
> say touch shared memory.

Yes, looks broken to me.

> I didn't find any other similar hazards in a quick look through all
> our signal handlers.

One thing I randomly noticed just now is the following in
RecoveryConflictInterrupt():

	elog(FATAL, "unrecognized conflict mode: %d", (int) reason);

Obviously that's not really ever going to hit, but it should either be
a PANIC or an Assert() for the reasons you cite.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Christophe Pettus <x...@thebuild.com> writes:
> Yes, that's what is happening there (I had to check with the client's
> developers).  It's possible that the one-minute repeat is due to the
> application reissuing the query, rather than specifically related to
> the spinlock issue.  What this does reveal is that all the spinlock
> issues have been on long-running queries, for what it is worth.

Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see if
that doesn't fix it for you.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 8:52 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see
> if that doesn't fix it for you.

Great, thanks.  Would the statement_timeout firing invoke this path?
(I'm wondering why this particular installation was experiencing this.)

--
Christophe Pettus
x...@thebuild.com
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-13 11:26:44 -0500, Tom Lane wrote:
>> On closer inspection, I'm thinking that actually it'd be a good idea
>> if handle_sig_alarm did what we do in, for example,
>> HandleCatchupInterrupt: it should save, clear, and restore
>> ImmediateInterruptOK, so as to make the world safe for timeout
>> handlers to do things that might include a CHECK_FOR_INTERRUPTS.
>
> Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
> eventual ProcessInterrupts() in the timeout handlers from doing
> anything harmful?

Sorry, I misspoke there.  The case I'm worried about is doing something
like a wait for lock, which would unconditionally set and then reset
ImmediateInterruptOK.  That's not very plausible perhaps, but on the
other hand we are calling DeadLockCheck() in there, and who knows what
future timeout handlers might try to do?

BTW, I'm about to go put a HOLD_INTERRUPTS/RESUME_INTERRUPTS into
HandleCatchupInterrupt and HandleNotifyInterrupt too, for essentially
the same reason.  At least the first of these *does* include semaphore
ops, so I think it's theoretically vulnerable to losing control if a
timeout occurs while it's waiting for a semaphore.  There's probably no
real bug today because I don't think we enable catchup interrupts at
any point where a timeout would be active, but that doesn't sound
terribly future-proof.  If a timeout did happen, holding off interrupts
would have the effect of postponing the query cancel till we're done
with the catchup interrupt, which seems reasonable.

> One thing I randomly noticed just now is the following in
> RecoveryConflictInterrupt():
>	elog(FATAL, "unrecognized conflict mode: %d", (int) reason);
> Obviously that's not really ever going to hit, but it should either be
> a PANIC or an Assert() for the reasons you cite.

Yeah, PANIC there seems good.

I also thought about using START_CRIT_SECTION/END_CRIT_SECTION instead
of HOLD_INTERRUPTS/RESUME_INTERRUPTS in these signal handlers.  That
would both hold off interrupts and cause any elog(ERROR/FATAL) within
the handler to be promoted to PANIC.  But I'm not sure that'd be a net
stability improvement...

			regards, tom lane
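The save/clear/restore discipline being proposed can be sketched as a standalone mock (plain C; the holdoff counter and flags imitate the shape of PostgreSQL's miscadmin.h machinery, but nothing here is server source, and handle_sig_alarm_mock / deadlock_check_mock are illustrative names):

```c
#include <stdbool.h>

static volatile bool ImmediateInterruptOK = false;
static volatile int InterruptHoldoffCount = 0;	/* mock holdoff counter */

#define HOLD_INTERRUPTS()	(InterruptHoldoffCount++)
#define RESUME_INTERRUPTS()	(--InterruptHoldoffCount)

/* A timeout callback that may internally check for interrupts;
 * simulated as a no-op here (the real case would be DeadLockCheck()). */
static void deadlock_check_mock(void) { }

static bool observed_ok_inside_handler;

/* Mock timeout handler following the HandleCatchupInterrupt pattern:
 * save, clear, and restore ImmediateInterruptOK around the real work,
 * with interrupts held off for the duration. */
static void handle_sig_alarm_mock(void)
{
	bool save_ImmediateInterruptOK = ImmediateInterruptOK;

	ImmediateInterruptOK = false;	/* handler must not longjmp away */
	HOLD_INTERRUPTS();

	observed_ok_inside_handler = ImmediateInterruptOK;
	deadlock_check_mock();

	RESUME_INTERRUPTS();
	ImmediateInterruptOK = save_ImmediateInterruptOK;
}

/* Returns true iff the flag is forced off inside the handler yet
 * survives the handler invocation unchanged, with holdoffs balanced. */
static bool save_restore_demo(void)
{
	ImmediateInterruptOK = true;	/* e.g. blocked on a semaphore */
	handle_sig_alarm_mock();
	return ImmediateInterruptOK && !observed_ok_inside_handler
		&& InterruptHoldoffCount == 0;
}
```

The save/restore (rather than unconditionally setting the flag false and true again) is what makes the handler safe to run no matter what state the interrupted mainline was in.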
Re: [HACKERS] stuck spinlock
Christophe Pettus <x...@thebuild.com> writes:
> On Dec 13, 2013, at 8:52 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see
>> if that doesn't fix it for you.
>
> Great, thanks.  Would the statement_timeout firing invoke this path?
> (I'm wondering why this particular installation was experiencing this.)

Yeah, the problem is that either statement_timeout or lock_timeout
could cause control to be taken away from code that thinks it's
straight-line code and so doesn't have provision for getting cleaned up
at transaction abort.  Spinlocks certainly fall in that category.  I'm
afraid other weird failures are possible, though I'm not sure what.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 12:19:56 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
>> eventual ProcessInterrupts() in the timeout handlers from doing
>> anything harmful?
>
> Sorry, I misspoke there.  The case I'm worried about is doing
> something like a wait for lock, which would unconditionally set and
> then reset ImmediateInterruptOK.

I sure hope we're not going to introduce more paths that do this, but I
am not going to bet on it...  I remember trying to understand why the
deadlock detector is safe doing as it does when I was all green and was
trying to understand the HS patch, and it drove me nuts.

> BTW, I'm about to go put a HOLD_INTERRUPTS/RESUME_INTERRUPTS into
> HandleCatchupInterrupt and HandleNotifyInterrupt too, for essentially
> the same reason.

Sounds good; both already do a ProcessInterrupts() at their end, so the
holdoff shouldn't lead to absorbed interrupts.

I wonder what to do about bgworker's bgworker_die()?  I don't really
see how that can be fixed without breaking the API?

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> I wonder what to do about bgworker's bgworker_die()?  I don't really
> see how that can be fixed without breaking the API?

IMO it should be flushed and bgworkers should use the same die()
handler as every other backend, or else one like the one in worker_spi,
which just sets a flag for testing later.  If we try to change the
signal handling contracts, 80% of backend code will be unusable in
bgworkers, which is not where we want to be I think.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 12:54:09 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> I wonder what to do about bgworker's bgworker_die()?  I don't really
>> see how that can be fixed without breaking the API?
>
> IMO it should be flushed and bgworkers should use the same die()
> handler as every other backend, or else one like the one in
> worker_spi, which just sets a flag for testing later.

Agreed on not going forward like now, but I don't really see how they
could usefully use die().  I think we should just mandate that every
bgworker connected to shared memory registers a sigterm handler - we
could put a check into BackgroundWorkerUnblockSignals().  We should
leave the current handler in for unconnected ones though...  bgworkers
are supposed to be written as a loop around procLatch, so adding a
!got_sigterm probably isn't too hard.

It sucks that people might have bgworkers out there that don't register
their own sigterm handlers, but adding a sigterm handler will be
backward compatible and it's in the example bgworker, so it's probably
not too bad.

> If we try to change the signal handling contracts, 80% of backend code
> will be unusable in bgworkers, which is not where we want to be I
> think.

Yea, I think that's out of the question.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> And while we're on the subject ... isn't bgworker_die() utterly and
> completely broken?  That unconditional elog(FATAL) means that no
> process using that handler can do anything remotely interesting, like
> say touch shared memory.

Yeah, but for the record (since I see I got cc'd here), that's not my
fault.  I moved it into bgworker.c, but it's been like that since
Alvaro's original commit of the bgworker facility
(da07a1e856511dca59cbb1357616e26baa64428e).

While I was developing the shared memory message queueing stuff, I
experimented with using die() as the signal handler and didn't have
very good luck.  I can't remember exactly what wasn't working any more,
though.  I agree that it would be good if we can make that work.  Right
now we've got other modules growing warts like
WalRcvImmediateInterruptOK, which doesn't seem good.

It seems to me that we should change every place that temporarily
changes ImmediateInterruptOK to restore the original value instead of
making assumptions about what it must have been.
ClientAuthentication(), md5_crypt_verify(), PGSemaphoreLock() and
WalSndLoop() all have this disease.  I also really wonder if notify and
catchup interrupts ought to be taught to respect ImmediateInterruptOK,
instead of having their own switches for the same thing.  Right now
there are an awful lot of places that do this:

	ImmediateInterruptOK = false;	/* not idle anymore */
	DisableNotifyInterrupt();
	DisableCatchupInterrupt();

...and that doesn't seem like a good thing.  Heaven forfend someone
were to do only two out of the three.

--
Robert Haas
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 1:15 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-12-13 12:54:09 -0500, Tom Lane wrote:
>> IMO it should be flushed and bgworkers should use the same die()
>> handler as every other backend, or else one like the one in
>> worker_spi, which just sets a flag for testing later.
>
> Agreed on not going forward like now, but I don't really see how they
> could usefully use die().  I think we should just mandate that every
> bgworker connected to shared memory registers a sigterm handler - we
> could put a check into BackgroundWorkerUnblockSignals().  We should
> leave the current handler in for unconnected ones though...  bgworkers
> are supposed to be written as a loop around procLatch, so adding a
> !got_sigterm probably isn't too hard.

I think the !got_sigterm thing is complete bunk.  If a background
worker is running SQL queries, it really ought to honor a query cancel
or sigterm at the next CHECK_FOR_INTERRUPTS().  But the default
background worker handler for SIGUSR1 just sets the process latch, and
worker_spi's sigterm handler just sets a private variable,
got_sigterm.  So ProcessInterrupts() will never get called, and if it
did it wouldn't do anything anyway.  That's really pretty horrible,
because it means that the query worker_spi runs can't be interrupted
short of a SIGQUIT.  So I think worker_spi is really a very bad example
of how to do this right.

In the as-yet-uncommitted test-shm-mq-v1.patch, I did this:

+static void
+handle_sigterm(SIGNAL_ARGS)
+{
+	int		save_errno = errno;
+
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}

...but I'm not 100% sure that's right, either.

--
Robert Haas
Re: [HACKERS] stuck spinlock
Robert Haas escribió: On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane t...@sss.pgh.pa.us wrote: And while we're on the subject ... isn't bgworker_die() utterly and completely broken? That unconditional elog(FATAL) means that no process using that handler can do anything remotely interesting, like say touch shared memory. Yeah, but for the record (since I see I got cc'd here), that's not my fault. I moved it into bgworker.c, but it's been like that since Alvaro's original commit of the bgworker facility (da07a1e856511dca59cbb1357616e26baa64428e). I see the blame falls on me ;-) I reckon I blindly copied this stuff from elsewhere without thinking very much about it. As noted upthread, even the example code uses a different handler for SIGTERM. There wasn't much else that we could do; simply letting the generic code run without any SIGTERM handler installed didn't seem the right thing to do. (You probably recall that the business of starting workers with signals blocked was installed later.) I found a few workers on GitHub in a quick search:
https://github.com/umitanuki/mongres/blob/master/mongres.c
https://github.com/markwkm/pg_httpd/blob/master/pg_httpd.c
https://github.com/ibarwick/config_log/blob/master/config_log.c
https://github.com/gleu/stats_recorder/blob/master/stats_recorder_spi.c
https://github.com/michaelpq/pg_workers/blob/master/kill_idle/kill_idle.c
https://github.com/le0pard/pg_web/blob/master/src/pg_web.c
Not a single one of them uses bgworker_die() -- they all follow worker_spi's lead of setting a got_sigterm flag and SetLatch(). If there were a way to raise an #error at compile time whenever a worker relies on the existing signal handler, I would vote for doing that. (But then I have no idea how to do such a thing.) 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Robert Haas robertmh...@gmail.com writes: It seems to me that we should change every place that temporarily changes ImmediateInterruptOK to restore the original value instead of making assumptions about what it must have been. No, that's backwards. The problem isn't that it could be sane to enter, say, PGSemaphoreLock with ImmediateInterruptOK already true; to get there, you'd have had to pass through boatloads of code in which it patently isn't safe for that to be the case. Rather, the problem is that once you get there it might *still* be unsafe to throw an error. HOLD/RESUME_INTERRUPTS are designed to handle exactly that problem. The only other way we could handle it would be if every path from (say) HandleNotifyInterrupt down to PGSemaphoreLock passed a bool flag to tell it don't turn on ImmediateInterruptOK; which is pretty unworkable. I also really wonder if notify and catchup interrupts ought to be taught to respect ImmediateInterruptOK, instead of having their own switches for the same thing. They're not switches for the same thing though; the effects are different, and in fact there are places that do and should flip only some of these, PGSemaphoreLock being just the most obvious one. I agree that it might be possible to simplify things, but it would take more thought than you seem to have put into it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
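A stand-alone analog of the HOLD_INTERRUPTS/RESUME_INTERRUPTS mechanism Tom refers to: a nesting counter that defers acting on a pending interrupt until the holdoff count drops back to zero. The variable names mirror PostgreSQL's, but this is a sketch, not the real macros; in particular, the real RESUME_INTERRUPTS does not itself service interrupts — that happens at the next CHECK_FOR_INTERRUPTS().

```c
/* Simplified model: InterruptPending is set asynchronously (by a signal
 * handler in the real system); check_for_interrupts() only acts on it
 * when no critical section currently holds interrupts off. */
static volatile int InterruptHoldoffCount = 0;
static volatile int InterruptPending = 0;
static int interrupts_serviced = 0;

static void hold_interrupts(void)   { InterruptHoldoffCount++; }
static void resume_interrupts(void) { InterruptHoldoffCount--; }

static void
check_for_interrupts(void)
{
    if (InterruptPending && InterruptHoldoffCount == 0)
    {
        InterruptPending = 0;
        interrupts_serviced++;   /* the real code would ereport(ERROR) here */
    }
}
```

The point of the counter, as opposed to a boolean, is that nested hold/resume pairs compose: an inner RESUME cannot accidentally re-enable interrupts while an outer section still needs them held off.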
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane t...@sss.pgh.pa.us wrote: And while we're on the subject ... isn't bgworker_die() utterly and completely broken? That unconditional elog(FATAL) means that no process using that handler can do anything remotely interesting, like say touch shared memory. Yeah, but for the record (since I see I got cc'd here), that's not my fault. I moved it into bgworker.c, but it's been like that since Alvaro's original commit of the bgworker facility (da07a1e856511dca59cbb1357616e26baa64428e). Is this an edge case or something that will hit a lot of users? Arbitrary server panics seem pretty serious... merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 1:49 PM, Merlin Moncure mmonc...@gmail.com wrote: Is this an edge case or something that will hit a lot of users? My understanding (Tom can correct me if I'm wrong, I'm sure) is that it is an issue for servers on 9.3.2 where there are a lot of query cancellations due to facilities like statement_timeout or lock_timeout that cancel a query asynchronously. I assume pg_cancel_backend() would apply as well. We've only seen it on one client, and that client had a *lot* (thousands on thousands) of statement_timeout cancellations. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 8:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see if that doesn't fix it for you. It appears to fix it. Thanks! -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] stuck spinlock
Greetings, Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted
uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux. Generally, there's no core file (which is currently enabled), as the postmaster just normally exits the backend. Diagnosis suggestions? -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux. Generally, there's no core file (which is currently enable), as the postmaster just normally exits the backend. Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? As for the root cause, it's hard to say. The file/line number says it's a buffer header lock that's stuck. I rechecked all the places that lock buffer headers, and all of them have very short code paths to the corresponding unlock, so there's no obvious explanation how this could happen. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
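For reference, the "stuck spinlock" PANIC comes from s_lock() giving up after a bounded amount of spinning and calling s_lock_stuck(). A stand-alone analog using C11 atomics, with an illustrative iteration cap — the real s_lock() inserts delays and uses a much larger budget before declaring the lock stuck:

```c
#include <stdatomic.h>

enum { SPINS_BEFORE_STUCK = 1000 };   /* illustrative, not PostgreSQL's tuning */

/* Try to take a test-and-set lock; report -1 ("stuck spinlock") if the
 * current holder never releases it within the spin budget. */
static int
spin_lock_or_stuck(atomic_flag *lock)
{
    for (int i = 0; i < SPINS_BEFORE_STUCK; i++)
    {
        if (!atomic_flag_test_and_set(lock))
            return 0;                 /* acquired */
    }
    return -1;                        /* the real server PANICs here */
}
```

This is why Tom's point matters: since every legitimate holder releases a buffer header spinlock after only a few instructions, a timeout can only fire if the holder died, hung, or got descheduled for a very long time while holding it.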
Re: [HACKERS] stuck spinlock
On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted Any other changes but the upgrade? Maybe a different compiler version? Also, could you share some details about the workload? Highly concurrent? Standby? ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 3:33 PM, Andres Freund and...@2ndquadrant.com wrote: Any other changes but the upgrade? Maybe a different compiler version? Show pg_config output. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:37 PM, Peter Geoghegan p...@heroku.com wrote: Show pg_config output. Below; it's the Ubuntu package.
BINDIR = /usr/lib/postgresql/9.3/bin
DOCDIR = /usr/share/doc/postgresql-doc-9.3
HTMLDIR = /usr/share/doc/postgresql-doc-9.3
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/9.3/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql/9.3/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.3/man
SHAREDIR = /usr/share/postgresql/9.3
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/9.3/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/tcl8.5' '--with-tkconfig=/usr/lib/tk8.5' '--with-includes=/usr/include/tcl8.5' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.3/man' '--docdir=/usr/share/doc/postgresql-doc-9.3' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.3' '--bindir=/usr/lib/postgresql/9.3/bin' '--libdir=/usr/lib/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-debug' '--disable-rpath' '--with-ossp-uuid' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5
CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5 -L/usr/lib/x86_64-linux-gnu -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lpam -lssl -lcrypto -lkrb5 -lcom_err -lgssapi_krb5 -lz -ledit -lcrypt -ldl -lm
VERSION = PostgreSQL 9.3.2
-- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:33 PM, Andres Freund and...@2ndquadrant.com wrote: Any other changes but the upgrade? Maybe a different compiler version? Just the upgrade; they're using the Ubuntu packages from apt.postgresql.org. Also, could you share some details about the workload? Highly concurrent? Standby? ... The workload is not very highly concurrent; actually quite lightly loaded. There are a very large number (442,000) of user tables. No standby attached. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf. As for the root cause, it's hard to say. The file/line number says it's a buffer header lock that's stuck. I rechecked all the places that lock buffer headers, and all of them have very short code paths to the corresponding unlock, so there's no obvious explanation how this could happen. The server was running with shared_buffers=100GB, but the problem has reoccurred now with shared_buffers=16GB. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf. [ shrug... ] If you aren't getting a core file for a PANIC, then core files are disabled. I take no position on the value of the setting you mention above, but I will note that pg_ctl can't override a hard ulimit -c 0 system-wide setting. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
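Whether a PANIC can actually produce a core file is governed by RLIMIT_CORE, which each backend inherits from the postmaster's environment. A minimal check (the helper name core_limit_bytes is invented for illustration):

```c
#include <sys/resource.h>

/* Return the effective soft core-file size limit in bytes:
 *    0  means core dumps are disabled,
 *   -1  means the limit could not be read,
 *   -2  means unlimited. */
static long long
core_limit_bytes(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;
    return (long long) rl.rlim_cur;
}
```

On the shell side this is `ulimit -c`; note that, as Tom says, a soft limit raised by pg_ctl cannot exceed a hard limit of 0 set system-wide, and on Linux the destination filename additionally depends on /proc/sys/kernel/core_pattern.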
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: If you aren't getting a core file for a PANIC, then core files are disabled. And just like that, we get one. Stack trace:
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f699c81991b in errfinish ()
#3 0x7f699c81a477 in elog_finish ()
#4 0x7f699c735db3 in s_lock ()
#5 0x7f699c71e1f0 in ?? ()
#6 0x7f699c71eaf9 in ?? ()
#7 0x7f699c71f53e in ReadBufferExtended ()
#8 0x7f699c56d03a in index_fetch_heap ()
#9 0x7f699c67a0b7 in ?? ()
#10 0x7f699c66e98e in ExecScan ()
#11 0x7f699c6679a8 in ExecProcNode ()
#12 0x7f699c67407f in ExecAgg ()
#13 0x7f699c6678b8 in ExecProcNode ()
#14 0x7f699c664dd2 in standard_ExecutorRun ()
#15 0x7f6996ad928d in ?? () from /usr/lib/postgresql/9.3/lib/auto_explain.so
#16 0x7f69968d3525 in ?? () from /usr/lib/postgresql/9.3/lib/pg_stat_statements.so
#17 0x7f699c745207 in ?? ()
#18 0x7f699c746651 in PortalRun ()
#19 0x7f699c742960 in PostgresMain ()
#20 0x7f699c6ff765 in PostmasterMain ()
#21 0x7f699c53bea2 in main ()
-- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On 2013-12-12 16:22:28 -0800, Christophe Pettus wrote: On Dec 12, 2013, at 4:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: If you aren't getting a core file for a PANIC, then core files are disabled. And just like that, we get one. Stack trace: #0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x7f699c81991b in errfinish () #3 0x7f699c81a477 in elog_finish () #4 0x7f699c735db3 in s_lock () #5 0x7f699c71e1f0 in ?? () #6 0x7f699c71eaf9 in ?? () Could you install the -dbg package and regenerate? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Of course!
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f699c81991b in errfinish (dummy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x7f699c81a477 in elog_finish (elevel=optimized out, fmt=0x7f699c937a48 stuck spinlock (%p) detected at %s:%d) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x7f699c735db3 in s_lock_stuck (line=1099, file=0x7f699c934a78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, lock=0x7f6585e2cbb4 \001) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f6585e2cbb4 \001, file=0x7f699c934a78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, line=1099) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x7f699c71e1f0 in PinBuffer (buf=0x7f6585e2cb94, strategy=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x7f699c71eaf9 in BufferAlloc (foundPtr=0x7fff60ec563e , strategy=0x0, blockNum=1730, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=optimized out, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1730, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff60ec56af ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x7f699c71f53e in ReadBufferExtended (reln=0x7f6577d80560, forkNum=MAIN_FORKNUM, blockNum=1730, mode=optimized out, strategy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7f699c56d03a in index_fetch_heap (scan=0x7f699f94c7a0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/index/indexam.c:515
#11 0x7f699c67a0b7 in IndexOnlyNext (node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeIndexonlyscan.c:109
#12 0x7f699c66e98e in ExecScanFetch (recheckMtd=0x7f699c679fb0 IndexOnlyRecheck, accessMtd=0x7f699c679fe0 IndexOnlyNext, node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#13 ExecScan (node=0x7f699f94b690, accessMtd=0x7f699c679fe0 IndexOnlyNext, recheckMtd=0x7f699c679fb0 IndexOnlyRecheck) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#14 0x7f699c6679a8 in ExecProcNode (node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:408
#15 0x7f699c67407f in agg_retrieve_direct (aggstate=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#16 ExecAgg (node=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#17 0x7f699c6678b8 in ExecProcNode (node=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#18 0x7f699c664dd2 in ExecutePlan (dest=0x7f699f98c308, direction=optimized out, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x7f699f94af90, estate=0x7f699f94ae80) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#19 standard_ExecutorRun (queryDesc=0x7f699f940dc0, direction=optimized out, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#20 0x7f6996ad928d in explain_ExecutorRun (queryDesc=0x7f699f940dc0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#21 0x7f69968d3525 in pgss_ExecutorRun (queryDesc=0x7f699f940dc0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#22 0x7f699c745207 in PortalRunSelect (portal=0x7f699de596a0, forward=optimized out, count=0, dest=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#23 0x7f699c746651 in PortalRun (portal=0x7f699de596a0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f699f98c308, altdest=0x7f699f98c308, completionTag=0x7fff60ec5f30 ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#24 0x7f699c742960 in exec_simple_query (query_string=0x7f699dd564a0 SELECT COUNT(*) FROM \signups\ WHERE (signups.is_supporter = true)) at
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Here's another, same system, different crash:
#0 0x7fa03faf5425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7fa03faf8b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7fa041e1491b in errfinish (dummy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x7fa041e15477 in elog_finish (elevel=optimized out, fmt=0x7fa041f32a48 stuck spinlock (%p) detected at %s:%d) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x7fa041d30db3 in s_lock_stuck (line=1099, file=0x7fa041f2fa78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, lock=0x7f9c2acb2ac8 \001) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f9c2acb2ac8 \001, file=0x7fa041f2fa78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, line=1099) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x7fa041d191f0 in PinBuffer (buf=0x7f9c2acb2aa8, strategy=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x7fa041d19af9 in BufferAlloc (foundPtr=0x7fff1948963e \001, strategy=0x0, blockNum=8796, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=optimized out, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=8796, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff194896af ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x7fa041d1a53e in ReadBufferExtended (reln=0x7f9c1edd4908, forkNum=MAIN_FORKNUM, blockNum=8796, mode=optimized out, strategy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7fa041b5a706 in heapgetpage (scan=0x7fa043389050, page=8796) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:332
#11 0x7fa041b5ac12 in heapgettup_pagemode (scan=0x7fa043389050, dir=optimized out, nkeys=0, key=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:939
#12 0x7fa041b5bf76 in heap_getnext (scan=0x7fa043389050, direction=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:1459
#13 0x7fa041c7a9eb in SeqNext (node=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeSeqscan.c:66
#14 0x7fa041c6998e in ExecScanFetch (recheckMtd=0x7fa041c7a9b0 SeqRecheck, accessMtd=0x7fa041c7a9c0 SeqNext, node=0x7fa0440f1c10) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#15 ExecScan (node=0x7fa0440f1c10, accessMtd=0x7fa041c7a9c0 SeqNext, recheckMtd=0x7fa041c7a9b0 SeqRecheck) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#16 0x7fa041c629c8 in ExecProcNode (node=0x7fa0440f1c10) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:400
#17 0x7fa041c6f07f in agg_retrieve_direct (aggstate=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#18 ExecAgg (node=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#19 0x7fa041c628b8 in ExecProcNode (node=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#20 0x7fa041c5fdd2 in ExecutePlan (dest=0x7fa042a955e0, direction=optimized out, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x7fa0440f1510, estate=0x7fa0440f1400) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#21 standard_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=optimized out, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#22 0x7fa03c0d428d in explain_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#23 0x7fa03bece525 in pgss_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#24 0x7fa041d40207 in PortalRunSelect (portal=0x7fa0427061f0, forward=optimized out, count=0, dest=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#25 0x7fa041d41651 in PortalRun (portal=0x7fa0427061f0, count=9223372036854775807, isTopLevel=1
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Here's another, same system, different crash: Both of these look like absolutely run-of-the-mill buffer access attempts. Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. Whoever is holding the spinlock is just going down with the rest of the system ... In a devel environment, I'd try using the postmaster's -T switch so that it SIGSTOP's all the backends instead of SIGQUIT'ing them, and then I'd run around and gdb all the other backends to try to see which one was holding the spinlock and why. Unfortunately, that's probably not practical in a production environment; it'd take too long to collect the stack traces by hand. So I have no good ideas about how to debug this, unless you can reproduce it on a devel box, or are willing to run modified executables in production. Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that most systems dump core files with process IDs embedded in the names. What would be more useful today is an option to send SIGABRT, or some other signal that would force core dumps. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations? -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations? It's not a buffer *pin* that's at issue, it's a buffer header spinlock. And there are no loops, of any sort, that are executed while holding such a spinlock. At least not in the core PG code. Are you possibly using any nonstandard extensions? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Are you possibly using any nonstandard extensions? No, totally stock PostgreSQL. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Hi, On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: Is it really a regular pattern like hourly? What's your checkpoint_segments? Could you, arround the time of a crash, check grep Dirt /proc/meminfo and run iostat -xm 1 20? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that most systems dump core files with process IDs embedded in the names. What would be more useful today is an option to send SIGABRT, or some other signal that would force core dumps. Thoughts? I think it would be possible, at least on Linux, to have GDB connect to the postmaster, and then automatically create new inferiors as new backends are forked, and then have every inferior paused as breakpoints are hit. See: http://sourceware.org/gdb/onlinedocs/gdb/Forks.html and http://sourceware.org/gdb/onlinedocs/gdb/All_002dStop-Mode.html (I think the word 'thread' is just a shorthand for 'inferior' on the all-stop mode doc page, and you can definitely debug Postgres processes in multiple inferiors today). Now, I'm not sure how feasible this is in a production debugging situation. It seems like an interesting way of debugging these sorts of issues that should be explored and perhaps subsequently codified. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On 2013-12-12 21:15:29 -0500, Tom Lane wrote:
> Christophe Pettus x...@thebuild.com writes:
>> Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations?
>
> It's not a buffer *pin* that's at issue, it's a buffer header spinlock. And there are no loops, of any sort, that are executed while holding such a spinlock. At least not in the core PG code. Are you possibly using any nonstandard extensions?

It could maybe be explained by a buffer aborting while performing IO. Until it has called AbortBufferIO(), other backends will happily loop in WaitIO(), constantly taking the buffer header spinlock and locking io_in_progress_lock in shared mode, thereby preventing AbortBufferIO() from succeeding.

Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:24 PM, Andres Freund and...@2ndquadrant.com wrote:
> Is it really a regular pattern like hourly? What's your checkpoint_segments?

No, it's not a pattern like that; that's an approximation. Sometimes they come in clusters; sometimes 2-3 hours pass without one. They don't happen exclusively inside or outside of a checkpoint.

	checkpoint_timeout = 5min
	checkpoint_segments = 64
	checkpoint_completion_target = 0.9

> Could you, around the time of a crash, check grep Dirt /proc/meminfo and run iostat -xm 1 20?

Dirty:             30104 kB

avg-cpu:  %user %nice %system %iowait %steal  %idle
           3.70  0.00    0.91    0.53   0.00  94.85

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.83 113.13  1.18   2.01  0.06  0.45   329.29     0.04  12.18    1.28   18.55  0.57  0.18
sdb       0.06 113.15  0.98   1.99  0.06  0.45   349.36     0.24  79.30    3.57  116.60  1.46  0.43
md0       0.00   0.00  0.00   0.00  0.00  0.00     3.39     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  1.18 114.92  0.01  0.45     8.01     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.06 111.82  0.00  0.44     8.02     0.57   4.88    0.24    4.89  0.04  0.43
dm-1      0.00   0.00  1.11   3.03  0.00  0.01     8.00     1.25 300.47    0.38  410.89  0.17  0.07
sdc       0.00   0.00 12.10 136.13  0.50 19.97   282.85     1.94  13.07    2.30   14.03  0.55  8.20
dm-2      0.00  39.63 24.23 272.24  1.00 39.82   281.97     1.31   4.44    1.98    4.65  0.44 13.03
sdd       0.00   0.00 12.13 136.11  0.50 19.84   281.10     1.35   9.10    1.64    9.77  0.42  6.21

avg-cpu:  %user %nice %system %iowait %steal  %idle
           1.09  0.00    0.08    0.13   0.00  98.71

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md0       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-1      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdc       0.00   0.00  0.00 558.00  0.00  8.95    32.85     7.36  13.20    0.00   13.20  0.12  6.80
dm-2      0.00  28.00  0.00 558.00  0.00  8.95    32.85     7.38  13.23    0.00   13.23  0.12  6.80
sdd       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00

avg-cpu:  %user %nice %system %iowait %steal  %idle
           0.38  0.00    0.17    0.13   0.00  99.33

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md0       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-1      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdc       0.00   0.00 36.00  11.00  0.18  0.15    14.30     0.06   1.36    0.67    3.64  0.94  4.40
dm-2      0.00   0.00 36.00  11.00  0.18  0.15    14.30     0.06   1.36    0.67    3.64  0.94  4.40
sdd       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00

avg-cpu:  %user %nice %system %iowait %steal  %idle
           0.83  0.00    0.29    0.04   0.00  98.83

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:41 PM, Andres Freund and...@2ndquadrant.com wrote:
> Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before?

Interestingly, each spinlock PANIC is *followed*, about one minute later (+/- five seconds), by a "canceling statement due to statement timeout" on that exact query. The queries vary enough in text that it is unlikely to be a coincidence.

There are a *lot* of "canceling statement due to statement timeout" messages, which is interesting, because:

postgres=# show statement_timeout;
 statement_timeout
-------------------
 0
(1 row)

--
Christophe Pettus
x...@thebuild.com
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 7:35 PM, Christophe Pettus x...@thebuild.com wrote:
> There are a *lot* of "canceling statement due to statement timeout" messages, which is interesting, because:
>
> postgres=# show statement_timeout;
>  statement_timeout
> -------------------
>  0
> (1 row)

Couldn't that just be the app setting it locally? In fact, isn't that the recommended usage?

--
Peter Geoghegan
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 7:40 PM, Peter Geoghegan p...@heroku.com wrote:
> Couldn't that just be the app setting it locally?

Yes, that's what is happening there (I had to check with the client's developers). It's possible that the one-minute repeat is due to the application reissuing the query, rather than being specifically related to the spinlock issue. What this does reveal is that all the spinlock issues have been on long-running queries, for what it is worth.

--
Christophe Pettus
x...@thebuild.com
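For reference, the per-session override Peter alludes to looks like this (values illustrative) — it only affects the issuing connection, which is why "show statement_timeout" from a fresh psql session still reports the server default of 0 (disabled):

```sql
-- Session-level: applies to this connection only.
SET statement_timeout = '60s';

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL statement_timeout = '60s';
SELECT 1;  -- the guarded query would go here
COMMIT;
```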
Re: [HACKERS] stuck spinlock
Tom Lane wrote:
> Judging from the line number, this is in CreateCheckPoint. I'm betting that your platform (Solaris 2.7, you said?) has the same odd behavior that I discovered a couple of days ago on HPUX: a select() with a delay of tv_sec = 0, tv_usec = 1000000 doesn't delay 1 second like a reasonable person would expect, but fails instantly with EINVAL.

After I finally understood what you meant, this behavior looks somewhat reasonable to me, since it's a struct, but I must admit that I don't have too much knowledge in this area. Anyway, after further thought I was curious about this odd behavior on the different platforms, so I took your previously posted program, extended it a little, and ran it on all platforms I could get hold of. Please have a look at the extracted log and comments below about the different platforms. It seems that this function is a "good" example of a really incompatible implementation across platforms, even across different versions of the same OS. Happy wondering ;-)

> In short: please try the latest nightly snapshot (this fix is since beta5, unfortunately) and let me know if you still see a problem.

I did, and I didn't get the error yet, but I didn't run as many jobs either. If I get the error again, I'll post it.

Thanks for your help,
Peter

=========== AIX 4.3.3 ===========

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |    0.0 [msec/loop]   |   0.0 [msec/sel]
     500 |   10.3 [msec/loop]   |   1.0 [msec/sel]
    1000 |   10.3 [msec/loop]   |   1.0 [msec/sel]
    1500 |   15.3 [msec/loop]   |   1.5 [msec/sel]
    2000 |   20.3 [msec/loop]   |   2.0 [msec/sel]
...
  980000 | 9800.3 [msec/loop]   | 980.0 [msec/sel]
  990000 | 9899.3 [msec/loop]   | 989.9 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1000000 |    0.1 [msec/loop]   |   0.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1010000 |    0.1 [msec/loop]   |   0.0 [msec/sel]

Another, more granular run with steps of 10 usec after 1000:

NUM_OF_LOOPS: 1000

 Delay | elapsed Time            | actual Wait
-----------------------------------------------
     0 |    2.7090 [msec/loop]   | 0.0027 [msec/sel]
   100 | 1024.7370 [msec/loop]   | 1.0247 [msec/sel]
   200 | 1024.3160 [msec/loop]   | 1.0243 [msec/sel]
   300 | 1024.6510 [msec/loop]   | 1.0247 [msec/sel]
   400 | 1024.5030 [msec/loop]   | 1.0245 [msec/sel]
   500 | 1024.5400 [msec/loop]   | 1.0245 [msec/sel]
   600 | 1024.8340 [msec/loop]   | 1.0248 [msec/sel]
   700 | 1024.3110 [msec/loop]   | 1.0243 [msec/sel]
   800 | 1024.7030 [msec/loop]   | 1.0247 [msec/sel]
   900 | 1024.4560 [msec/loop]   | 1.0245 [msec/sel]
  1000 | 1024.2810 [msec/loop]   | 1.0243 [msec/sel]
  1010 | 1034.4840 [msec/loop]   | 1.0345 [msec/sel]
  1020 | 1044.0490 [msec/loop]   | 1.0440 [msec/sel]
  1030 | 1054.3530 [msec/loop]   | 1.0544 [msec/sel]
  1040 | 1064.6620 [msec/loop]   | 1.0647 [msec/sel]
  1050 | 1074.0980 [msec/loop]   | 1.0741 [msec/sel]
  1060 | 1084.4850 [msec/loop]   | 1.0845 [msec/sel]
  1070 | 1094.1270 [msec/loop]   | 1.0941 [msec/sel]
  1080 | 1104.4080 [msec/loop]   | 1.1044 [msec/sel]
  1090 | 1132.8880 [msec/loop]   | 1.1329 [msec/sel]
  1100 | 1124.2220 [msec/loop]   | 1.1242 [msec/sel]

Comments:
o minimum is 1 msec until 1000 usec, and then it tries to respect the actual number in usec
o usec >= 1 sec not allowed

=========== HP-UX 10.20 ===========

NUM_OF_LOOPS: 10

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |    0.1 [msec/loop]   |   0.0 [msec/sel]
     500 |   97.6 [msec/loop]   |   9.8 [msec/sel]
    1000 |  100.0 [msec/loop]   |  10.0 [msec/sel]
    1500 |  100.0 [msec/loop]   |  10.0 [msec/sel]
...
   14000 |  100.0 [msec/loop]   |  10.0 [msec/sel]
   14500 |  100.2 [msec/loop]   |  10.0 [msec/sel]
   15000 |  199.8 [msec/loop]   |  20.0 [msec/sel]
   15500 |  200.0 [msec/loop]   |  20.0 [msec/sel]
...
   24000 |  200.0 [msec/loop]   |  20.0 [msec/sel]
   24500 |  200.0 [msec/loop]   |  20.0 [msec/sel]
   25000 |  300.0 [msec/loop]   |  30.0 [msec/sel]
   25500 |  300.0 [msec/loop]   |  30.0 [msec/sel]
...
  980000 | 9800.1 [msec/loop]   | 980.0 [msec/sel]
  990000 | 9900.0 [msec/loop]   | 990.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1000000 |    0.1 [msec/loop]   |   0.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1010000 |    0.1 [msec/loop]   |   0.0 [msec/sel]

Comments:
o minimum is 10 msec until 1000 usec
o after 1000 it rounds down or up to the next 10 msec
o usec >= 1 sec not allowed

=========== HP-UX 11 ===========

NUM_OF_LOOPS: 10

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |   92.7 [msec/loop]   |   9.3 [msec/sel]
     500 |   99.9 [msec/loop]   |  10.0 [msec/sel]
    1000 |   99.8 [msec/loop]   |  10.0 [msec/sel]
    1500 |  100.0 [msec/loop]   |  10.0 [msec/sel]
...
    9000 |   99.9 [msec/loop]   |  10.0 [msec/sel]
    9500 |  100.1 [msec/loop]   |  10.0 [msec/sel]
   10000 |  199.9 [msec/loop]   |  20.0 [msec/sel]
   10500 |  199.9
Re: [HACKERS] stuck spinlock
Interesting numbers --- thanks for sending them along. Looks like I was mistaken to think that most platforms would allow tv_usec >= 1 sec. Ah well, another day, another bug...

			regards, tom lane
[HACKERS] stuck spinlock
Can anyone tell me what is going on when I get a stuck spinlock? Is there data corruption or anything else to worry about? I've found some references to spinlocks in the -hackers list, so is this fixed in a later version than beta4 already?

Actually, I was running a stack of pgbench jobs with a varying commit_delay parameter and # of clients, but it doesn't look deterministic in any of their values. I've gotten these fatal errors, with exactly the same data, several times now. I've restarted the postmaster as well as dropped the bench database and recreated it, but it didn't really help. The error still shows up *sometimes*. BTW, I think I didn't see this before, when I was running pgbench only once from the command line, but only since I use the script with the for loop.

Some environment info:

bench=# select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.1beta4 on sparc-sun-solaris2.7, compiled by GCC 2.95.1

checkpoint_timeout = 1800       # range 30-1800
commit_delay = 0                # range 0-1000
debug_level = 0                 # range 0-16
fsync = false
max_connections = 100           # 1-1024
shared_buffers = 4096
sort_mem = 4096
tcpip_socket = true
wal_buffers = 128               # min 4
wal_debug = 0                   # range 0-16
wal_files = 10                  # range 0-64

pgbench -i -s 10 bench
...
PGOPTIONS="-c commit_delay=$del " \
pgbench -c $cli -t 100 -n bench

Thanks,
Peter

=====================================================================
FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.
FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.
Server process (pid 7889) exited with status 6 at Mon Feb 26 09:17:36 2001
Terminating any active server processes...
NOTICE:  Message from PostgreSQL backend:
	The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
	I have rolled back the current transaction and am going to terminate your database system connection and exit.
	Please reconnect to the database system and repeat your query.
The Data Base System is in recovery mode
Server processes were terminated at Mon Feb 26 09:17:36 2001
Reinitializing shared memory and semaphores
DEBUG:  starting up
DEBUG:  database system was interrupted at 2001-02-26 09:17:33
DEBUG:  CheckPoint record at (0, 3648965776)
DEBUG:  Redo record at (0, 3648965776); Undo record at (0, 0); Shutdown FALSE
DEBUG:  NextTransactionId: 1362378; NextOid: 2362993
DEBUG:  database system was not properly shut down; automatic recovery in progress...
DEBUG:  redo starts at (0, 3648965840)
DEBUG:  ReadRecord: record with zero len at (0, 3663163376)
DEBUG:  Formatting logfile 0 seg 218 block 699 at offset 4080
DEBUG:  The last logId/logSeg is (0, 218)
DEBUG:  redo done at (0, 3663163336)

--
Best regards,
Peter Schindler
Synchronicity Inc.           | [EMAIL PROTECTED]
http://www.synchronicity.com | +49 89 89 66 99 42 (Germany)
Re: [HACKERS] stuck spinlock
Peter Schindler [EMAIL PROTECTED] writes:
> FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.

Judging from the line number, this is in CreateCheckPoint. I'm betting that your platform (Solaris 2.7, you said?) has the same odd behavior that I discovered a couple of days ago on HPUX: a select() with a delay of tv_sec = 0, tv_usec = 1000000 doesn't delay 1 second like a reasonable person would expect, but fails instantly with EINVAL. This causes the spinlock timeout in CreateCheckPoint to effectively be only a few microseconds rather than the intended ten minutes. So, if the postmaster happens to fire off a checkpoint process while some regular backend is doing something with the WAL log, kaboom.

In short: please try the latest nightly snapshot (this fix is since beta5, unfortunately) and let me know if you still see a problem.

			regards, tom lane
[HACKERS] Stuck Spinlock (fwd) - m68k architecture, 7.0.3
Has anyone got PostgreSQL 7.0.3 working on the m68k architecture? Russell is trying to install it on m68k and is consistently getting a stuck spinlock in initdb. He used to have 6.3.2 working. Both 6.5.3 and 7.0.3 fail. His message shows that the first attempt to set a lock fails.

------- Forwarded Message

Date:    Mon, 05 Feb 2001 09:03:21 -0500
From:    Russell Hires [EMAIL PROTECTED]
To:      [EMAIL PROTECTED]
Subject: Stuck Spinlock

Hey, here are the spinlock test results...

Thanks!
Russell

rusty@smurfette:~/postgresql-7.0.3/src/backend/storage/buffer$ make s_lock_test
gcc -I../../../include -I../../../backend -O2 -g -g3 -Wall -Wmissing-prototypes -Wmissing-declarations -I../.. -DS_LOCK_TEST=1 s_lock.c -o s_lock_test
s_lock.c:251: warning: return type of `main' is not `int'
./s_lock_test
FATAL: s_lock(80002974) at s_lock.c:260, stuck spinlock. Aborting.
FATAL: s_lock(80002974) at s_lock.c:260, stuck spinlock. Aborting.
make: *** [s_lock_test] Aborted
make: *** Deleting file `s_lock_test'

------- End of Forwarded Message

--
Oliver Elphick                 [EMAIL PROTECTED]
Isle of Wight                  http://www.lfix.co.uk/oliver
PGP: 1024R/32B8FAA1: 97 EA 1D 47 72 3F 28 47 6B 7E 39 CC 56 E4 C1 47
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
"Lift up your heads, O ye gates; and be ye lift up, ye everlasting doors; and the King of glory shall come in. Who is this King of glory? The LORD strong and mighty, the LORD mighty in battle." Psalms 24:7,8