Re: [HACKERS] stuck spinlock
On 12/12/13, 8:45 PM, Tom Lane wrote:
> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
> most systems dump core files with process IDs embedded in the names.

Which systems are those?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut <pete...@gmx.net> wrote:
> On 12/12/13, 8:45 PM, Tom Lane wrote:
>> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
>> most systems dump core files with process IDs embedded in the names.
>
> Which systems are those?

MacOS X dumps core files into /cores/core.$PID, and at least some Linux
systems seem to dump them into ./core.$PID.  I don't know how universal
this is.

I think a bigger objection to the SIGSTOP stuff is that a lot of bugs
are too real-time to ever be meaningfully caught that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] stuck spinlock
On Thu, Dec 26, 2013 at 03:18:23PM -0800, Robert Haas wrote:
> On Thu, Dec 26, 2013 at 11:54 AM, Peter Eisentraut <pete...@gmx.net> wrote:
>> On 12/12/13, 8:45 PM, Tom Lane wrote:
>>> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
>>> most systems dump core files with process IDs embedded in the names.
>>
>> Which systems are those?
>
> MacOS X dumps core files into /cores/core.$PID, and at least some Linux
> systems seem to dump them into ./core.$PID

On Linux it's configurable, and at least on Ubuntu you get this:

$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c

But yes, it can be configured to include the PID in the filename.

Have a nice day,
--
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
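For anyone reproducing this: the PID-in-filename behaviour discussed above is driven on Linux by the kernel.core_pattern sysctl, where %p expands to the crashing PID. A minimal sketch, assuming root access and a systemd-era /etc/sysctl.d layout (the drop-in file name is illustrative):

```shell
# Show the current pattern; on Ubuntu it is typically a pipe into apport:
cat /proc/sys/kernel/core_pattern

# Have the kernel write core files named core.<PID> in the process cwd:
sysctl -w kernel.core_pattern=core.%p

# Persist the setting across reboots (file name is arbitrary):
echo 'kernel.core_pattern = core.%p' > /etc/sysctl.d/60-core-pattern.conf
```

Note that if the pattern begins with `|`, as in the apport case, the core is piped to a helper program instead of being written to a file at all.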
Re: [HACKERS] stuck spinlock
On 2013-12-12 20:45:17 -0500, Tom Lane wrote:
> Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that
> most systems dump core files with process IDs embedded in the names.
> What would be more useful today is an option to send SIGABRT, or some
> other signal that would force core dumps.  Thoughts?

Although I didn't know of that option, I had thought before that having
it would be useful.  It allows you to inspect the memory of the
individual backends while they are still alive - which allows gdb to
call functions.  That is surely helpful when debugging some issues.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On Mon, Dec 16, 2013 at 6:46 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Hard to say, the issues fixed in the release are quite important as
>> well.  I'd tend to say they are more important.  I think we just need
>> to release 9.3.3 pretty soon.
>
> Yeah.

Has there been any talk about when a 9.3.3 (and/or 9.2.7?) patch might
be released?
Re: [HACKERS] stuck spinlock
On Sat, Dec 14, 2013 at 6:20 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
>> On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> And while we're on the subject ... isn't bgworker_die() utterly and
>>>> completely broken?  That unconditional elog(FATAL) means that no
>>>> process using that handler can do anything remotely interesting,
>>>> like say touch shared memory.
>>>
>>> Yeah, but for the record (since I see I got cc'd here), that's not my
>>> fault.  I moved it into bgworker.c, but it's been like that since
>>> Alvaro's original commit of the bgworker facility
>>> (da07a1e856511dca59cbb1357616e26baa64428e).
>>
>> Is this an edge case or something that will hit a lot of users?
>> Arbitrary server panics seem pretty serious...
>
> Is your question about the bgworker part you're quoting or about the
> stuck spinlock stuff?  I don't think the bgworker bug is too bad in
> practice, but the one in the handle_sig_alarm() stuff certainly is.  I
> think while it looks possible to hit problems without
> statement/lock_timeout, it's relatively unlikely that those are hit in
> practice.

Well, both -- I was just wondering out loud what the severity level of
this issue was.  In particular, is it advisable for the general public
to avoid this release?  My read on this is 'probably'.

merlin
Re: [HACKERS] stuck spinlock
On 2013-12-16 08:36:51 -0600, Merlin Moncure wrote:
> On Sat, Dec 14, 2013 at 6:20 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
>>> Is this an edge case or something that will hit a lot of users?
>>> Arbitrary server panics seem pretty serious...
>>
>> Is your question about the bgworker part you're quoting or about the
>> stuck spinlock stuff?  I don't think the bgworker bug is too bad in
>> practice, but the one in the handle_sig_alarm() stuff certainly is.
>
> Well, both -- I was just wondering out loud what the severity level of
> this issue was.  In particular, is it advisable for the general public
> to avoid this release?  My read on this is 'probably'.

Hard to say, the issues fixed in the release are quite important as
well.  I'd tend to say they are more important.  I think we just need
to release 9.3.3 pretty soon.

The multixact fixes in 9.3.2 weren't complete either... (see recent push)

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> Hard to say, the issues fixed in the release are quite important as
> well.  I'd tend to say they are more important.  I think we just need
> to release 9.3.3 pretty soon.

Yeah.

> The multixact fixes in 9.3.2 weren't complete either... (see recent push)

Are they complete now?

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-16 09:46:19 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> The multixact fixes in 9.3.2 weren't complete either... (see recent push)
>
> Are they complete now?

Hm.  There's two issues I know of left, both discovered in #8673:

- slru.c:SlruScanDirectory() doesn't support long enough filenames.
  Afaics that should be a fairly easy fix.
- multixact/members isn't protected against wraparounds, only
  multixact/offsets is.  That's a pretty longstanding bug though,
  although more likely to be hit these days.

Furthermore there's some missing optimizations (like the useless
multixact generation you noted upon in "Update with subselect sometimes
returns wrong result"), but those shouldn't hold up a release.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-16 09:46:19 -0500, Tom Lane wrote:
>> Are they complete now?
>
> Hm.  There's two issues I know of left, both discovered in #8673:
> - slru.c:SlruScanDirectory() doesn't support long enough filenames.
>   Afaics that should be a fairly easy fix.
> - multixact/members isn't protected against wraparounds, only
>   multixact/offsets is.  That's a pretty longstanding bug though,
>   although more likely to be hit these days.

Actually, isn't this one a must-fix as well?
http://www.postgresql.org/message-id/CAPweHKe5QQ1747X2c0tA=5zf4yns2xcvgf13opd-1mq24rf...@mail.gmail.com

			regards, tom lane
Re: [HACKERS] stuck spinlock
Tom Lane escribió:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Hm.  There's two issues I know of left, both discovered in #8673:
>> - slru.c:SlruScanDirectory() doesn't support long enough filenames.
>>   Afaics that should be a fairly easy fix.
>> - multixact/members isn't protected against wraparounds, only
>>   multixact/offsets is.  That's a pretty longstanding bug though,
>>   although more likely to be hit these days.
>
> Actually, isn't this one a must-fix as well?
> http://www.postgresql.org/message-id/CAPweHKe5QQ1747X2c0tA=5zf4yns2xcvgf13opd-1mq24rf...@mail.gmail.com

Yep, I'm going through that one now.

--
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On 2013-12-13 15:49:45 -0600, Merlin Moncure wrote:
> On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> And while we're on the subject ... isn't bgworker_die() utterly and
>>> completely broken?  That unconditional elog(FATAL) means that no
>>> process using that handler can do anything remotely interesting,
>>> like say touch shared memory.
>>
>> Yeah, but for the record (since I see I got cc'd here), that's not my
>> fault.  I moved it into bgworker.c, but it's been like that since
>> Alvaro's original commit of the bgworker facility
>> (da07a1e856511dca59cbb1357616e26baa64428e).
>
> Is this an edge case or something that will hit a lot of users?
> Arbitrary server panics seem pretty serious...

Is your question about the bgworker part you're quoting or about the
stuck spinlock stuff?  I don't think the bgworker bug is too bad in
practice, but the one in the handle_sig_alarm() stuff certainly is.  I
think while it looks possible to hit problems without
statement/lock_timeout, it's relatively unlikely that those are hit in
practice.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Hi,

On 2013-12-13 15:57:14 -0300, Alvaro Herrera wrote:
> If there was a way of raising an #error at compile time whenever a
> worker relies on the existing signal handler, I would vote for doing
> that.  (But then I have no idea how to do such a thing.)

I don't see a way either, given how disconnected registration of the
signal handler is from the bgworker infrastructure.  I think the best
we can do is to raise an error in BackgroundWorkerUnblockSignals() -
and we should definitely do that.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On 2013-12-13 13:39:42 -0500, Robert Haas wrote:
> On Fri, Dec 13, 2013 at 1:15 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> Agreed on not going forward like now, but I don't really see how they
>> could usefully use die().  I think we should just mandate that every
>> bgworker connected to shared memory registers a sigterm handler - we
>> could put a check into BackgroundWorkerUnblockSignals().  We should
>> leave the current handler in for unconnected ones though...
>> bgworkers are supposed to be written as a loop around procLatch, so
>> adding a !got_sigterm probably isn't too hard.
>
> I think the !got_sigterm thing is complete bunk.  If a background
> worker is running SQL queries, it really ought to honor a query cancel
> or sigterm at the next CHECK_FOR_INTERRUPTS().

I am not convinced by the necessity of that, not in general.  After
all, the code is using a bgworker and not a normal backend for a
reason.  If you e.g. have queueing code, it very well might need to
serialize its state to disk before shutting down.  Checking whether the
bgworker should shut down every iteration of the mainloop sounds
appropriate to me for such cases.

But I think we should provide a default handler that does the necessary
things to interrupt queries, so bgworker authors don't have to do it
themselves and, just as importantly, we can more easily add new stuff
there.

> +static void
> +handle_sigterm(SIGNAL_ARGS)
> +{
> +	int		save_errno = errno;
> +
> +	if (MyProc)
> +		SetLatch(&MyProc->procLatch);
> +
> +	if (!proc_exit_inprogress)
> +	{
> +		InterruptPending = true;
> +		ProcDiePending = true;
> +	}
> +
> +	errno = save_errno;
> +}
>
> ...but I'm not 100% sure that's right, either.

If you want a bgworker to behave as closely as possible to a normal
backend, we should probably really do the full dance die() does.
Specifically, call ProcessInterrupts() immediately if
ImmediateInterruptOK allows it; otherwise we'd just continue waiting
for locks and similar.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Hi, On 2013-12-12 19:35:36 -0800, Christophe Pettus wrote: On Dec 12, 2013, at 6:41 PM, Andres Freund and...@2ndquadrant.com wrote: Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before? Interestingly, each spinlock PANIC is *followed*, about one minute later (+/- five seconds) by a canceling statement due to statement timeout on that exact query. The queries vary enough in text that it is unlikely to be a coincidence. There are a *lot* of canceling statement due to statement timeout messages, which is interesting, because: Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5? Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()? ISTM nothing is preventing us from jumping out of code holding a spinlock? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5?
> Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()?
> ISTM nothing is preventing us from jumping out of code holding a
> spinlock?

Hm ... what should stop it is that ImmediateInterruptOK wouldn't be set
while we're messing with any spinlocks.  Except that ProcessInterrupts
doesn't check that gating condition :-(.  I think you're probably right:
what should be in the interrupt handler is something like

	if (ImmediateInterruptOK)
		CHECK_FOR_INTERRUPTS();

			regards, tom lane
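The gate Tom sketches can be illustrated with a standalone mock (plain C, not server source; ImmediateInterruptOK, InterruptPending, QueryCancelPending, and ProcessInterrupts are simulated stand-ins for the backend globals and function of the same names):

```c
#include <stdbool.h>

/* Simulated backend globals (mocks of PostgreSQL's miscadmin.h state). */
static volatile bool ImmediateInterruptOK = false;
static volatile bool InterruptPending = false;
static volatile bool QueryCancelPending = false;
static int interrupts_processed = 0;

/* Mock of ProcessInterrupts(): service a pending query cancel. */
static void ProcessInterrupts(void)
{
	if (QueryCancelPending)
	{
		QueryCancelPending = false;
		InterruptPending = false;
		interrupts_processed++;	/* the real backend would ereport(ERROR) */
	}
}

/* Mock of the repaired timeout handler: note a pending cancel, but only
 * jump into interrupt processing when the mainline said it is safe. */
static void handle_sig_alarm_mock(void)
{
	QueryCancelPending = true;
	InterruptPending = true;
	if (ImmediateInterruptOK)	/* the gate under discussion */
		ProcessInterrupts();
}

/* Timeout fires while (conceptually) a spinlock is held: nothing happens
 * immediately; the cancel is serviced at the next explicit check.
 * Returns the number of interrupts actually processed. */
static int run_timeout_demo(void)
{
	ImmediateInterruptOK = false;	/* e.g. inside a spinlock section */
	handle_sig_alarm_mock();	/* only the pending flags get set */

	ImmediateInterruptOK = true;	/* back at a safe point */
	if (InterruptPending)
		ProcessInterrupts();	/* mock of CHECK_FOR_INTERRUPTS() */
	return interrupts_processed;
}
```

The point is that the handler still records the cancel; servicing it is merely deferred until the mainline reaches a point where ImmediateInterruptOK is true, rather than longjmp'ing out from under a held spinlock.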
Re: [HACKERS] stuck spinlock
On 2013-12-13 09:52:06 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Tom, could this be caused by c357be2cd9434c70904d871d9b96828b31a50cc5?
>> Specifically the added CHECK_FOR_INTERRUPTS() in handle_sig_alarm()?
>> ISTM nothing is preventing us from jumping out of code holding a
>> spinlock?
>
> Hm ... what should stop it is that ImmediateInterruptOK wouldn't be set
> while we're messing with any spinlocks.  Except that ProcessInterrupts
> doesn't check that gating condition :-(.

It really can't, right?  Otherwise explicit CHECK_FOR_INTERRUPTS()s in
normal code wouldn't do much anymore, since ImmediateInterruptOK is so
seldom set.  The control flow around signal handling always drives me
crazy.

> I think you're probably right: what should be in the interrupt handler
> is something like
>	if (ImmediateInterruptOK)
>		CHECK_FOR_INTERRUPTS();

Yea, that sounds right.  Or just don't process interrupts there; it
doesn't seem to be required for correctness?

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-13 09:52:06 -0500, Tom Lane wrote:
>> I think you're probably right: what should be in the interrupt handler
>> is something like
>>	if (ImmediateInterruptOK)
>>		CHECK_FOR_INTERRUPTS();
>
> Yea, that sounds right.  Or just don't process interrupts there; it
> doesn't seem to be required for correctness?

It is if we need to break out of a wait-for-lock ...

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 10:30:48 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Yea, that sounds right.  Or just don't process interrupts there; it
>> doesn't seem to be required for correctness?
>
> It is if we need to break out of a wait-for-lock ...

Right, that uses MyProc->sem and not MyProc->procLatch...

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On closer inspection, I'm thinking that actually it'd be a good idea if
handle_sig_alarm did what we do in, for example, HandleCatchupInterrupt:
it should save, clear, and restore ImmediateInterruptOK, so as to make
the world safe for timeout handlers to do things that might include a
CHECK_FOR_INTERRUPTS.

And while we're on the subject ... isn't bgworker_die() utterly and
completely broken?  That unconditional elog(FATAL) means that no process
using that handler can do anything remotely interesting, like say touch
shared memory.

I didn't find any other similar hazards in a quick look through all our
signal handlers.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 11:26:44 -0500, Tom Lane wrote:
> On closer inspection, I'm thinking that actually it'd be a good idea if
> handle_sig_alarm did what we do in, for example, HandleCatchupInterrupt:
> it should save, clear, and restore ImmediateInterruptOK, so as to make
> the world safe for timeout handlers to do things that might include a
> CHECK_FOR_INTERRUPTS.

Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
eventual ProcessInterrupts() in the timeout handlers from doing
anything harmful?  Even if so, making sure ImmediateInterruptOK is
preserved seems worthwhile anyway.

> And while we're on the subject ... isn't bgworker_die() utterly and
> completely broken?  That unconditional elog(FATAL) means that no
> process using that handler can do anything remotely interesting, like
> say touch shared memory.

Yes, looks broken to me.

> I didn't find any other similar hazards in a quick look through all
> our signal handlers.

One thing I randomly noticed just now is the following in
RecoveryConflictInterrupt():

	elog(FATAL, "unrecognized conflict mode: %d", (int) reason);

Obviously that's not really ever going to hit, but it should either be
a PANIC or an Assert() for the reasons you cite.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Christophe Pettus <x...@thebuild.com> writes:
> Yes, that's what is happening there (I had to check with the client's
> developers).  It's possible that the one-minute repeat is due to the
> application reissuing the query, rather than specifically related to
> the spinlock issue.  What this does reveal is that all the spinlock
> issues have been on long-running queries, for what it is worth.

Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see if
that doesn't fix it for you.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 8:52 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see
> if that doesn't fix it for you.

Great, thanks.  Would the statement_timeout firing invoke this path?
(I'm wondering why this particular installation was experiencing this.)

--
Christophe Pettus
x...@thebuild.com
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> On 2013-12-13 11:26:44 -0500, Tom Lane wrote:
>> On closer inspection, I'm thinking that actually it'd be a good idea
>> if handle_sig_alarm did what we do in, for example,
>> HandleCatchupInterrupt: it should save, clear, and restore
>> ImmediateInterruptOK, so as to make the world safe for timeout
>> handlers to do things that might include a CHECK_FOR_INTERRUPTS.
>
> Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
> eventual ProcessInterrupts() in the timeout handlers from doing
> anything harmful?

Sorry, I misspoke there.  The case I'm worried about is doing something
like a wait for lock, which would unconditionally set and then reset
ImmediateInterruptOK.  That's not very plausible perhaps, but on the
other hand we are calling DeadLockCheck() in there, and who knows what
future timeout handlers might try to do?

BTW, I'm about to go put a HOLD_INTERRUPTS/RESUME_INTERRUPTS into
HandleCatchupInterrupt and HandleNotifyInterrupt too, for essentially
the same reason.  At least the first of these *does* include semaphore
ops, so I think it's theoretically vulnerable to losing control if a
timeout occurs while it's waiting for a semaphore.  There's probably no
real bug today because I don't think we enable catchup interrupts at
any point where a timeout would be active, but that doesn't sound
terribly future-proof.  If a timeout did happen, holding off interrupts
would have the effect of postponing the query cancel till we're done
with the catchup interrupt, which seems reasonable.

> One thing I randomly noticed just now is the following in
> RecoveryConflictInterrupt():
>	elog(FATAL, "unrecognized conflict mode: %d", (int) reason);
> Obviously that's not really ever going to hit, but it should either be
> a PANIC or an Assert() for the reasons you cite.

Yeah, PANIC there seems good.

I also thought about using START_CRIT_SECTION/END_CRIT_SECTION instead
of HOLD_INTERRUPTS/RESUME_INTERRUPTS in these signal handlers.  That
would both hold off interrupts and cause any elog(ERROR/FATAL) within
the handler to be promoted to PANIC.  But I'm not sure that'd be a net
stability improvement...

			regards, tom lane
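The save/clear/restore discipline being proposed can be sketched as a standalone mock (plain C; the holdoff counter and flags imitate the shape of PostgreSQL's miscadmin.h machinery, but nothing here is server source, and handle_sig_alarm_mock / deadlock_check_mock are illustrative names):

```c
#include <stdbool.h>

static volatile bool ImmediateInterruptOK = false;
static volatile int InterruptHoldoffCount = 0;	/* mock holdoff counter */

#define HOLD_INTERRUPTS()	(InterruptHoldoffCount++)
#define RESUME_INTERRUPTS()	(--InterruptHoldoffCount)

/* A timeout callback that may internally check for interrupts;
 * simulated as a no-op here (the real case would be DeadLockCheck()). */
static void deadlock_check_mock(void) { }

static bool observed_ok_inside_handler;

/* Mock timeout handler following the HandleCatchupInterrupt pattern:
 * save, clear, and restore ImmediateInterruptOK around the real work,
 * with interrupts held off for the duration. */
static void handle_sig_alarm_mock(void)
{
	bool save_ImmediateInterruptOK = ImmediateInterruptOK;

	ImmediateInterruptOK = false;	/* handler must not longjmp away */
	HOLD_INTERRUPTS();

	observed_ok_inside_handler = ImmediateInterruptOK;
	deadlock_check_mock();

	RESUME_INTERRUPTS();
	ImmediateInterruptOK = save_ImmediateInterruptOK;
}

/* Returns true iff the flag is forced off inside the handler yet
 * survives the handler invocation unchanged, with holdoffs balanced. */
static bool save_restore_demo(void)
{
	ImmediateInterruptOK = true;	/* e.g. blocked on a semaphore */
	handle_sig_alarm_mock();
	return ImmediateInterruptOK && !observed_ok_inside_handler
		&& InterruptHoldoffCount == 0;
}
```

The save/restore (rather than unconditionally setting the flag false and true again) is what makes the handler safe to run no matter what state the interrupted mainline was in.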
Re: [HACKERS] stuck spinlock
Christophe Pettus <x...@thebuild.com> writes:
> On Dec 13, 2013, at 8:52 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see
>> if that doesn't fix it for you.
>
> Great, thanks.  Would the statement_timeout firing invoke this path?
> (I'm wondering why this particular installation was experiencing this.)

Yeah, the problem is that either statement_timeout or lock_timeout
could cause control to be taken away from code that thinks it's
straight-line code and so doesn't have provision for getting cleaned up
at transaction abort.  Spinlocks certainly fall in that category.  I'm
afraid other weird failures are possible, though I'm not sure what.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 12:19:56 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> Shouldn't the HOLD_INTERRUPTS() in handle_sig_alarm() prevent any
>> eventual ProcessInterrupts() in the timeout handlers from doing
>> anything harmful?
>
> Sorry, I misspoke there.  The case I'm worried about is doing
> something like a wait for lock, which would unconditionally set and
> then reset ImmediateInterruptOK.

I sure hope we're not going to introduce more paths that do this, but I
am not going to bet on it...  I remember trying to understand why the
deadlock detector is safe doing as it does when I was all green and was
trying to understand the HS patch, and it drove me nuts.

> BTW, I'm about to go put a HOLD_INTERRUPTS/RESUME_INTERRUPTS into
> HandleCatchupInterrupt and HandleNotifyInterrupt too, for essentially
> the same reason.

Sounds good; both already do a ProcessInterrupts() at their end, so the
holdoff shouldn't lead to absorbed interrupts.

I wonder what to do about bgworker's bgworker_die()?  I don't really
see how that can be fixed without breaking the API?

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
Andres Freund <and...@2ndquadrant.com> writes:
> I wonder what to do about bgworker's bgworker_die()?  I don't really
> see how that can be fixed without breaking the API?

IMO it should be flushed and bgworkers should use the same die()
handler as every other backend, or else one like the one in worker_spi,
which just sets a flag for testing later.  If we try to change the
signal handling contracts, 80% of backend code will be unusable in
bgworkers, which is not where we want to be I think.

			regards, tom lane
Re: [HACKERS] stuck spinlock
On 2013-12-13 12:54:09 -0500, Tom Lane wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> I wonder what to do about bgworker's bgworker_die()?  I don't really
>> see how that can be fixed without breaking the API?
>
> IMO it should be flushed and bgworkers should use the same die()
> handler as every other backend, or else one like the one in
> worker_spi, which just sets a flag for testing later.

Agreed on not going forward like now, but I don't really see how they
could usefully use die().  I think we should just mandate that every
bgworker connected to shared memory registers a sigterm handler - we
could put a check into BackgroundWorkerUnblockSignals().  We should
leave the current handler in for unconnected ones though...  bgworkers
are supposed to be written as a loop around procLatch, so adding a
!got_sigterm probably isn't too hard.

It sucks that people might have bgworkers out there that don't register
their own sigterm handlers, but adding a sigterm handler will be
backward compatible and it's in the example bgworker, so it's probably
not too bad.

> If we try to change the signal handling contracts, 80% of backend code
> will be unusable in bgworkers, which is not where we want to be I
> think.

Yea, I think that's out of the question.

Greetings,

Andres Freund
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> And while we're on the subject ... isn't bgworker_die() utterly and
> completely broken?  That unconditional elog(FATAL) means that no
> process using that handler can do anything remotely interesting, like
> say touch shared memory.

Yeah, but for the record (since I see I got cc'd here), that's not my
fault.  I moved it into bgworker.c, but it's been like that since
Alvaro's original commit of the bgworker facility
(da07a1e856511dca59cbb1357616e26baa64428e).

While I was developing the shared memory message queueing stuff, I
experimented with using die() as the signal handler and didn't have
very good luck.  I can't remember exactly what wasn't working any more,
though.  I agree that it would be good if we can make that work.  Right
now we've got other modules growing warts like
WalRcvImmediateInterruptOK, which doesn't seem good.

It seems to me that we should change every place that temporarily
changes ImmediateInterruptOK to restore the original value instead of
making assumptions about what it must have been.
ClientAuthentication(), md5_crypt_verify(), PGSemaphoreLock() and
WalSndLoop() all have this disease.  I also really wonder if notify and
catchup interrupts ought to be taught to respect ImmediateInterruptOK,
instead of having their own switches for the same thing.  Right now
there are an awful lot of places that do this:

	ImmediateInterruptOK = false;	/* not idle anymore */
	DisableNotifyInterrupt();
	DisableCatchupInterrupt();

...and that doesn't seem like a good thing.  Heaven forfend someone
were to do only two out of the three.

--
Robert Haas
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 1:15 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-12-13 12:54:09 -0500, Tom Lane wrote:
>> IMO it should be flushed and bgworkers should use the same die()
>> handler as every other backend, or else one like the one in
>> worker_spi, which just sets a flag for testing later.
>
> Agreed on not going forward like now, but I don't really see how they
> could usefully use die().  I think we should just mandate that every
> bgworker connected to shared memory registers a sigterm handler - we
> could put a check into BackgroundWorkerUnblockSignals().  We should
> leave the current handler in for unconnected ones though...  bgworkers
> are supposed to be written as a loop around procLatch, so adding a
> !got_sigterm probably isn't too hard.

I think the !got_sigterm thing is complete bunk.  If a background
worker is running SQL queries, it really ought to honor a query cancel
or sigterm at the next CHECK_FOR_INTERRUPTS().  But the default
background worker handler for SIGUSR1 just sets the process latch, and
worker_spi's sigterm handler just sets a private variable,
got_sigterm.  So ProcessInterrupts() will never get called, and if it
did it wouldn't do anything anyway.  That's really pretty horrible,
because it means that the query worker_spi runs can't be interrupted
short of a SIGQUIT.  So I think worker_spi is really a very bad example
of how to do this right.

In the as-yet-uncommitted test-shm-mq-v1.patch, I did this:

+static void
+handle_sigterm(SIGNAL_ARGS)
+{
+	int		save_errno = errno;
+
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}

...but I'm not 100% sure that's right, either.

--
Robert Haas
Re: [HACKERS] stuck spinlock
Robert Haas escribió: On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane t...@sss.pgh.pa.us wrote: And while we're on the subject ... isn't bgworker_die() utterly and completely broken? That unconditional elog(FATAL) means that no process using that handler can do anything remotely interesting, like say touch shared memory. Yeah, but for the record (since I see I got cc'd here), that's not my fault. I moved it into bgworker.c, but it's been like that since Alvaro's original commit of the bgworker facility (da07a1e856511dca59cbb1357616e26baa64428e). I see the blame falls on me ;-) I reckon I blindly copied this stuff from elsewhere without thinking very much about it. As noted upthread, even the example code uses a different handler for SIGTERM. There wasn't much else that we could do; simply letting the generic code run without any SIGTERM handler installed didn't seem the right thing to do. (You probably recall that the business of starting workers with signals blocked was installed later.) I found a few workers on GitHub in a quick search:
https://github.com/umitanuki/mongres/blob/master/mongres.c
https://github.com/markwkm/pg_httpd/blob/master/pg_httpd.c
https://github.com/ibarwick/config_log/blob/master/config_log.c
https://github.com/gleu/stats_recorder/blob/master/stats_recorder_spi.c
https://github.com/michaelpq/pg_workers/blob/master/kill_idle/kill_idle.c
https://github.com/le0pard/pg_web/blob/master/src/pg_web.c
Not a single one of them uses bgworker_die() -- they all follow worker_spi's lead of setting a got_sigterm flag and SetLatch(). If there were a way to raise an #error at compile time whenever a worker relies on the existing signal handler, I would vote for doing that. (But then I have no idea how to do such a thing.) 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Robert Haas robertmh...@gmail.com writes: It seems to me that we should change every place that temporarily changes ImmediateInterruptOK to restore the original value instead of making assumptions about what it must have been. No, that's backwards. The problem isn't that it could be sane to enter, say, PGSemaphoreLock with ImmediateInterruptOK already true; to get there, you'd have had to pass through boatloads of code in which it patently isn't safe for that to be the case. Rather, the problem is that once you get there it might *still* be unsafe to throw an error. HOLD/RESUME_INTERRUPTS are designed to handle exactly that problem. The only other way we could handle it would be if every path from (say) HandleNotifyInterrupt down to PGSemaphoreLock passed a bool flag to tell it don't turn on ImmediateInterruptOK; which is pretty unworkable. I also really wonder if notify and catchup interrupts ought to be taught to respect ImmediateInterruptOK, instead of having their own switches for the same thing. They're not switches for the same thing though; the effects are different, and in fact there are places that do and should flip only some of these, PGSemaphoreLock being just the most obvious one. I agree that it might be possible to simplify things, but it would take more thought than you seem to have put into it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
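A stand-alone analog of the HOLD_INTERRUPTS/RESUME_INTERRUPTS mechanism Tom refers to: a nesting counter that defers acting on a pending interrupt until the holdoff count drops back to zero. The variable names mirror PostgreSQL's, but this is a sketch, not the real macros; in particular, the real RESUME_INTERRUPTS does not itself service interrupts — that happens at the next CHECK_FOR_INTERRUPTS().

```c
/* Simplified model: InterruptPending is set asynchronously (by a signal
 * handler in the real system); check_for_interrupts() only acts on it
 * when no critical section currently holds interrupts off. */
static volatile int InterruptHoldoffCount = 0;
static volatile int InterruptPending = 0;
static int interrupts_serviced = 0;

static void hold_interrupts(void)   { InterruptHoldoffCount++; }
static void resume_interrupts(void) { InterruptHoldoffCount--; }

static void
check_for_interrupts(void)
{
    if (InterruptPending && InterruptHoldoffCount == 0)
    {
        InterruptPending = 0;
        interrupts_serviced++;   /* the real code would ereport(ERROR) here */
    }
}
```

The point of the counter, as opposed to a boolean, is that nested hold/resume pairs compose: an inner RESUME cannot accidentally re-enable interrupts while an outer section still needs them held off.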
Re: [HACKERS] stuck spinlock
On Fri, Dec 13, 2013 at 12:32 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Dec 13, 2013 at 11:26 AM, Tom Lane t...@sss.pgh.pa.us wrote: And while we're on the subject ... isn't bgworker_die() utterly and completely broken? That unconditional elog(FATAL) means that no process using that handler can do anything remotely interesting, like say touch shared memory. Yeah, but for the record (since I see I got cc'd here), that's not my fault. I moved it into bgworker.c, but it's been like that since Alvaro's original commit of the bgworker facility (da07a1e856511dca59cbb1357616e26baa64428e). Is this an edge case or something that will hit a lot of users? Arbitrary server panics seem pretty serious... merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 1:49 PM, Merlin Moncure mmonc...@gmail.com wrote: Is this an edge case or something that will hit a lot of users? My understanding (Tom can correct me if I'm wrong, I'm sure) is that it is an issue for servers on 9.3.2 where there are a lot of query cancellations due to facilities like statement_timeout or lock_timeout that cancel a query asynchronously. I assume pg_cancel_backend() would apply as well. We've only seen it on one client, and that client had a *lot* (thousands on thousands) of statement_timeout cancellations. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 13, 2013, at 8:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Please apply commit 478af9b79770da43a2d89fcc5872d09a2d8731f8 and see if that doesn't fix it for you. It appears to fix it. Thanks! -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] stuck spinlock
Greetings, Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form:
/var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted
uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux. Generally, there's no core file (which is currently enabled), as the postmaster just normally exits the backend. Diagnosis suggestions? -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted uname -a: Linux postgresql3-master 3.8.0-33-generic #48~precise1-Ubuntu SMP Thu Oct 24 16:28:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux. Generally, there's no core file (which is currently enable), as the postmaster just normally exits the backend. Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? As for the root cause, it's hard to say. The file/line number says it's a buffer header lock that's stuck. I rechecked all the places that lock buffer headers, and all of them have very short code paths to the corresponding unlock, so there's no obvious explanation how this could happen. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
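For reference, the "stuck spinlock" PANIC comes from s_lock() giving up after a bounded amount of spinning and calling s_lock_stuck(). A stand-alone analog using C11 atomics, with an illustrative iteration cap — the real s_lock() inserts delays and uses a much larger budget before declaring the lock stuck:

```c
#include <stdatomic.h>

enum { SPINS_BEFORE_STUCK = 1000 };   /* illustrative, not PostgreSQL's tuning */

/* Try to take a test-and-set lock; report -1 ("stuck spinlock") if the
 * current holder never releases it within the spin budget. */
static int
spin_lock_or_stuck(atomic_flag *lock)
{
    for (int i = 0; i < SPINS_BEFORE_STUCK; i++)
    {
        if (!atomic_flag_test_and_set(lock))
            return 0;                 /* acquired */
    }
    return -1;                        /* the real server PANICs here */
}
```

This is why Tom's point matters: since every legitimate holder releases a buffer header spinlock after only a few instructions, a timeout can only fire if the holder died, hung, or got descheduled for a very long time while holding it.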
Re: [HACKERS] stuck spinlock
On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: /var/lib/postgresql/9.3/main/pg_log/postgresql-2013-12-12_211710.csv:2013-12-12 21:40:10.328 UTC,n,n,32376,10.2.1.142:52451,52aa24eb.7e78,5,SELECT,2013-12-12 21:04:43 UTC,9/7178,0,PANIC,XX000,stuck spinlock (0x7f7df94672f4) detected at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099,,redacted Any other changes but the upgrade? Maybe a different compiler version? Also, could you share some details about the workload? Highly concurrent? Standby? ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 3:33 PM, Andres Freund and...@2ndquadrant.com wrote: Any other changes but the upgrade? Maybe a different compiler version? Show pg_config output. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:37 PM, Peter Geoghegan p...@heroku.com wrote: Show pg_config output. Below; it's the Ubuntu package.
BINDIR = /usr/lib/postgresql/9.3/bin
DOCDIR = /usr/share/doc/postgresql-doc-9.3
HTMLDIR = /usr/share/doc/postgresql-doc-9.3
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/9.3/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql/9.3/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.3/man
SHAREDIR = /usr/share/postgresql/9.3
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/9.3/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/tcl8.5' '--with-tkconfig=/usr/lib/tk8.5' '--with-includes=/usr/include/tcl8.5' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.3/man' '--docdir=/usr/share/doc/postgresql-doc-9.3' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.3' '--bindir=/usr/lib/postgresql/9.3/bin' '--libdir=/usr/lib/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-debug' '--disable-rpath' '--with-ossp-uuid' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'
CC = gcc
CPPFLAGS = -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5
CFLAGS = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -fPIC -pie -I/usr/include/mit-krb5 -DLINUX_OOM_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5 -L/usr/lib/x86_64-linux-gnu -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lpam -lssl -lcrypto -lkrb5 -lcom_err -lgssapi_krb5 -lz -ledit -lcrypt -ldl -lm
VERSION = PostgreSQL 9.3.2
-- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:33 PM, Andres Freund and...@2ndquadrant.com wrote: Any other changes but the upgrade? Maybe a different compiler version? Just the upgrade; they're using the Ubuntu packages from apt.postgresql.org. Also, could you share some details about the workload? Highly concurrent? Standby? ... The workload is not very highly concurrent; actually quite lightly loaded. There are a very large number (442,000) of user tables. No standby attached. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf. As for the root cause, it's hard to say. The file/line number says it's a buffer header lock that's stuck. I rechecked all the places that lock buffer headers, and all of them have very short code paths to the corresponding unlock, so there's no obvious explanation how this could happen. The server was running with shared_buffers=100GB, but the problem has reoccurred now with shared_buffers=16GB. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hm, a PANIC really ought to result in a core file. You sure you don't have that disabled (perhaps via a ulimit setting)? Since it's using the Ubuntu packaging, we have pg_ctl_options = '-c' in /etc/postgresql/9.3/main/pg_ctl.conf. [ shrug... ] If you aren't getting a core file for a PANIC, then core files are disabled. I take no position on the value of the setting you mention above, but I will note that pg_ctl can't override a hard ulimit -c 0 system-wide setting. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
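Whether a PANIC can actually produce a core file is governed by RLIMIT_CORE, which each backend inherits from the postmaster's environment. A minimal check (the helper name core_limit_bytes is invented for illustration):

```c
#include <sys/resource.h>

/* Return the effective soft core-file size limit in bytes:
 *    0  means core dumps are disabled,
 *   -1  means the limit could not be read,
 *   -2  means unlimited. */
static long long
core_limit_bytes(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -2;
    return (long long) rl.rlim_cur;
}
```

On the shell side this is `ulimit -c`; note that, as Tom says, a soft limit raised by pg_ctl cannot exceed a hard limit of 0 set system-wide, and on Linux the destination filename additionally depends on /proc/sys/kernel/core_pattern.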
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: If you aren't getting a core file for a PANIC, then core files are disabled. And just like that, we get one. Stack trace:
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f699c81991b in errfinish ()
#3 0x7f699c81a477 in elog_finish ()
#4 0x7f699c735db3 in s_lock ()
#5 0x7f699c71e1f0 in ?? ()
#6 0x7f699c71eaf9 in ?? ()
#7 0x7f699c71f53e in ReadBufferExtended ()
#8 0x7f699c56d03a in index_fetch_heap ()
#9 0x7f699c67a0b7 in ?? ()
#10 0x7f699c66e98e in ExecScan ()
#11 0x7f699c6679a8 in ExecProcNode ()
#12 0x7f699c67407f in ExecAgg ()
#13 0x7f699c6678b8 in ExecProcNode ()
#14 0x7f699c664dd2 in standard_ExecutorRun ()
#15 0x7f6996ad928d in ?? () from /usr/lib/postgresql/9.3/lib/auto_explain.so
#16 0x7f69968d3525 in ?? () from /usr/lib/postgresql/9.3/lib/pg_stat_statements.so
#17 0x7f699c745207 in ?? ()
#18 0x7f699c746651 in PortalRun ()
#19 0x7f699c742960 in PostgresMain ()
#20 0x7f699c6ff765 in PostmasterMain ()
#21 0x7f699c53bea2 in main ()
-- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On 2013-12-12 16:22:28 -0800, Christophe Pettus wrote: On Dec 12, 2013, at 4:04 PM, Tom Lane t...@sss.pgh.pa.us wrote: If you aren't getting a core file for a PANIC, then core files are disabled. And just like that, we get one. Stack trace: #0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x7f699c81991b in errfinish () #3 0x7f699c81a477 in elog_finish () #4 0x7f699c735db3 in s_lock () #5 0x7f699c71e1f0 in ?? () #6 0x7f699c71eaf9 in ?? () Could you install the -dbg package and regenerate? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Of course!
#0 0x7f699a4fa425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7f699a4fdb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7f699c81991b in errfinish (dummy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x7f699c81a477 in elog_finish (elevel=optimized out, fmt=0x7f699c937a48 stuck spinlock (%p) detected at %s:%d) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x7f699c735db3 in s_lock_stuck (line=1099, file=0x7f699c934a78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, lock=0x7f6585e2cbb4 \001) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f6585e2cbb4 \001, file=0x7f699c934a78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, line=1099) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x7f699c71e1f0 in PinBuffer (buf=0x7f6585e2cb94, strategy=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x7f699c71eaf9 in BufferAlloc (foundPtr=0x7fff60ec563e , strategy=0x0, blockNum=1730, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=optimized out, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1730, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff60ec56af ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x7f699c71f53e in ReadBufferExtended (reln=0x7f6577d80560, forkNum=MAIN_FORKNUM, blockNum=1730, mode=optimized out, strategy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7f699c56d03a in index_fetch_heap (scan=0x7f699f94c7a0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/index/indexam.c:515
#11 0x7f699c67a0b7 in IndexOnlyNext (node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeIndexonlyscan.c:109
#12 0x7f699c66e98e in ExecScanFetch (recheckMtd=0x7f699c679fb0 IndexOnlyRecheck, accessMtd=0x7f699c679fe0 IndexOnlyNext, node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#13 ExecScan (node=0x7f699f94b690, accessMtd=0x7f699c679fe0 IndexOnlyNext, recheckMtd=0x7f699c679fb0 IndexOnlyRecheck) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#14 0x7f699c6679a8 in ExecProcNode (node=0x7f699f94b690) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:408
#15 0x7f699c67407f in agg_retrieve_direct (aggstate=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#16 ExecAgg (node=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#17 0x7f699c6678b8 in ExecProcNode (node=0x7f699f94af90) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#18 0x7f699c664dd2 in ExecutePlan (dest=0x7f699f98c308, direction=optimized out, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x7f699f94af90, estate=0x7f699f94ae80) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#19 standard_ExecutorRun (queryDesc=0x7f699f940dc0, direction=optimized out, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#20 0x7f6996ad928d in explain_ExecutorRun (queryDesc=0x7f699f940dc0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#21 0x7f69968d3525 in pgss_ExecutorRun (queryDesc=0x7f699f940dc0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#22 0x7f699c745207 in PortalRunSelect (portal=0x7f699de596a0, forward=optimized out, count=0, dest=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#23 0x7f699c746651 in PortalRun (portal=0x7f699de596a0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f699f98c308, altdest=0x7f699f98c308, completionTag=0x7fff60ec5f30 ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:790
#24 0x7f699c742960 in exec_simple_query (query_string=0x7f699dd564a0 SELECT COUNT(*) FROM \signups\ WHERE (signups.is_supporter = true)) at
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Here's another, same system, different crash:
#0 0x7fa03faf5425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x7fa03faf8b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x7fa041e1491b in errfinish (dummy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:542
#3 0x7fa041e15477 in elog_finish (elevel=optimized out, fmt=0x7fa041f32a48 stuck spinlock (%p) detected at %s:%d) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/utils/error/elog.c:1297
#4 0x7fa041d30db3 in s_lock_stuck (line=1099, file=0x7fa041f2fa78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, lock=0x7f9c2acb2ac8 \001) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:40
#5 s_lock (lock=0x7f9c2acb2ac8 \001, file=0x7fa041f2fa78 /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c, line=1099) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/lmgr/s_lock.c:109
#6 0x7fa041d191f0 in PinBuffer (buf=0x7f9c2acb2aa8, strategy=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:1099
#7 0x7fa041d19af9 in BufferAlloc (foundPtr=0x7fff1948963e \001, strategy=0x0, blockNum=8796, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:776
#8 ReadBuffer_common (smgr=optimized out, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=8796, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff194896af ) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:333
#9 0x7fa041d1a53e in ReadBufferExtended (reln=0x7f9c1edd4908, forkNum=MAIN_FORKNUM, blockNum=8796, mode=optimized out, strategy=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/storage/buffer/bufmgr.c:252
#10 0x7fa041b5a706 in heapgetpage (scan=0x7fa043389050, page=8796) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:332
#11 0x7fa041b5ac12 in heapgettup_pagemode (scan=0x7fa043389050, dir=optimized out, nkeys=0, key=0x0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:939
#12 0x7fa041b5bf76 in heap_getnext (scan=0x7fa043389050, direction=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/access/heap/heapam.c:1459
#13 0x7fa041c7a9eb in SeqNext (node=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeSeqscan.c:66
#14 0x7fa041c6998e in ExecScanFetch (recheckMtd=0x7fa041c7a9b0 SeqRecheck, accessMtd=0x7fa041c7a9c0 SeqNext, node=0x7fa0440f1c10) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:82
#15 ExecScan (node=0x7fa0440f1c10, accessMtd=0x7fa041c7a9c0 SeqNext, recheckMtd=0x7fa041c7a9b0 SeqRecheck) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execScan.c:167
#16 0x7fa041c629c8 in ExecProcNode (node=0x7fa0440f1c10) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:400
#17 0x7fa041c6f07f in agg_retrieve_direct (aggstate=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1121
#18 ExecAgg (node=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/nodeAgg.c:1013
#19 0x7fa041c628b8 in ExecProcNode (node=0x7fa0440f1510) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execProcnode.c:476
#20 0x7fa041c5fdd2 in ExecutePlan (dest=0x7fa042a955e0, direction=optimized out, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x7fa0440f1510, estate=0x7fa0440f1400) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:1472
#21 standard_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=optimized out, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/executor/execMain.c:307
#22 0x7fa03c0d428d in explain_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/auto_explain/auto_explain.c:233
#23 0x7fa03bece525 in pgss_ExecutorRun (queryDesc=0x7fa0440f0ff0, direction=ForwardScanDirection, count=0) at /tmp/buildd/postgresql-9.3-9.3.2/build/../contrib/pg_stat_statements/pg_stat_statements.c:717
#24 0x7fa041d40207 in PortalRunSelect (portal=0x7fa0427061f0, forward=optimized out, count=0, dest=optimized out) at /tmp/buildd/postgresql-9.3-9.3.2/build/../src/backend/tcop/pquery.c:946
#25 0x7fa041d41651 in PortalRun (portal=0x7fa0427061f0, count=9223372036854775807, isTopLevel=1
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 4:23 PM, Andres Freund and...@2ndquadrant.com wrote: Could you install the -dbg package and regenerate? Here's another, same system, different crash: Both of these look like absolutely run-of-the-mill buffer access attempts. Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. Whoever is holding the spinlock is just going down with the rest of the system ... In a devel environment, I'd try using the postmaster's -T switch so that it SIGSTOP's all the backends instead of SIGQUIT'ing them, and then I'd run around and gdb all the other backends to try to see which one was holding the spinlock and why. Unfortunately, that's probably not practical in a production environment; it'd take too long to collect the stack traces by hand. So I have no good ideas about how to debug this, unless you can reproduce it on a devel box, or are willing to run modified executables in production. Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that most systems dump core files with process IDs embedded in the names. What would be more useful today is an option to send SIGABRT, or some other signal that would force core dumps. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations? -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Christophe Pettus x...@thebuild.com writes: On Dec 12, 2013, at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations? It's not a buffer *pin* that's at issue, it's a buffer header spinlock. And there are no loops, of any sort, that are executed while holding such a spinlock. At least not in the core PG code. Are you possibly using any nonstandard extensions? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:15 PM, Tom Lane t...@sss.pgh.pa.us wrote: Are you possibly using any nonstandard extensions? No, totally stock PostgreSQL. -- -- Christophe Pettus x...@thebuild.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
Hi, On 2013-12-12 13:50:06 -0800, Christophe Pettus wrote: Immediately after an upgrade from 9.3.1 to 9.3.2, we have a client getting frequent (hourly) errors of the form: Is it really a regular pattern like hourly? What's your checkpoint_segments? Could you, arround the time of a crash, check grep Dirt /proc/meminfo and run iostat -xm 1 20? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 5:45 PM, Tom Lane t...@sss.pgh.pa.us wrote: Memo to hackers: I think the SIGSTOP stuff is rather obsolete now that most systems dump core files with process IDs embedded in the names. What would be more useful today is an option to send SIGABRT, or some other signal that would force core dumps. Thoughts? I think it would be possible, at least on Linux, to have GDB connect to the postmaster, and then automatically create new inferiors as new backends are forked, and then have every inferior paused as breakpoints are hit. See: http://sourceware.org/gdb/onlinedocs/gdb/Forks.html and http://sourceware.org/gdb/onlinedocs/gdb/All_002dStop-Mode.html (I think the word 'thread' is just a shorthand for 'inferior' on the all-stop mode doc page, and you can definitely debug Postgres processes in multiple inferiors today). Now, I'm not sure how feasible this is in a production debugging situation. It seems like an interesting way of debugging these sorts of issues that should be explored and perhaps subsequently codified. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] stuck spinlock
On 2013-12-12 21:15:29 -0500, Tom Lane wrote:
> Christophe Pettus x...@thebuild.com writes:
>> Presumably, we are seeing the victim rather than the perpetrator of whatever is going wrong. This is probing about a bit blindly, but the only thing I can see about this system that is in some way unique (and this is happening on multiple machines, so it's unlikely to be hardware) is that there are a relatively large number of relations (like, 440,000+) distributed over many schemas. Is there anything that pins a buffer that is O(N) to the number of relations?
>
> It's not a buffer *pin* that's at issue, it's a buffer header spinlock. And there are no loops, of any sort, that are executed while holding such a spinlock. At least not in the core PG code. Are you possibly using any nonstandard extensions?

It could maybe be explained by a buffer aborting while performing IO. Until it has called AbortBufferIO(), other backends will happily loop in WaitIO(), constantly taking the buffer header spinlock and locking io_in_progress_lock in shared mode, thereby preventing AbortBufferIO() from succeeding.

Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:24 PM, Andres Freund and...@2ndquadrant.com wrote:
> Is it really a regular pattern like hourly? What's your checkpoint_segments?

No, it's not a pattern like that; that's an approximation. Sometimes they come in clusters; sometimes 2-3 hours pass without one. They don't happen exclusively inside or outside of a checkpoint.

	checkpoint_timeout = 5min
	checkpoint_segments = 64
	checkpoint_completion_target = 0.9

> Could you, around the time of a crash, check grep Dirt /proc/meminfo and run iostat -xm 1 20?

Dirty:             30104 kB

avg-cpu:  %user %nice %system %iowait %steal  %idle
           3.70  0.00    0.91    0.53   0.00  94.85

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.83 113.13  1.18   2.01  0.06  0.45   329.29     0.04  12.18    1.28   18.55  0.57  0.18
sdb       0.06 113.15  0.98   1.99  0.06  0.45   349.36     0.24  79.30    3.57  116.60  1.46  0.43
md0       0.00   0.00  0.00   0.00  0.00  0.00     3.39     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  1.18 114.92  0.01  0.45     8.01     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.06 111.82  0.00  0.44     8.02     0.57   4.88    0.24    4.89  0.04  0.43
dm-1      0.00   0.00  1.11   3.03  0.00  0.01     8.00     1.25 300.47    0.38  410.89  0.17  0.07
sdc       0.00   0.00 12.10 136.13  0.50 19.97   282.85     1.94  13.07    2.30   14.03  0.55  8.20
dm-2      0.00  39.63 24.23 272.24  1.00 39.82   281.97     1.31   4.44    1.98    4.65  0.44 13.03
sdd       0.00   0.00 12.13 136.11  0.50 19.84   281.10     1.35   9.10    1.64    9.77  0.42  6.21

avg-cpu:  %user %nice %system %iowait %steal  %idle
           1.09  0.00    0.08    0.13   0.00  98.71

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md0       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-1      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdc       0.00   0.00  0.00 558.00  0.00  8.95    32.85     7.36  13.20    0.00   13.20  0.12  6.80
dm-2      0.00  28.00  0.00 558.00  0.00  8.95    32.85     7.38  13.23    0.00   13.23  0.12  6.80
sdd       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00

avg-cpu:  %user %nice %system %iowait %steal  %idle
           0.38  0.00    0.17    0.13   0.00  99.33

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md0       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
md1       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-0      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
dm-1      0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdc       0.00   0.00 36.00  11.00  0.18  0.15    14.30     0.06   1.36    0.67    3.64  0.94  4.40
dm-2      0.00   0.00 36.00  11.00  0.18  0.15    14.30     0.06   1.36    0.67    3.64  0.94  4.40
sdd       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00

avg-cpu:  %user %nice %system %iowait %steal  %idle
           0.83  0.00    0.29    0.04   0.00  98.83

Device: rrqm/s wrqm/s   r/s    w/s rMB/s wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda       0.00   0.00  0.00   0.00  0.00  0.00     0.00     0.00   0.00    0.00    0.00  0.00  0.00
sdb       0.00   0.00
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 6:41 PM, Andres Freund and...@2ndquadrant.com wrote:
> Christophe: are there any unusual ERROR messages preceding the crash, possibly some minutes before?

Interestingly, each spinlock PANIC is *followed*, about one minute later (+/- five seconds), by a "canceling statement due to statement timeout" on that exact query. The queries vary enough in text that it is unlikely to be a coincidence.

There are a *lot* of "canceling statement due to statement timeout" messages, which is interesting, because:

postgres=# show statement_timeout;
 statement_timeout
-------------------
 0
(1 row)

--
Christophe Pettus
x...@thebuild.com
Re: [HACKERS] stuck spinlock
On Thu, Dec 12, 2013 at 7:35 PM, Christophe Pettus x...@thebuild.com wrote:
> There are a *lot* of "canceling statement due to statement timeout" messages, which is interesting, because:
>
> postgres=# show statement_timeout;
>  statement_timeout
> -------------------
>  0
> (1 row)

Couldn't that just be the app setting it locally? In fact, isn't that the recommended usage?

--
Peter Geoghegan
Re: [HACKERS] stuck spinlock
On Dec 12, 2013, at 7:40 PM, Peter Geoghegan p...@heroku.com wrote:
> Couldn't that just be the app setting it locally?

Yes, that's what is happening there (I had to check with the client's developers). It's possible that the one-minute repeat is due to the application reissuing the query, rather than being specifically related to the spinlock issue. What this does reveal is that all the spinlock issues have been on long-running queries, for what it is worth.

--
Christophe Pettus
x...@thebuild.com
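For reference, the per-session override Peter alludes to looks like this (values illustrative) — it only affects the issuing connection, which is why "show statement_timeout" from a fresh psql session still reports the server default of 0 (disabled):

```sql
-- Session-level: applies to this connection only.
SET statement_timeout = '60s';

-- Or scoped to a single transaction:
BEGIN;
SET LOCAL statement_timeout = '60s';
SELECT 1;  -- the guarded query would go here
COMMIT;
```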
Re: [HACKERS] stuck spinlock
Tom Lane wrote:
> Judging from the line number, this is in CreateCheckPoint. I'm betting that your platform (Solaris 2.7, you said?) has the same odd behavior that I discovered a couple of days ago on HPUX: a select() with a delay of tv_sec = 0, tv_usec = 1000000 doesn't delay 1 second like a reasonable person would expect, but fails instantly with EINVAL.

After I finally understood what you meant, this behavior looks somewhat reasonable to me, since it's a struct, but I must admit that I don't have too much knowledge in this area. Anyway, after further thought I was curious about this odd behavior on the different platforms, so I took your previously posted program, extended it a little, and ran it on all platforms I could get hold of. Please have a look at the extracted log and comments below about the different platforms. It seems that this function is a "good" example of a really incompatible implementation across platforms, even across different versions of the same OS. Happy wondering ;-)

> In short: please try the latest nightly snapshot (this fix is since beta5, unfortunately) and let me know if you still see a problem.

I did, and I didn't get the error yet, but I didn't run as many jobs either. If I get the error again, I'll post it.

Thanks for your help,
Peter

=========== AIX 4.3.3 ===========

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |    0.0 [msec/loop]   |   0.0 [msec/sel]
     500 |   10.3 [msec/loop]   |   1.0 [msec/sel]
    1000 |   10.3 [msec/loop]   |   1.0 [msec/sel]
    1500 |   15.3 [msec/loop]   |   1.5 [msec/sel]
    2000 |   20.3 [msec/loop]   |   2.0 [msec/sel]
...
  980000 | 9800.3 [msec/loop]   | 980.0 [msec/sel]
  990000 | 9899.3 [msec/loop]   | 989.9 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1000000 |    0.1 [msec/loop]   |   0.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1010000 |    0.1 [msec/loop]   |   0.0 [msec/sel]

Another, more granular run with steps of 10 usec after 1000:

NUM_OF_LOOPS: 1000

 Delay | elapsed Time            | actual Wait
-----------------------------------------------
     0 |    2.7090 [msec/loop]   | 0.0027 [msec/sel]
   100 | 1024.7370 [msec/loop]   | 1.0247 [msec/sel]
   200 | 1024.3160 [msec/loop]   | 1.0243 [msec/sel]
   300 | 1024.6510 [msec/loop]   | 1.0247 [msec/sel]
   400 | 1024.5030 [msec/loop]   | 1.0245 [msec/sel]
   500 | 1024.5400 [msec/loop]   | 1.0245 [msec/sel]
   600 | 1024.8340 [msec/loop]   | 1.0248 [msec/sel]
   700 | 1024.3110 [msec/loop]   | 1.0243 [msec/sel]
   800 | 1024.7030 [msec/loop]   | 1.0247 [msec/sel]
   900 | 1024.4560 [msec/loop]   | 1.0245 [msec/sel]
  1000 | 1024.2810 [msec/loop]   | 1.0243 [msec/sel]
  1010 | 1034.4840 [msec/loop]   | 1.0345 [msec/sel]
  1020 | 1044.0490 [msec/loop]   | 1.0440 [msec/sel]
  1030 | 1054.3530 [msec/loop]   | 1.0544 [msec/sel]
  1040 | 1064.6620 [msec/loop]   | 1.0647 [msec/sel]
  1050 | 1074.0980 [msec/loop]   | 1.0741 [msec/sel]
  1060 | 1084.4850 [msec/loop]   | 1.0845 [msec/sel]
  1070 | 1094.1270 [msec/loop]   | 1.0941 [msec/sel]
  1080 | 1104.4080 [msec/loop]   | 1.1044 [msec/sel]
  1090 | 1132.8880 [msec/loop]   | 1.1329 [msec/sel]
  1100 | 1124.2220 [msec/loop]   | 1.1242 [msec/sel]

Comments:
o minimum is 1 msec until 1000 usec, and then it tries to respect the actual number in usec
o usec >= 1 sec not allowed

=========== HP-UX 10.20 ===========

NUM_OF_LOOPS: 10

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |    0.1 [msec/loop]   |   0.0 [msec/sel]
     500 |   97.6 [msec/loop]   |   9.8 [msec/sel]
    1000 |  100.0 [msec/loop]   |  10.0 [msec/sel]
    1500 |  100.0 [msec/loop]   |  10.0 [msec/sel]
...
   14000 |  100.0 [msec/loop]   |  10.0 [msec/sel]
   14500 |  100.2 [msec/loop]   |  10.0 [msec/sel]
   15000 |  199.8 [msec/loop]   |  20.0 [msec/sel]
   15500 |  200.0 [msec/loop]   |  20.0 [msec/sel]
...
   24000 |  200.0 [msec/loop]   |  20.0 [msec/sel]
   24500 |  200.0 [msec/loop]   |  20.0 [msec/sel]
   25000 |  300.0 [msec/loop]   |  30.0 [msec/sel]
   25500 |  300.0 [msec/loop]   |  30.0 [msec/sel]
...
  980000 | 9800.1 [msec/loop]   | 980.0 [msec/sel]
  990000 | 9900.0 [msec/loop]   | 990.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1000000 |    0.1 [msec/loop]   |   0.0 [msec/sel]
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 1010000 |    0.1 [msec/loop]   |   0.0 [msec/sel]

Comments:
o minimum is 10 msec until 1000 usec
o after 1000 it rounds down or up to the next 10 msec
o usec >= 1 sec not allowed

=========== HP-UX 11 ===========

NUM_OF_LOOPS: 10

 Delay   | elapsed Time         | actual Wait
---------------------------------------------
       0 |   92.7 [msec/loop]   |   9.3 [msec/sel]
     500 |   99.9 [msec/loop]   |  10.0 [msec/sel]
    1000 |   99.8 [msec/loop]   |  10.0 [msec/sel]
    1500 |  100.0 [msec/loop]   |  10.0 [msec/sel]
...
    9000 |   99.9 [msec/loop]   |  10.0 [msec/sel]
    9500 |  100.1 [msec/loop]   |  10.0 [msec/sel]
   10000 |  199.9 [msec/loop]   |  20.0 [msec/sel]
   10500 |  199.9
Re: [HACKERS] stuck spinlock
Interesting numbers --- thanks for sending them along. Looks like I was mistaken to think that most platforms would allow tv_usec >= 1 sec. Ah well, another day, another bug...

			regards, tom lane
[HACKERS] stuck spinlock
Can anyone tell me what is going on when I get a stuck spinlock? Is there data corruption or anything else to worry about? I've found some references to spinlocks in the -hackers list, so is this fixed in a later version than beta4 already?

Actually, I was running a stack of pgbench jobs with a varying commit_delay parameter and # of clients, but it doesn't look deterministic in any of their values. I've gotten these fatal errors, with exactly the same data, several times now. I've restarted the postmaster as well as dropped the bench database and recreated it, but it didn't really help. The error still shows up *sometimes*. BTW, I think I didn't see this before, when I was running pgbench only once from the command line, but only since I use the script with the for loop.

Some environment info:

bench=# select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.1beta4 on sparc-sun-solaris2.7, compiled by GCC 2.95.1

checkpoint_timeout = 1800       # range 30-1800
commit_delay = 0                # range 0-1000
debug_level = 0                 # range 0-16
fsync = false
max_connections = 100           # 1-1024
shared_buffers = 4096
sort_mem = 4096
tcpip_socket = true
wal_buffers = 128               # min 4
wal_debug = 0                   # range 0-16
wal_files = 10                  # range 0-64

pgbench -i -s 10 bench
...
PGOPTIONS="-c commit_delay=$del " \
pgbench -c $cli -t 100 -n bench

Thanks,
Peter

=====================================================================
FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.
FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.
Server process (pid 7889) exited with status 6 at Mon Feb 26 09:17:36 2001
Terminating any active server processes...
NOTICE:  Message from PostgreSQL backend:
	The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
	I have rolled back the current transaction and am going to terminate your database system connection and exit.
	Please reconnect to the database system and repeat your query.
The Data Base System is in recovery mode
Server processes were terminated at Mon Feb 26 09:17:36 2001
Reinitializing shared memory and semaphores
DEBUG:  starting up
DEBUG:  database system was interrupted at 2001-02-26 09:17:33
DEBUG:  CheckPoint record at (0, 3648965776)
DEBUG:  Redo record at (0, 3648965776); Undo record at (0, 0); Shutdown FALSE
DEBUG:  NextTransactionId: 1362378; NextOid: 2362993
DEBUG:  database system was not properly shut down; automatic recovery in progress...
DEBUG:  redo starts at (0, 3648965840)
DEBUG:  ReadRecord: record with zero len at (0, 3663163376)
DEBUG:  Formatting logfile 0 seg 218 block 699 at offset 4080
DEBUG:  The last logId/logSeg is (0, 218)
DEBUG:  redo done at (0, 3663163336)

--
Best regards,
Peter Schindler
Synchronicity Inc.           | [EMAIL PROTECTED]
http://www.synchronicity.com | +49 89 89 66 99 42 (Germany)
Re: [HACKERS] stuck spinlock
Peter Schindler [EMAIL PROTECTED] writes:
> FATAL: s_lock(fcc01067) at xlog.c:2088, stuck spinlock. Aborting.

Judging from the line number, this is in CreateCheckPoint. I'm betting that your platform (Solaris 2.7, you said?) has the same odd behavior that I discovered a couple of days ago on HPUX: a select() with a delay of tv_sec = 0, tv_usec = 1000000 doesn't delay 1 second like a reasonable person would expect, but fails instantly with EINVAL. This causes the spinlock timeout in CreateCheckPoint to effectively be only a few microseconds rather than the intended ten minutes. So, if the postmaster happens to fire off a checkpoint process while some regular backend is doing something with the WAL log, kaboom.

In short: please try the latest nightly snapshot (this fix is since beta5, unfortunately) and let me know if you still see a problem.

			regards, tom lane
[HACKERS] Stuck Spinlock (fwd) - m68k architecture, 7.0.3
Has anyone got PostgreSQL 7.0.3 working on the m68k architecture? Russell is trying to install it on m68k and is consistently getting a stuck spinlock in initdb. He used to have 6.3.2 working. Both 6.5.3 and 7.0.3 fail. His message shows that the first attempt to set a lock fails.

------- Forwarded Message

Date:    Mon, 05 Feb 2001 09:03:21 -0500
From:    Russell Hires [EMAIL PROTECTED]
To:      [EMAIL PROTECTED]
Subject: Stuck Spinlock

Hey, here are the spinlock test results...

Thanks!
Russell

rusty@smurfette:~/postgresql-7.0.3/src/backend/storage/buffer$ make s_lock_test
gcc -I../../../include -I../../../backend -O2 -g -g3 -Wall -Wmissing-prototypes -Wmissing-declarations -I../.. -DS_LOCK_TEST=1 s_lock.c -o s_lock_test
s_lock.c:251: warning: return type of `main' is not `int'
./s_lock_test
FATAL: s_lock(80002974) at s_lock.c:260, stuck spinlock. Aborting.
FATAL: s_lock(80002974) at s_lock.c:260, stuck spinlock. Aborting.
make: *** [s_lock_test] Aborted
make: *** Deleting file `s_lock_test'

------- End of Forwarded Message

--
Oliver Elphick                 [EMAIL PROTECTED]
Isle of Wight                  http://www.lfix.co.uk/oliver
PGP: 1024R/32B8FAA1: 97 EA 1D 47 72 3F 28 47 6B 7E 39 CC 56 E4 C1 47
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
"Lift up your heads, O ye gates; and be ye lift up, ye everlasting doors; and the King of glory shall come in. Who is this King of glory? The LORD strong and mighty, the LORD mighty in battle." Psalms 24:7,8