Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-08-13 Thread Peter Eisentraut
On Thu, 2012-08-09 at 16:26 -0500, Merlin Moncure wrote:
 Follow up on this.  It is pl/sh and it is a newline issue: one of the
 developers is using a tool (I think pgadmin?) that is sticking \r
 characters at the end of every line which is throwing off pl/sh's
 shebang parsing.  The issuing query gets an error along the lines of
 'could not exec' and the server goes belly up if there is significant
 concurrent load when that's issued.  This is an out of date pl/sh, so
 I'm going to upgrade it and try and reproduce. If I still can, I'll
 supply a test case.

I had received an independent report of this CR/LF issue and have fixed
it now.  But that doesn't explain why the server would crash.



-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-08-09 Thread Merlin Moncure
On Tue, Jun 26, 2012 at 12:09 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 I suspect (but haven't had time to prove and may not for several days
 -- unfortunately going on vacation momentarily) that this might be
 caused by pl/sh.

 Hm.  The reported symptoms might be explainable if something had caused
 multiple threads to become active within the backend process --- then
 it would be plausible for it to try to do proc_exit cleanup twice.
 Which would explain the first two errors, though I'm not sure how that
 leads to failing to disown the process latch, as the third error
 suggests must have happened.  But I don't know enough about pl/sh to
 know if it could cause threading activation.

 In particular, we have a routine that was
 inadvertently applied to the database with Windows CR/LF instead of
 the normal Linux newline.

 This doesn't seem real promising as an explanation ...

 right -- just a suspicion.  maybe the relevant point was that it
 immediately failed.  operator invoking the busted routine (which I had
 to fix) and the crash were highly correlated, although it does not
 always crash.  yesterday  was very heavy load and today not so much.

Follow up on this.  It is pl/sh and it is a newline issue: one of the
developers is using a tool (I think pgadmin?) that is sticking \r
characters at the end of every line which is throwing off pl/sh's
shebang parsing.  The issuing query gets an error along the lines of
'could not exec' and the server goes belly up if there is significant
concurrent load when that's issued.  This is an out of date pl/sh, so
I'm going to upgrade it and try and reproduce. If I still can, I'll
supply a test case.
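
To illustrate the failure mode described above, here is a minimal sketch in
Python.  It is not pl/sh's actual code: parse_shebang and normalize_newlines
are hypothetical helpers, and the assumption is only that the parser splits
the first line on '\n', so a trailing \r ends up as part of the interpreter
path and the subsequent exec fails.

```python
def parse_shebang(script):
    """Hypothetical shebang parser that splits the first line on '\n' only."""
    first_line = script.split("\n", 1)[0]
    if not first_line.startswith("#!"):
        raise ValueError("script has no shebang line")
    # Only spaces are trimmed, so a trailing '\r' survives in the last token.
    return first_line[2:].strip(" ").split(" ")


def normalize_newlines(script):
    """Convert CR/LF (and stray CR) line endings to plain LF."""
    return script.replace("\r\n", "\n").replace("\r", "\n")


unix_script = "#!/bin/sh\necho hello\n"
windows_script = "#!/bin/sh\r\necho hello\r\n"

print(parse_shebang(unix_script))                         # ['/bin/sh']
print(parse_shebang(windows_script))                      # ['/bin/sh\r']
print(parse_shebang(normalize_newlines(windows_script)))  # ['/bin/sh']
```

Normalizing the function body before it is stored would avoid the bad
interpreter path; it says nothing about why the server crashes afterwards.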

merlin



Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-06-26 Thread Merlin Moncure
On Mon, Jun 25, 2012 at 10:03 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG:  could not
 send data to client: Broken pipe
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  unexpected
 EOF on client connection
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  process 22821
 releasing ProcSignal slot 32, but it contains 0
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  failed to
 find proc 0x7f48617e2ab0 in ProcArray
 [and a bit later]
 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL:  latch already 
 owned

 I think what we're looking at here is a screw-up in the process shutdown
 sequence.  Perhaps caused by bad recovery from an attempt to send an
 error message to the already-disconnected client; but that's just
 speculation, and it's hard to see how to get more info without a core
 dump.

 I wonder whether we shouldn't promote some or all of these three error
 cases to PANIC, as they certainly suggest shared-memory corruption.
 And if it did panic, we could hope to get a core dump for debugging
 purposes.

 Ok, I'll look into reproducing the crash conditions.  Unfortunately
 this is a critical server and it crashed during a time sensitive
 process. I can schedule a maintenance window though but it will have
 to wait a bit.

 merlin

I have some good news: this was reproduced, and I believe it to be
operator invoked:


2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: ERROR:  index
idx_lease_expiremonth2 does not exist
2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: STATEMENT:  DROP
INDEX idx_Lease_ExpireMonth2;
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  unexpected EOF on
client connection
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  process 10340
releasing ProcSignal slot 5, but it contains 0
2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  failed to find proc
0x7f48617e6310 in ProcArray
2012-06-26 09:16:48 CDT [rms@ysanalysis]: FATAL:  latch already owned
2012-06-26 09:16:48 CDT [@]: LOG:  server process (PID 10928) exited
with exit code 1
2012-06-26 09:16:48 CDT [@]: LOG:  terminating any other active server processes
2012-06-26 09:16:48 CDT [postgres@postgres]: WARNING:  terminating
connection because of crash of another server process
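
For anyone wanting to check their own logs for the same pattern, a rough
sketch in Python follows.  The names SIGNATURES and find_signatures are made
up for illustration, the regular expressions are inferred only from the
messages quoted in this thread, and the sample lines are the entries above
with the mail wrapping undone.

```python
import re

# Patterns inferred from the three messages seen in this thread (assumption:
# each log entry is on a single line in the real log file).
SIGNATURES = {
    "procsignal": re.compile(r"releasing ProcSignal slot \d+, but it contains \d+"),
    "procarray": re.compile(r"failed to find proc 0x[0-9a-f]+ in ProcArray"),
    "latch": re.compile(r"latch already owned"),
}


def find_signatures(log_lines):
    """Return (signature name, line) pairs for every matching log line."""
    hits = []
    for line in log_lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits


sample = [
    "2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  unexpected EOF on client connection",
    "2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  process 10340 releasing ProcSignal slot 5, but it contains 0",
    "2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG:  failed to find proc 0x7f48617e6310 in ProcArray",
    "2012-06-26 09:16:48 CDT [rms@ysanalysis]: FATAL:  latch already owned",
]

for name, line in find_signatures(sample):
    print(name, "|", line)
```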

...investigating...

merlin



Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-06-26 Thread Merlin Moncure
On Tue, Jun 26, 2012 at 9:19 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Ok, I'll look into reproducing the crash conditions.  Unfortunately
 this is a critical server and it crashed during a time sensitive
 process. I can schedule a maintenance window though but it will have
 to wait a bit.

I suspect (but haven't had time to prove and may not for several days
-- unfortunately going on vacation momentarily) that this might be
caused by pl/sh.  In particular, we have a routine that was
inadvertently applied to the database with Windows CR/LF instead of
the normal Linux newline.  This is an older version of pl/sh (1.3)
and may be missing some relevant bug fixes.  Just wanted to give a
heads up so that you didn't waste time investigating.  I will follow
up on this eventually.

merlin



Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-06-26 Thread Merlin Moncure
On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 I suspect (but haven't had time to prove and may not for several days
 -- unfortunately going on vacation momentarily) that this might be
 caused by pl/sh.

 Hm.  The reported symptoms might be explainable if something had caused
 multiple threads to become active within the backend process --- then
 it would be plausible for it to try to do proc_exit cleanup twice.
 Which would explain the first two errors, though I'm not sure how that
 leads to failing to disown the process latch, as the third error
 suggests must have happened.  But I don't know enough about pl/sh to
 know if it could cause threading activation.

 In particular, we have a routine that was
 inadvertently applied to the database with Windows CR/LF instead of
 the normal Linux newline.

 This doesn't seem real promising as an explanation ...

Right -- just a suspicion.  Maybe the relevant point was that it
immediately failed.  The operator invoking the busted routine (which I
had to fix) and the crash were highly correlated, although it does not
always crash.  Yesterday was very heavy load and today not so much.
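
As a toy illustration of the double-cleanup hypothesis quoted above (not
PostgreSQL source; the SlotTable class and its methods are invented for this
sketch), releasing the same slot twice is enough to produce a message of the
form seen in the log.  It does not model the latch error.

```python
class SlotTable:
    """Toy stand-in for a shared array of per-process signal slots."""

    def __init__(self, nslots):
        self.slots = [0] * nslots   # 0 means "slot is free"
        self.log = []

    def claim(self, index, pid):
        self.slots[index] = pid

    def release(self, index, pid):
        owner = self.slots[index]
        if owner != pid:
            # The slot no longer holds our pid: report it and leave it alone.
            self.log.append(
                f"process {pid} releasing ProcSignal slot {index}, "
                f"but it contains {owner}"
            )
            return
        self.slots[index] = 0


table = SlotTable(64)
table.claim(32, 22821)
table.release(32, 22821)   # normal exit cleanup
table.release(32, 22821)   # second cleanup pass finds the slot already cleared
print(table.log)
```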

merlin



Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-06-25 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG:  could not
 send data to client: Broken pipe
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  unexpected
 EOF on client connection
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  process 22821
 releasing ProcSignal slot 32, but it contains 0
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  failed to
 find proc 0x7f48617e2ab0 in ProcArray
 [and a bit later]
 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL:  latch already owned

I think what we're looking at here is a screw-up in the process shutdown
sequence.  Perhaps caused by bad recovery from an attempt to send an
error message to the already-disconnected client; but that's just
speculation, and it's hard to see how to get more info without a core
dump.

I wonder whether we shouldn't promote some or all of these three error
cases to PANIC, as they certainly suggest shared-memory corruption.
And if it did panic, we could hope to get a core dump for debugging
purposes.

regards, tom lane



Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0

2012-06-25 Thread Merlin Moncure
On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG:  could not
 send data to client: Broken pipe
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  unexpected
 EOF on client connection
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  process 22821
 releasing ProcSignal slot 32, but it contains 0
 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG:  failed to
 find proc 0x7f48617e2ab0 in ProcArray
 [and a bit later]
 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL:  latch already 
 owned

 I think what we're looking at here is a screw-up in the process shutdown
 sequence.  Perhaps caused by bad recovery from an attempt to send an
 error message to the already-disconnected client; but that's just
 speculation, and it's hard to see how to get more info without a core
 dump.

 I wonder whether we shouldn't promote some or all of these three error
 cases to PANIC, as they certainly suggest shared-memory corruption.
 And if it did panic, we could hope to get a core dump for debugging
 purposes.

Ok, I'll look into reproducing the crash conditions.  Unfortunately
this is a critical server and it crashed during a time sensitive
process. I can schedule a maintenance window though but it will have
to wait a bit.

merlin
