Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Thu, 2012-08-09 at 16:26 -0500, Merlin Moncure wrote: Follow up on this. It is pl/sh and it is a newline issue: one of the developers is using a tool (I think pgadmin?) that is sticking \r characters at the end of every line which is throwing off pl/sh's shebang parsing. The issuing query gets an error along the lines of 'could not exec' and the server goes belly up if there is significant concurrent load when that's issued. This is an out of date pl/sh, so I'm going to upgrade it and try and reproduce. If I still can, I'll supply a test case. I had received an independent report of this cr/lf issue and fixed it now. But that doesn't explain why the server would crash. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Tue, Jun 26, 2012 at 12:09 PM, Merlin Moncure mmonc...@gmail.com wrote: On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: I suspect (but haven't had time to prove and may not for several days -- unfortunately going on vacation momentarily) that this might be caused by pl/sh. Hm. The reported symptoms might be explainable if something had caused multiple threads to become active within the backend process --- then it would be plausible for it to try to do proc_exit cleanup twice. Which would explain the first two errors, though I'm not sure how that leads to failing to disown the process latch, as the third error suggests must have happened. But I don't know enough about pl/sh to know if it could cause threading activation. In particular, we have a routine that was inadvertently applied to the database in with windows cr/lf instead of the normal linux newline. This doesn't seem real promising as an explanation ... right -- just a suspicion. maybe the relevant point was that it immediately failed. operator invoking the busted routine (which I had to fix) and the crash were highly correlated, although it does not always crash. yesterday was very heavy load and today not so much. Follow up on this. It is pl/sh and it is a newline issue: one of the developers is using a tool (I think pgadmin?) that is sticking \r characters at the end of every line which is throwing off pl/sh's shebang parsing. The issuing query gets an error along the lines of 'could not exec' and the server goes belly up if there is significant concurrent load when that's issued. This is an out of date pl/sh, so I'm going to upgrade it and try and reproduce. If I still can, I'll supply a test case. merlin -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Mon, Jun 25, 2012 at 10:03 AM, Merlin Moncure mmonc...@gmail.com wrote: On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: could not send data to client: Broken pipe 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: unexpected EOF on client connection 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: process 22821 releasing ProcSignal slot 32, but it contains 0 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: failed to find proc 0x7f48617e2ab0 in ProcArray [and a bit later] 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: latch already owned I think what we're looking at here is a screw-up in the process shutdown sequence. Perhaps caused by bad recovery from an attempt to send an error message to the already-disconnected client; but that's just speculation, and it's hard to see how to get more info without a core dump. I wonder whether we shouldn't promote some or all of these three error cases to PANIC, as they certainly suggest shared-memory corruption. And if it did panic, we could hope to get a core dump for debugging purposes. Ok, I'll look into reproducing the crash conditions. Unfortunately this is a critical server and it crashed during a time sensitive process. I can schedule a maintenance window though but it will have to wait a bit. merlin I have some good news: this was reproduce and i I believe it to be operator invoked: 2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: ERROR: index idx_lease_expiremonth2 does not exist 2012-06-26 09:12:19 CDT [postgres@ysanalysis_hes]: STATEMENT: DROP INDEX idx_Lease_ExpireMonth2; 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: unexpected EOF on client connection 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: process 10340 releasing ProcSignal slot 5, but it contains 0 2012-06-26 09:15:10 CDT [rms@ysanalysis]: LOG: failed to find proc 0x7f48617e6310 in ProcArray 2012-06-26 09:16:48 CDT [rms@ysanalysis]: FATAL: latch already owned 2012-06-26 09:16:48 CDT [@]: LOG: server process (PID 10928) exited with exit code 1 2012-06-26 09:16:48 CDT [@]: LOG: terminating any other active server processes 2012-06-26 09:16:48 CDT [postgres@postgres]: WARNING: terminating connection because of crash of another server process ...investigating... merlin -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Tue, Jun 26, 2012 at 9:19 AM, Merlin Moncure mmonc...@gmail.com wrote: Ok, I'll look into reproducing the crash conditions. Unfortunately this is a critical server and it crashed during a time sensitive process. I can schedule a maintenance window though but it will have to wait a bit. I suspect (but haven't had time to prove and may not for several days -- unfortunately going on vacation momentarily) that this might be caused by pl/sh. In particular, we have a routine that was inadvertently applied to the database in with windows cr/lf instead of the normal linux newline. This is an older version of plgplsh (1.3) and is maybe minus some relevant bug fixes. Just wanted to give a heads up so that you didn't waste time investigating. I will follow up on this eventually. merlin -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Tue, Jun 26, 2012 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: I suspect (but haven't had time to prove and may not for several days -- unfortunately going on vacation momentarily) that this might be caused by pl/sh. Hm. The reported symptoms might be explainable if something had caused multiple threads to become active within the backend process --- then it would be plausible for it to try to do proc_exit cleanup twice. Which would explain the first two errors, though I'm not sure how that leads to failing to disown the process latch, as the third error suggests must have happened. But I don't know enough about pl/sh to know if it could cause threading activation. In particular, we have a routine that was inadvertently applied to the database in with windows cr/lf instead of the normal linux newline. This doesn't seem real promising as an explanation ... right -- just a suspicion. maybe the relevant point was that it immediately failed. operator invoking the busted routine (which I had to fix) and the crash were highly correlated, although it does not always crash. yesterday was very heavy load and today not so much. merlin -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
Merlin Moncure mmonc...@gmail.com writes: 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: could not send data to client: Broken pipe 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: unexpected EOF on client connection 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: process 22821 releasing ProcSignal slot 32, but it contains 0 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: failed to find proc 0x7f48617e2ab0 in ProcArray [and a bit later] 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: latch already owned I think what we're looking at here is a screw-up in the process shutdown sequence. Perhaps caused by bad recovery from an attempt to send an error message to the already-disconnected client; but that's just speculation, and it's hard to see how to get more info without a core dump. I wonder whether we shouldn't promote some or all of these three error cases to PANIC, as they certainly suggest shared-memory corruption. And if it did panic, we could hope to get a core dump for debugging purposes. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] server crash with process 22821 releasing ProcSignal slot 32, but it contains 0
On Mon, Jun 25, 2012 at 9:57 AM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: 2012-06-25 09:08:08 CDT [postgres@ysanalysis_hes]: LOG: could not send data to client: Broken pipe 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: unexpected EOF on client connection 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: process 22821 releasing ProcSignal slot 32, but it contains 0 2012-06-25 09:08:10 CDT [postgres@ysanalysis_hes]: LOG: failed to find proc 0x7f48617e2ab0 in ProcArray [and a bit later] 2012-06-25 09:08:24 CDT [postgres@ysanalysis_hes]: FATAL: latch already owned I think what we're looking at here is a screw-up in the process shutdown sequence. Perhaps caused by bad recovery from an attempt to send an error message to the already-disconnected client; but that's just speculation, and it's hard to see how to get more info without a core dump. I wonder whether we shouldn't promote some or all of these three error cases to PANIC, as they certainly suggest shared-memory corruption. And if it did panic, we could hope to get a core dump for debugging purposes. Ok, I'll look into reproducing the crash conditions. Unfortunately this is a critical server and it crashed during a time sensitive process. I can schedule a maintenance window though but it will have to wait a bit. merlin -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs