Re: [HACKERS] windows doesn't notice backend death
I wrote: Attached is a proposed patch for the dead man switch idea. ... Barring objections I'll go ahead and apply this to HEAD. I'm wondering whether we are sufficiently worried about the Windows task manager issue to risk back-patching into 8.3 and 8.2 ... comments? For lack of response, I assume no one wants to back-patch this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: I wrote: Attached is a proposed patch for the dead man switch idea. ... Barring objections I'll go ahead and apply this to HEAD. I'm wondering whether we are sufficiently worried about the Windows task manager issue to risk back-patching into 8.3 and 8.2 ... comments? For lack of response, I assume no one wants to back-patch this. No, I think it's unnecessary, now we understand what's going on. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: I wrote: Attached is a proposed patch for the dead man switch idea. ... Barring objections I'll go ahead and apply this to HEAD. I'm wondering whether we are sufficiently worried about the Windows task manager issue to risk back-patching into 8.3 and 8.2 ... comments? For lack of response, I assume no one wants to back-patch this. Hmm. I didn't have time to look it over :( In general, killing server processes from task manager in windows is less likely to be a popular thing than using kill on unix (and it still surprises me how many people that consider themselves experts still do kill -9 by defualt whenever they want to stop something..) Given that it actually doesn't notice it if we do, we might have people doing this that don't know about it. But I think we can at least keep it HEAD only for a while until it's seen some productoin level testing... //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Magnus Hagander mag...@hagander.net writes: FWIW, this certainly used to work. So we've either broken this recently, or it's always been broken on Vista (I've never tried it myself on Vista, only 2000, XP and 2003). Maybe a quick check if it still works on non-Vista versions would be in order, to eliminate one or the other of those theories. I'll have to take this back. I just tested back to 8.2.latest on XP, and it has the same behavior. We don't notice a kill -9 from the task manager. If it worked, it's either 8.2, or it's something that changed with a service pack to XP as well. //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Well, I can tell you that it is getting an exit code of 1, which is why the postmaster isn't restarting. Blech. Count on Windows to find a way to break things. Yup. A quick search gives this: http://support.microsoft.com/kb/155075 which clearly states that task manager simply uses TerminateProcess(). And it's probably hard-coded with an exit code. Which really means this can't have worked, and my memory was wrong. I still claim it works when the process actually *crashes* though - since that will give you one of those funky huge exit numbers. That raises two questions in my mind. First, is that the behaviour we expect when we kill the backend this way? And second, why is it still showing up in the output of pg_stat_activity? Well, if the process is being hard-killed without an opportunity to run through proc_exit(), then yes it is going to still show up in pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry. That's certainly what task manager does. TerminateProcess() just stops all threads and removes the process. No chance for atexit() stuff to run. The problem here is that we need to be able to distinguish a task manager kill from a voluntary exit(1). Have M$ really been stupid enough to make an external kill look just like an exit() call? It certainly looks that way. It probably goes back to the use threads, not processes, on this platform design... //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan wrote: justin wrote: Tom Lane wrote: Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane kind of :-( Would it not be easy to set the normal exitcode to something other than 1 to see the difference ExitProcess() Not really, as Tom showed later this is an example of a more general problem. I think his solution of detecting when backends have cleaned up nicely and when they have not is the right way to go. Well, if we picked a different value than 1, the probability would certainly go up that things work. I bet most external libs that do such an evil thing would be doing exit(1). If we picked a return code way up in the space (say something well above 2^16), the likelihood that a lib would be using exactly that one decreases drastically. But there's no guarantee - so it'd just be a better workaround. //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: I wrote: The first idea that comes to mind is to have some sort of dead man switch that flags an active backend and is reset by proc_exit() after it's finished cleaning up everything else. If the postmaster sees this flag still set after backend exit, then it treats the backend as having crashed regardless of what the reported exit code is. Another thought that came to mind: we could set up an atexit hook that does all the work that proc_exit() currently does, and reduce proc_exit() itself to just an exit() call. psql already relies on having atexit (or on_exit) so this doesn't appear to add any new portability issues. This will probably not fix the Vista taskmanager issue, since I'll bet it's not running atexit hooks anyway. What it would do is improve the situation so that a clueless exit() call would be no worse than elog(FATAL), rather than triggering a DB-wide restart as the dead man switch would do. This sounds like a good idea in general - because it will avoid having to restart on unix. It'd still have to be combined with the dead-man-switch idea to defend completely. But it could be worthwhile doing anyway, for other platforms. //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Magnus Hagander mag...@hagander.net writes: Tom Lane wrote: Another thought that came to mind: we could set up an atexit hook that This sounds like a good idea in general - because it will avoid having to restart on unix. It'd still have to be combined with the dead-man-switch idea to defend completely. But it could be worthwhile doing anyway, for other platforms. Right, exactly. I'll try to produce a patch today. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
I wrote: I don't think we'll be able to prevent PHP from doing that :-(. But it now seems clear that we should try to make the database as a whole recover with some degree of grace. I'll go work up a patch. Attached is a proposed patch for the dead man switch idea. The switch is armed when a child process acquires a regular PGPROC and disarmed when the PGPROC is released. (So there is no coverage for auxiliary processes, but I doubt we need that.) I chose to put the management code into pmsignal.c --- if you hold your head at the proper angle, this can be seen as a form of child-to-postmaster signaling, so that seemed like a reasonable place. The array slot number is passed down to child processes in the same way as for MyCancelKey. Also, since the number of array slots needed is exactly the same as the size of the ShmemBackendArray for the EXEC_BACKEND case, I tweaked the code for the latter a little bit to avoid duplicate array-searching --- the array slot number assigned for the deadman switch flag is also used to index ShmemBackendArray. One bit of ugliness is that in the EXEC_BACKEND case, InitProcess is done before CreateSharedMemoryAndSemaphores; so it's necessary to have special-case code to pass down the shmem state pointer for pmsignal.c, similarly to what we do for ProcGlobal and a few other pointers. We could avoid that by arming the switch only sometime after CreateSharedMemoryAndSemaphores ... but if we're gonna have this mechanism at all, I'd like it to cover the process's entire ownership of shared-memory objects, not only part of it. This doesn't include the proposed change to use an atexit callback to ensure that proc_exit cleanup gets done --- that looks like a trivial refactoring in proc_exit, but it's a separate idea anyway. Barring objections I'll go ahead and apply this to HEAD. I'm wondering whether we are sufficiently worried about the Windows task manager issue to risk back-patching into 8.3 and 8.2 ... comments? regards, tom lane binzFCsl5KUH7.bin Description: dead-man-switch-1.patch.gz -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan wrote: Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Tom Lane wrote: Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? If that were the case then it would not have the dead process still listed as a live backend, ISTM, which it does. The postmaster does not control the content of the pg_stat_activity view. Well, I'm not I know how to find out the answer to your question. I could try attaching a debugger to the postmaster - if I knew where to put a breakpoint. I'd start at pgwin32_deadchild_callback(). That's where the waiting thread should activate once a child goes away. That one should post a notice to the win32ChildQueue, which is polled in win32_waitpid() - that's a second good point for a breakpoint. This in turn should make things happen up in reaper() - as Alvaro suggests, a third good place for a breakpoint. FWIW, this certainly used to work. So we've either broken this recently, or it's always been broken on Vista (I've never tried it myself on Vista, only 2000, XP and 2003). //Magnus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Magnus Hagander mag...@hagander.net writes: FWIW, this certainly used to work. So we've either broken this recently, or it's always been broken on Vista (I've never tried it myself on Vista, only 2000, XP and 2003). Maybe a quick check if it still works on non-Vista versions would be in order, to eliminate one or the other of those theories. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Magnus Hagander mag...@hagander.net writes: FWIW, this certainly used to work. So we've either broken this recently, or it's always been broken on Vista (I've never tried it myself on Vista, only 2000, XP and 2003). Maybe a quick check if it still works on non-Vista versions would be in order, to eliminate one or the other of those theories. Well, I can tell you that it is getting an exit code of 1, which is why the postmaster isn't restarting. That raises two questions in my mind. First, is that the behaviour we expect when we kill the backend this way? And second, why is it still showing up in the output of pg_stat_activity? Meanwhile, I guess I can try to create a loadable function that will crash a backend badly so I can try to reproduce my client's problem. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan and...@dunslane.net writes: Well, I can tell you that it is getting an exit code of 1, which is why the postmaster isn't restarting. Blech. Count on Windows to find a way to break things. That raises two questions in my mind. First, is that the behaviour we expect when we kill the backend this way? And second, why is it still showing up in the output of pg_stat_activity? Well, if the process is being hard-killed without an opportunity to run through proc_exit(), then yes it is going to still show up in pg_stat_activity. It's pgstat_beshutdown_hook that removes that entry. The problem here is that we need to be able to distinguish a task manager kill from a voluntary exit(1). Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
I wrote: Andrew Dunstan and...@dunslane.net writes: Well, I can tell you that it is getting an exit code of 1, which is why the postmaster isn't restarting. Blech. Count on Windows to find a way to break things. I reflected on this a bit more. Even if we find a way around this particular task-manager behavior, it seems to me there is a generic problem here. If some bit of clueless code does exit(0) or exit(1) inside a backend session, the postmaster will think everything is fine, but actually we have an un-cleaned-up session that's probably still holding locks etc. It's fairly easy to demonstrate the issue: pl_regression=# create language plperlu; CREATE LANGUAGE pl_regression=# create or replace function trouble() returns void as pl_regression-# $$ exit 0; $$ language plperlu; CREATE FUNCTION pl_regression=# select trouble(); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. pl_regression=# select * from pg_stat_activity; datid |datname| procpid | usesysid | usename | current_query | waiting | xact_start | query_start | backend_start | client_addr | client_port ---+---+-+--+-+-+-+---+---+---+-+- 40179 | pl_regression | 20847 | 10 | tgl | select trouble(); | f | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 | | -1 40179 | pl_regression | 20855 | 10 | tgl | select * from pg_stat_activity; | f | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 | | -1 (2 rows) Up to now we've always just dismissed the above possibility as superusers should know better, but I think there's a reasonable case to be made that this is an obvious failure mode and we should put a bit more effort into being robust against it. With more and more external code being routinely run in the backend, who wants to swear that there is no exit(1) in the guts of libperl or libxml or whatever? The first idea that comes to mind is to have some sort of dead man switch that flags an active backend and is reset by proc_exit() after it's finished cleaning up everything else. If the postmaster sees this flag still set after backend exit, then it treats the backend as having crashed regardless of what the reported exit code is. We could implement this via an array of sig_atomic_t in shared memory, so as to minimize the postmaster's entanglement with shared memory (it'd be no worse than the old WIN32-specific child pid arrays). Or maybe there's a better way. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane kind of :-( Everything i have read seems to point the Task Manager calls TerminateProcess() in the kernel32.dll and passes a 1 setting the exitcode to 1. I have not found anything clearly stating that, yet logic points that is what its doing. http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx Would it not be easy to set the normal exitcode to something other than 1 to see the difference ExitProcess() http://msdn.microsoft.com/en-us/library/ms682658(VS.85).aspx this article maybe helpful. http://msdn.microsoft.com/en-us/library/ms686722(VS.85).aspx BTW tested on Windows 2003 SP1 this problem shows up in Postgresql 8.3.3
Re: [HACKERS] windows doesn't notice backend death
justin wrote: Tom Lane wrote: Have M$ really been stupid enough to make an external kill look just like an exit() call? regards, tom lane kind of :-( Would it not be easy to set the normal exitcode to something other than 1 to see the difference ExitProcess() Not really, as Tom showed later this is an example of a more general problem. I think his solution of detecting when backends have cleaned up nicely and when they have not is the right way to go. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan wrote: justin wrote: Would it not be easy to set the normal exitcode to something other than 1 to see the difference ExitProcess() Not really, as Tom showed later this is an example of a more general problem. I think his solution of detecting when backends have cleaned up nicely and when they have not is the right way to go. cheers andrew Stupid thought why can the some clueless code set the exit status to crashed status??? Would it not be more prudent to remove that ability??? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
justin wrote: Stupid thought why can the some clueless code set the exit status to crashed status??? Would it not be more prudent to remove that ability??? You're missing the point. The danger comes from code that we don't control. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
I wrote: The first idea that comes to mind is to have some sort of dead man switch that flags an active backend and is reset by proc_exit() after it's finished cleaning up everything else. If the postmaster sees this flag still set after backend exit, then it treats the backend as having crashed regardless of what the reported exit code is. Another thought that came to mind: we could set up an atexit hook that does all the work that proc_exit() currently does, and reduce proc_exit() itself to just an exit() call. psql already relies on having atexit (or on_exit) so this doesn't appear to add any new portability issues. This will probably not fix the Vista taskmanager issue, since I'll bet it's not running atexit hooks anyway. What it would do is improve the situation so that a clueless exit() call would be no worse than elog(FATAL), rather than triggering a DB-wide restart as the dead man switch would do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Up to now we've always just dismissed the above possibility as superusers should know better, but I think there's a reasonable case to be made that this is an obvious failure mode and we should put a bit more effort into being robust against it. With more and more external code being routinely run in the backend, who wants to swear that there is no exit(1) in the guts of libperl or libxml or whatever? FWIW there is (or there was, last time I looked) an exit(1) call in the guts of the PHP library that PL/php uses, which is triggered when the memory used goes over the configured memory limit. It was very easily triggered with some of the test functions we had on our regression tests, and the only solution was to kludge up the limit. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Alvaro Herrera alvhe...@commandprompt.com writes: FWIW there is (or there was, last time I looked) an exit(1) call in the guts of the PHP library that PL/php uses, which is triggered when the memory used goes over the configured memory limit. It was very easily triggered with some of the test functions we had on our regression tests, and the only solution was to kludge up the limit. I don't think we'll be able to prevent PHP from doing that :-(. But it now seems clear that we should try to make the database as a whole recover with some degree of grace. I'll go work up a patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] windows doesn't notice backend death
While trying to test Tom's theory about Windows not being able to reinitialise shared memory when a backend crashes, I was doing some testing on my Vista box. I killed a backend using the task manager, and the postmaster never noticed. pg_cancel_backend() reported correctly that the process didn't exist, but pg_stat_activity still had it listed as IDLE, and the server certainly didn't re-initialise as it would have on Unix. This seems a fairly serious bug. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan and...@dunslane.net writes: ... I killed a backend using the task manager, and the postmaster never noticed. Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: ... I killed a backend using the task manager, and the postmaster never noticed. Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? If that were the case then it would not have the dead process still listed as a live backend, ISTM, which it does. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan and...@dunslane.net writes: Tom Lane wrote: Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? If that were the case then it would not have the dead process still listed as a live backend, ISTM, which it does. The postmaster does not control the content of the pg_stat_activity view. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Tom Lane wrote: Ick. Is it possible that the postmaster did get a report, but thought it was normal session termination? If so, how could we distinguish? If that were the case then it would not have the dead process still listed as a live backend, ISTM, which it does. The postmaster does not control the content of the pg_stat_activity view. Well, I'm not I know how to find out the answer to your question. I could try attaching a debugger to the postmaster - if I knew where to put a breakpoint. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] windows doesn't notice backend death
Andrew Dunstan wrote: Well, I'm not I know how to find out the answer to your question. I could try attaching a debugger to the postmaster - if I knew where to put a breakpoint. Did you try reaper()? It's the SIGCHLD handler. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers