On Tue, Aug 24, 2010 at 9:58 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Bruce Momjian <br...@momjian.us> writes:
>> Robert Haas wrote:
>>> Yeah, that seems very plausible, although exactly how to verify I don't
>>> know.
>
>> And here is confirmation from the Microsoft web site:
>
>>     In some instances, calling GetExitCode() against the failed process
>>     indicates the following exit code:
>>     128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.
>
> Given the existence of the deadman switch mechanism (which I hadn't
> remembered when this thread started), I'm coming around to the idea that
> we could just treat exit(128) as nonfatal on Windows.  If for some
> reason the child hadn't died instantly at startup, the deadman switch
> would distinguish that from the case described here.

So the options are:

(1) If running on Windows and the exit code is 128 and the deadman
switch is not engaged, don't crash-and-restart.

(2) If running on Windows, create a mutex in the parent process and
take it in the child; if the mutex has not been taken, don't
crash-and-restart.

There is some amount of user code (I'm not sure precisely how much)
that runs after shared memory is mapped and before the deadman switch
is engaged.  If we go with option #1, it would probably behoove us to
try to minimize the amount of such code (at least in HEAD).  There is
probably not a great deal of danger that we could manage to scribble
on shared memory and then exit normally (rather than via signal),
never mind the need to exit with exactly 128.  But "not a great deal"
is not the same as "none".

If we go with option #2, the principal danger seems to be that the
code Magnus wrote will turn out to be less robust than we might hope;
for example, it might not work on all versions of Windows, or it might
be prone to some other installation-dependent mischief.

Another question is how far either of these fixes could be
back-patched.  I believe the deadman switch only exists as far back as
8.4, but the original commit message mentioned the possibility of
eventually back-patching it further:

    Although this problem is of long standing, the lack of field complaints
    seems to mean it's not critical enough to risk back-patching; at least
    not till we get some more testing of this mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company