[HACKERS] occasional startup failures

2012-03-25 Thread Andrew Dunstan


Every so often buildfarm animals (nightjar and raven recently, for 
example) report failures on starting up the postmaster. It appears that 
these failures are due to the postmaster not creating the pid file 
within 5 seconds, and so the logic in commit 
0bae3bc9be4a025df089f0a0c2f547fa538a97bc kicks in. Unfortunately, when 
this happens the postmaster has in fact sometimes started up, and the 
end result is that subsequent buildfarm runs will fail when they detect 
that there is already a postmaster listening on the port, and without 
manual intervention to kill the rogue postmaster this continues endlessly.


I can probably add some logic to the buildfarm script to try to detect 
this condition and kill an errant postmaster so subsequent runs don't 
get affected, but that seems to be avoiding a problem rather than fixing 
it. I'm not sure what we can do to improve it otherwise, though.


Thoughts?

cheers

andrew
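
The failure mode described above can be illustrated with a small sketch of a bounded wait. This is not the code from the commit cited in the message; the function name wait_for_pidfile and its parameters are hypothetical. It only shows why a guessed deadline can misreport: a 0 result means the file has not appeared yet, not that the process has died.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Poll for a pid file (hypothetical helper, for illustration only).
 *
 * Returns 1 if the file exists within max_tries polls spaced interval_ms
 * apart, 0 if the deadline passes first.  A 0 result cannot distinguish
 * "the server died" from "the server is merely slow to write the file".
 */
static int
wait_for_pidfile(const char *path, int max_tries, long interval_ms)
{
    struct stat     st;
    struct timespec pause;

    assert(path != NULL && interval_ms >= 0);

    pause.tv_sec = interval_ms / 1000;
    pause.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (int i = 0; i < max_tries; i++)
    {
        if (stat(path, &st) == 0)
            return 1;           /* file is there */
        nanosleep(&pause, NULL);
    }

    /* one last look after the final sleep */
    return stat(path, &st) == 0;
}
```

A caller that treats a 0 result as "startup failed" will leave a slow but healthy server running, which matches the rogue postmaster symptom reported by the buildfarm.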

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] occasional startup failures

2012-03-25 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
> Every so often buildfarm animals (nightjar and raven recently, for
> example) report failures on starting up the postmaster. It appears that
> these failures are due to the postmaster not creating the pid file
> within 5 seconds, and so the logic in commit
> 0bae3bc9be4a025df089f0a0c2f547fa538a97bc kicks in. Unfortunately, when
> this happens the postmaster has in fact sometimes started up, and the
> end result is that subsequent buildfarm runs will fail when they detect
> that there is already a postmaster listening on the port, and without
> manual intervention to kill the rogue postmaster this continues endlessly.
>
> I can probably add some logic to the buildfarm script to try to detect
> this condition and kill an errant postmaster so subsequent runs don't
> get affected, but that seems to be avoiding a problem rather than fixing
> it. I'm not sure what we can do to improve it otherwise, though.

Yeah, this has been discussed before.  IMO the only real fix is to
arrange things so that the postmaster process is an immediate child of
pg_ctl, allowing pg_ctl to know its PID directly and not have to rely
on the pidfile appearing before it can detect whether the postmaster
is still alive.  Then there is no need for a guesstimated timeout.
That means not using system() anymore, but rather fork/exec, which
mainly implies having to write our own code for stdio redirection.
So that's certainly doable if a bit tedious.  I have no idea about
the Windows side of it though.

regards, tom lane
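
As a rough illustration of the fork/exec approach described above, here is a minimal POSIX sketch. It is not pg_ctl's code: the helper names spawn_with_log and child_is_alive are hypothetical, and the choice to append stdout and stderr to a log file and read stdin from /dev/null is an assumption about what the hand-written stdio redirection would need to do.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start argv[0] as an immediate child (hypothetical helper).
 * Returns the child's PID in the parent, or -1 if fork() fails.
 */
static pid_t
spawn_with_log(char *const argv[], const char *logfile)
{
    pid_t   pid;
    int     logfd;
    int     nullfd;

    assert(argv != NULL && argv[0] != NULL && logfile != NULL);

    pid = fork();
    if (pid != 0)
        return pid;             /* parent (pid > 0) or fork failure (-1) */

    /* Child: do by hand the redirection the shell did under system(). */
    logfd = open(logfile, O_WRONLY | O_CREAT | O_APPEND, 0600);
    nullfd = open("/dev/null", O_RDONLY);
    if (logfd < 0 || nullfd < 0)
        _exit(126);
    if (dup2(nullfd, STDIN_FILENO) < 0 ||
        dup2(logfd, STDOUT_FILENO) < 0 ||
        dup2(logfd, STDERR_FILENO) < 0)
        _exit(126);
    if (logfd > STDERR_FILENO)
        close(logfd);
    if (nullfd > STDERR_FILENO)
        close(nullfd);

    execvp(argv[0], argv);
    _exit(127);                 /* reached only if exec failed */
}

/*
 * Non-blocking liveness check on a direct child.
 * Returns 1 if still running, 0 if it has exited (status in *status),
 * -1 on error (for example, the child was already reaped).
 */
static int
child_is_alive(pid_t pid, int *status)
{
    pid_t   r = waitpid(pid, status, WNOHANG);

    if (r == 0)
        return 1;
    if (r == pid)
        return 0;
    return -1;
}
```

Because the parent holds the PID of its own child, it can ask the kernel whether that child has exited instead of inferring its state from a file and a timeout. A real implementation would still have to decide how to detach the server from the launching process once startup is confirmed, which this sketch does not address.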



Re: [HACKERS] occasional startup failures

2012-03-25 Thread Magnus Hagander
On Sun, Mar 25, 2012 at 18:59, Tom Lane t...@sss.pgh.pa.us wrote:
> Andrew Dunstan and...@dunslane.net writes:
>> Every so often buildfarm animals (nightjar and raven recently, for
>> example) report failures on starting up the postmaster. It appears that
>> these failures are due to the postmaster not creating the pid file
>> within 5 seconds, and so the logic in commit
>> 0bae3bc9be4a025df089f0a0c2f547fa538a97bc kicks in. Unfortunately, when
>> this happens the postmaster has in fact sometimes started up, and the
>> end result is that subsequent buildfarm runs will fail when they detect
>> that there is already a postmaster listening on the port, and without
>> manual intervention to kill the rogue postmaster this continues endlessly.
>>
>> I can probably add some logic to the buildfarm script to try to detect
>> this condition and kill an errant postmaster so subsequent runs don't
>> get affected, but that seems to be avoiding a problem rather than fixing
>> it. I'm not sure what we can do to improve it otherwise, though.
>
> Yeah, this has been discussed before.  IMO the only real fix is to
> arrange things so that the postmaster process is an immediate child of
> pg_ctl, allowing pg_ctl to know its PID directly and not have to rely
> on the pidfile appearing before it can detect whether the postmaster
> is still alive.  Then there is no need for a guesstimated timeout.
> That means not using system() anymore, but rather fork/exec, which
> mainly implies having to write our own code for stdio redirection.
> So that's certainly doable if a bit tedious.  I have no idea about
> the Windows side of it though.

We already do something like this on Win32 - at least one reason being
dealing with restricted tokens. Right now we just close the handles to
the child, but we could easily keep those around for doing this type
of detection.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
