Tom Lane wrote:
A number of the buildfarm machines have been failing HEAD builds
at the "make check" stage since last night, with complaints like
this one from emu:
================== pgsql.21911/src/test/regress/log/postmaster.log ===================
FATAL:  lock file "/tmp/.s.PGSQL.55678.lock" already exists
HINT:  Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?

What's happened is that that GUC patch that was in the tree for a few
hours broke postmaster startup on some machines (for as-yet-unidentified
reasons).  The postmaster does actually start and establish its
lockfiles, but it never gets to the stage of being able to accept
connections.

After the buildfarm script rm -rf's the build tree, the postmaster
process is still there but "disembodied" (its executable file is
probably gone, for example, or at least in the state of zero remaining
directory links).  But it's still got that socket file and lockfile
in /tmp, and this prevents another postmaster from starting with the
same port number.

If you've got this situation, you'll need to do a manual "kill" on the
PID mentioned in the lock file before things will start working again.
(pg_ctl won't work because it looks for the data directory
postmaster.pid file, which is long gone.)  More generally you might want
to look through a ps listing for unexpected postgres-owned processes.

I'm not sure whether there's anything much we can do to prevent such
problems in future.  Maybe it'd be reasonable for pg_regress to do a
kill -9 on its postmaster child process if it gives up waiting for the
postmaster to accept connections.



That's amazingly ugly, and well diagnosed.

BTW, buildfarm processes would typically not be postgres-owned, at least not on my machines. I run either as myself or as a special buildfarm user.

I'm trying to think how we could harden the buildfarm script to avoid such situations, although I am so far without any great revelations.

The idea of getting pg_regress to send a signal isn't bad, but what if the PID gets reused? We know that not all systems allocate PIDs in a cyclical fashion.
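
One observation on the reuse worry, as a rough sketch only: on POSIX systems a child's PID is not recycled until its parent has reaped it with wait()/waitpid(), so signalling a child that has not yet been reaped cannot hit an unrelated process. This assumes the postmaster is a direct child of pg_regress, which is an assumption here. The function name kill_and_reap_child is made up for illustration and is not existing pg_regress code:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper, not pg_regress code.
 *
 * Give up on a direct child process: if it has already exited, just reap
 * it; otherwise send SIGKILL and then reap it.
 *
 * Returns 0 if the child had already exited, 1 if it had to be killed,
 * -1 on error.
 */
static int
kill_and_reap_child(pid_t child)
{
    int   status;
    pid_t r;

    /* Non-blocking check: has the child already exited? */
    r = waitpid(child, &status, WNOHANG);
    if (r == child)
        return 0;               /* exited on its own, now reaped */
    if (r < 0)
        return -1;              /* not our child, or other error */

    /* Still running and not yet reaped, so the PID still belongs to it. */
    if (kill(child, SIGKILL) != 0)
        return -1;

    /* Collect the exit status so no zombie is left behind. */
    if (waitpid(child, &status, 0) != child)
        return -1;

    return 1;
}
```

This does nothing for the case where the parent itself dies first; the orphaned postmaster would still need a manual kill.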

Also, I see the pg_regress code has this comment:

           /*
            * Fail immediately if postmaster has exited
            *
            * XXX is there a way to do this on Windows?
            */

As I understand it, the way to do it is to call OpenProcess() - if that succeeds then it is still there. I guess if needed we could even do that in src/port/kill.c so that kill(pid,0) would work. But I would want confirmation from the Windows gurus.
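
For reference, the Unix-side behaviour that a Windows kill(pid,0) would have to mimic looks roughly like this. It is a sketch for illustration, not the actual pg_regress code, and process_exists is a made-up name:

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: report whether a process with the given PID exists.
 *
 * Signal 0 performs the existence and permission checks but delivers
 * no signal.
 */
static bool
process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;

    /*
     * EPERM means the process exists but we are not allowed to signal it.
     * ESRCH means there is no such process.
     */
    return errno == EPERM;
}
```

Note that this only says some process has that PID; it cannot tell whether it is still the one we started.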


cheers

andrew
