Attached is a lightly-tested draft patch that converts the postmaster to use a WaitEventSet for waiting in ServerLoop. I've got mixed emotions about whether this is the direction to proceed, though. It adds at least a couple of kernel calls per postmaster signal delivery, and probably to every postmaster connection acceptance (ServerLoop iteration), to fix problems that are so corner-casey that we've never even known we had them till now.
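To illustrate where those extra kernel calls come from, here is a minimal sketch (not the attached patch, just the general pattern it follows): each postmaster signal handler ends by setting a process-local latch, and SetLatch wakes a concurrent WaitEventSetWait by writing a byte to latch.c's self-pipe, which the wait then has to drain. PostmasterLatch and ServerWaitSet are the names used in the patch below; the example_* functions are hypothetical.

/*
 * Sketch of the handler-sets-latch pattern (illustrative only).
 */
#include "postgres.h"

#include "storage/latch.h"

static Latch PostmasterLatch;           /* process-local latch, as in the patch */
static WaitEventSet *ServerWaitSet;     /* listen sockets + PostmasterLatch */

/* Hypothetical signal handler: note the SetLatch() call at the end */
static void
example_signal_handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    /* ... whatever work the real handler does ... */

    /* Force ServerLoop to iterate; wakes any wait in progress via the self-pipe */
    SetLatch(&PostmasterLatch);

    errno = save_errno;
}

/* Hypothetical ServerLoop-style wait on the shared wait event set */
static void
example_server_wait(long timeout_ms)
{
    WaitEvent   event;
    int         nevents;

    nevents = WaitEventSetWait(ServerWaitSet, timeout_ms, &event, 1, 0);

    if (nevents > 0 && (event.events & WL_LATCH_SET))
        ResetLatch(&PostmasterLatch);   /* clear it, then recheck state flags */
}

The point of putting the latch in the same wait set as the listen sockets is that a signal delivered at any instant (even just before the wait starts) reliably terminates the wait, which is the race the self-pipe machinery exists to close.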
I'm looking longingly at pselect(2), ppoll(2), and epoll_pwait(2), which would solve the problem without need for a self-pipe. But the latter two are Linux-only, and while pselect does exist in recent POSIX editions, it's nonetheless got portability issues. Googling suggests that on a number of platforms, pselect is non-atomic, ie it's nothing but a wrapper for sigprocmask/select/sigprocmask, which would mean that the race condition I described in my previous message still exists.

Despite that, a pretty attractive proposal is to do, essentially,

#ifdef HAVE_PSELECT
    selres = pselect(nSockets, &rmask, NULL, NULL, &timeout, &UnBlockSig);
#else
    PG_SETMASK(&UnBlockSig);
    selres = select(nSockets, &rmask, NULL, NULL, &timeout);
    PG_SETMASK(&BlockSig);
#endif

This fixes the race on platforms where pselect exists and is correctly implemented, and we're no worse off than before where that's not true.

The other component of the problem is the possibility that select() will restart if the signal is marked SA_RESTART. (Presumably that would apply to pselect too.) I am thinking that maybe the answer is "if it hurts, don't do it" --- that is, in the postmaster maybe we shouldn't use SA_RESTART, at least not for these signals.

A different line of thought is to try to provide a bulletproof solution, but push the implementation problems down into latch.c --- that is, the goal would be to provide a pselect-like variant of WaitEventSetWait that is guaranteed to return if interrupted, as opposed to the current behavior where it's guaranteed not to. But that seems like quite a bit of work.

Whether or not we decide to change over the postmaster.c code, I think it'd likely be a good idea to apply most or all of the attached changes in latch.c. Setting CLOEXEC on the relevant FDs is clearly a good thing, and the other changes will provide some safety if some preloaded extension decides to create a latch in the postmaster process.

			regards, tom lane
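To make the SA_RESTART point concrete, here is a small self-contained sketch (plain POSIX, not PostgreSQL code) of a handler installed without SA_RESTART: a signal arriving while select() is blocked then makes select() fail with EINTR, so the loop gets a chance to notice whatever flag the handler set. In the postmaster this would presumably mean arranging for pqsignal(), which normally passes SA_RESTART, not to do so for the signals in question.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>

static volatile sig_atomic_t got_signal = 0;

static void
handle_signal(int signo)
{
    (void) signo;
    got_signal = 1;
}

int
main(void)
{
    struct sigaction act;

    act.sa_handler = handle_signal;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;               /* deliberately no SA_RESTART */
    sigaction(SIGUSR1, &act, NULL);

    for (;;)
    {
        fd_set          rmask;
        struct timeval  timeout;
        int             selres;

        FD_ZERO(&rmask);
        FD_SET(0, &rmask);          /* watch stdin, just for the example */
        timeout.tv_sec = 60;        /* must reset each time; some OSes change it */
        timeout.tv_usec = 0;

        selres = select(1, &rmask, NULL, NULL, &timeout);
        if (selres < 0)
        {
            if (errno == EINTR)
            {
                /* interrupted: the handler's flag is now visible here */
                if (got_signal)
                {
                    got_signal = 0;
                    printf("woken by signal\n");
                }
                continue;
            }
            perror("select");
            return 1;
        }
        /* ... handle readable descriptors or the timeout here ... */
    }
}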
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6831342..658ba73 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** *** 79,88 **** #include <netdb.h> #include <limits.h> - #ifdef HAVE_SYS_SELECT_H - #include <sys/select.h> - #endif - #ifdef USE_BONJOUR #include <dns_sd.h> #endif --- 79,84 ---- *************** static bool LoadedSSL = false; *** 379,384 **** --- 375,386 ---- static DNSServiceRef bonjour_sdref = NULL; #endif + /* WaitEventSet that ServerLoop uses to wait for connections */ + static WaitEventSet *ServerWaitSet = NULL; + + /* Latch used to ensure that signals interrupt the wait in ServerLoop */ + static Latch PostmasterLatch; + /* * postmaster.c - function prototypes */ *************** static int ServerLoop(void); *** 409,415 **** static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! static int initMasks(fd_set *rmask); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); --- 411,417 ---- static int BackendStartup(Port *port); static int ProcessStartupPacket(Port *port, bool SSLdone); static void processCancelRequest(Port *port, void *pkt); ! static void initServerWaitSet(void); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); static bool RandomCancelKey(int32 *cancel_key); *************** checkDataDir(void) *** 1535,1541 **** } /* ! * Determine how long should we let ServerLoop sleep. * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are --- 1537,1543 ---- } /* ! * Determine how long should we let ServerLoop sleep (in msec). * * In normal conditions we wait at most one minute, to ensure that the other * background tasks handled by ServerLoop get done even when no requests are *************** checkDataDir(void) *** 1543,1551 **** * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static void ! DetermineSleepTime(struct timeval * timeout) { TimestampTz next_wakeup = 0; /* --- 1545,1554 ---- * we don't actually sleep so that they are quickly serviced. Other exception * cases are as shown in the code. */ ! static long ! DetermineSleepTime(void) { + long timeout; TimestampTz next_wakeup = 0; /* *************** DetermineSleepTime(struct timeval * time *** 1558,1582 **** if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout->tv_sec = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout->tv_sec = Max(timeout->tv_sec, 0); ! timeout->tv_usec = 0; } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } ! return; } if (StartWorkerNeeded) ! { ! timeout->tv_sec = 0; ! timeout->tv_usec = 0; ! return; ! } if (HaveCrashedWorker) { --- 1561,1577 ---- if (AbortStartTime != 0) { /* time left to abort; clamp to 0 in case it already expired */ ! timeout = SIGKILL_CHILDREN_AFTER_SECS - (time(NULL) - AbortStartTime); ! timeout = Max(timeout, 0) * 1000; } else ! timeout = 60 * 1000; ! return timeout; } if (StartWorkerNeeded) ! 
return 0; if (HaveCrashedWorker) { *************** DetermineSleepTime(struct timeval * time *** 1617,1639 **** long secs; int microsecs; TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout->tv_sec = secs; ! timeout->tv_usec = microsecs; /* Ensure we don't exceed one minute */ ! if (timeout->tv_sec > 60) ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } else ! { ! timeout->tv_sec = 60; ! timeout->tv_usec = 0; ! } } /* --- 1612,1630 ---- long secs; int microsecs; + /* XXX this is stupidly inefficient */ TimestampDifference(GetCurrentTimestamp(), next_wakeup, &secs, µsecs); ! timeout = secs * 1000 + microsecs / 1000; /* Ensure we don't exceed one minute */ ! if (timeout > 60 * 1000) ! timeout = 60 * 1000; } else ! timeout = 60 * 1000; ! ! return timeout; } /* *************** DetermineSleepTime(struct timeval * time *** 1644,1662 **** static int ServerLoop(void) { - fd_set readmask; - int nSockets; time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! nSockets = initMasks(&readmask); for (;;) { ! fd_set rmask; ! int selres; time_t now; /* --- 1635,1651 ---- static int ServerLoop(void) { time_t last_lockfile_recheck_time, last_touch_time; last_lockfile_recheck_time = last_touch_time = time(NULL); ! initServerWaitSet(); for (;;) { ! WaitEvent event; ! int nevents; time_t now; /* *************** ServerLoop(void) *** 1667,1742 **** * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call select(), and just sleep. */ - memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set)); - if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! selres = 0; PG_SETMASK(&BlockSig); } else { ! /* must set timeout each time; some OSes change it! */ ! struct timeval timeout; /* Needs to run with blocked signals! */ ! DetermineSleepTime(&timeout); PG_SETMASK(&UnBlockSig); ! selres = select(nSockets, &rmask, NULL, NULL, &timeout); PG_SETMASK(&BlockSig); } - /* Now check the select() result */ - if (selres < 0) - { - if (errno != EINTR && errno != EWOULDBLOCK) - { - ereport(LOG, - (errcode_for_socket_access(), - errmsg("select() failed in postmaster: %m"))); - return STATUS_ERROR; - } - } - /* ! * New connection pending on any of our sockets? If so, fork a child ! * process to deal with it. */ ! if (selres > 0) { ! int i; ! ! for (i = 0; i < MAXLISTEN; i++) { ! if (ListenSocket[i] == PGINVALID_SOCKET) ! break; ! if (FD_ISSET(ListenSocket[i], &rmask)) ! { ! Port *port; ! port = ConnCreate(ListenSocket[i]); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure ! * in this process ! */ ! StreamClose(port->sock); ! ConnFree(port); ! } } } } /* If we have lost the log collector, try to start a new one */ --- 1656,1719 ---- * do nontrivial work. * * If we are in PM_WAIT_DEAD_END state, then we don't want to accept ! * any new connections, so we don't call WaitEventSetWait(), and just ! * sleep. XXX not ideal */ if (pmState == PM_WAIT_DEAD_END) { PG_SETMASK(&UnBlockSig); pg_usleep(100000L); /* 100 msec seems reasonable */ ! nevents = 0; PG_SETMASK(&BlockSig); } else { ! long timeout; /* Needs to run with blocked signals! */ ! timeout = DetermineSleepTime(); PG_SETMASK(&UnBlockSig); ! nevents = WaitEventSetWait(ServerWaitSet, timeout, &event, 1, 0); PG_SETMASK(&BlockSig); } /* ! * Deal with detected event, if any. */ ! if (nevents > 0) { ! /* ! 
* New connection pending on any of our sockets? If so, fork a ! * child process to deal with it. ! */ ! if (event.events & WL_SOCKET_READABLE) { ! Port *port; ! port = ConnCreate(event.fd); ! if (port) ! { ! BackendStartup(port); ! /* ! * We no longer need the open socket or port structure in ! * this process ! */ ! StreamClose(port->sock); ! ConnFree(port); } } + + /* + * Latch set? If so, clear it. + */ + else if (event.events & WL_LATCH_SET) + ResetLatch(&PostmasterLatch); } /* If we have lost the log collector, try to start a new one */ *************** ServerLoop(void) *** 1874,1903 **** } /* ! * Initialise the masks for select() for the ports we are listening on. ! * Return the number of sockets to listen on. */ ! static int ! initMasks(fd_set *rmask) { - int maxsock = -1; int i; ! FD_ZERO(rmask); for (i = 0; i < MAXLISTEN; i++) { ! int fd = ListenSocket[i]; ! if (fd == PGINVALID_SOCKET) break; ! FD_SET(fd, rmask); ! ! if (fd > maxsock) ! maxsock = fd; } ! return maxsock + 1; } --- 1851,1879 ---- } /* ! * Create a WaitEventSet for ServerLoop() to wait on. This includes the ! * sockets we are listening on, plus the PostmasterLatch. */ ! static void ! initServerWaitSet(void) { int i; ! ServerWaitSet = CreateWaitEventSet(PostmasterContext, MAXLISTEN + 1); for (i = 0; i < MAXLISTEN; i++) { ! pgsocket sock = ListenSocket[i]; ! if (sock == PGINVALID_SOCKET) break; ! AddWaitEventToSet(ServerWaitSet, WL_SOCKET_READABLE, sock, NULL, NULL); } ! InitializeLatchSupport(); ! InitLatch(&PostmasterLatch); ! AddWaitEventToSet(ServerWaitSet, WL_LATCH_SET, PGINVALID_SOCKET, ! &PostmasterLatch, NULL); } *************** ClosePostmasterPorts(bool am_syslogger) *** 2436,2441 **** --- 2412,2431 ---- postmaster_alive_fds[POSTMASTER_FD_OWN] = -1; #endif + /* + * Close the postmaster's WaitEventSet, if any, to be sure that child + * processes don't accidentally mess with underlying file descriptors. + * (Note: in an EXEC_BACKEND build, this won't do anything because we + * didn't inherit the static pointer, much less the data structure itself. + * But it's OK because WaitEventSets don't contain any resources that can + * be inherited across exec().) 
+ */ + if (ServerWaitSet) + { + FreeWaitEventSet(ServerWaitSet); + ServerWaitSet = NULL; + } + /* Close the listen sockets */ for (i = 0; i < MAXLISTEN; i++) { *************** SIGHUP_handler(SIGNAL_ARGS) *** 2553,2558 **** --- 2543,2551 ---- #endif } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** pmdie(SIGNAL_ARGS) *** 2724,2729 **** --- 2717,2725 ---- break; } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; *************** reaper(SIGNAL_ARGS) *** 3037,3042 **** --- 3033,3041 ---- */ PostmasterStateMachine(); + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + /* Done with signal handler */ PG_SETMASK(&UnBlockSig); *************** sigusr1_handler(SIGNAL_ARGS) *** 5078,5083 **** --- 5077,5085 ---- signal_child(StartupPID, SIGUSR2); } + /* Force ServerLoop to iterate */ + SetLatch(&PostmasterLatch); + PG_SETMASK(&UnBlockSig); errno = save_errno; diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4798370..92d2ff0 100644 *** a/src/backend/storage/ipc/latch.c --- b/src/backend/storage/ipc/latch.c *************** *** 62,67 **** --- 62,71 ---- #include "storage/pmsignal.h" #include "storage/shmem.h" + #ifndef FD_CLOEXEC + #define FD_CLOEXEC 1 + #endif + /* * Select the fd readiness primitive to use. Normally the "most modern" * primitive supported by the OS will be used, but for testing it can be *************** static volatile sig_atomic_t waiting = f *** 130,135 **** --- 134,142 ---- static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; + /* Process owning the self-pipe --- needed for checking purposes */ + static int selfpipe_owner_pid = 0; + /* Private function prototypes */ static void sendSelfPipeByte(void); static void drainSelfPipe(void); *************** InitializeLatchSupport(void) *** 158,164 **** #ifndef WIN32 int pipefd[2]; ! Assert(selfpipe_readfd == -1); /* * Set up the self-pipe that allows a signal handler to wake up the --- 165,211 ---- #ifndef WIN32 int pipefd[2]; ! if (IsUnderPostmaster) ! { ! /* ! * We might have inherited connections to a self-pipe created by the ! * postmaster. It's critical that child processes create their own ! * self-pipes, of course, and we really want them to close the ! * inherited FDs for safety's sake. On most platforms we can use ! * F_SETFD/FD_CLOEXEC to make sure that happens, but if we lack that ! * facility, close the FDs the hard way. ! */ ! if (selfpipe_owner_pid != 0) ! { ! /* Assert we go through here but once in a child process */ ! Assert(selfpipe_owner_pid != MyProcPid); ! #ifndef F_SETFD ! /* Release postmaster's pipe FDs, if we lack FD_CLOEXEC */ ! close(selfpipe_readfd); ! close(selfpipe_writefd); ! #endif ! /* Clean up, just for safety's sake; we'll set these below */ ! selfpipe_readfd = selfpipe_writefd = -1; ! selfpipe_owner_pid = 0; ! } ! else ! { ! /* ! * Postmaster didn't create a self-pipe ... or else we're in an ! * EXEC_BACKEND build, and don't know because we didn't inherit ! * values for the static variables. (If we lack FD_CLOEXEC we'll ! * fail to close the inherited FDs, but that seems acceptable ! * since EXEC_BACKEND builds aren't meant for production on Unix.) ! */ ! Assert(selfpipe_readfd == -1); ! } ! } ! else ! { ! /* In postmaster or standalone backend, assert we do this but once */ ! Assert(selfpipe_readfd == -1); ! Assert(selfpipe_owner_pid == 0); ! 
} /* * Set up the self-pipe that allows a signal handler to wake up the *************** InitializeLatchSupport(void) *** 166,188 **** * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl() failed on write-end of self-pipe: %m"); selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a backend-local latch. */ void InitLatch(volatile Latch *latch) --- 213,244 ---- * SetLatch won't block if the event has already been set many times * filling the kernel buffer. Make the read-end non-blocking too, so that * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK. + * Also, if possible, make both FDs close-on-exec, since we surely do not + * want any child processes messing with them. */ if (pipe(pipefd) < 0) elog(FATAL, "pipe() failed: %m"); if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on read-end of self-pipe: %m"); if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) ! elog(FATAL, "fcntl(F_SETFL) failed on write-end of self-pipe: %m"); ! #ifdef F_SETFD ! if (fcntl(pipefd[0], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on read-end of self-pipe: %m"); ! if (fcntl(pipefd[1], F_SETFD, FD_CLOEXEC) < 0) ! elog(FATAL, "fcntl(F_SETFD) failed on write-end of self-pipe: %m"); ! #endif selfpipe_readfd = pipefd[0]; selfpipe_writefd = pipefd[1]; + selfpipe_owner_pid = MyProcPid; #else /* currently, nothing to do here for Windows */ #endif } /* ! * Initialize a process-local latch. */ void InitLatch(volatile Latch *latch) *************** InitLatch(volatile Latch *latch) *** 193,199 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) --- 249,255 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #else latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) *************** InitLatch(volatile Latch *latch) *** 211,216 **** --- 267,276 ---- * containing the latch with ShmemInitStruct. (The Unix implementation * doesn't actually require that, but the Windows one does.) Because of * this restriction, we have no concurrency issues to worry about here. + * + * Note that other handles created in this module are never marked as + * inheritable. Thus we do not need to worry about cleaning up child + * process references to postmaster-private latches or WaitEventSets. */ void InitSharedLatch(volatile Latch *latch) *************** OwnLatch(volatile Latch *latch) *** 256,262 **** #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! Assert(selfpipe_readfd >= 0); #endif if (latch->owner_pid != 0) --- 316,322 ---- #ifndef WIN32 /* Assert InitializeLatchSupport has been called in this process */ ! 
Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); #endif if (latch->owner_pid != 0) *************** DisownLatch(volatile Latch *latch) *** 289,295 **** * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * backend-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note --- 349,355 ---- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible. * * The latch must be owned by the current process, ie. it must be a ! * process-local latch initialized with InitLatch, or a shared latch * associated with the current process by calling OwnLatch. * * Returns bit mask indicating which condition(s) caused the wake-up. Note *************** CreateWaitEventSet(MemoryContext context *** 531,536 **** --- 591,600 ---- set->epoll_fd = epoll_create(nevents); if (set->epoll_fd < 0) elog(ERROR, "epoll_create failed: %m"); + /* Attempt to make the FD non-inheritable; if fail, no big deal */ + #ifdef F_SETFD + (void) fcntl(set->epoll_fd, F_SETFD, FD_CLOEXEC); + #endif #elif defined(WAIT_USE_WIN32) /* *************** CreateWaitEventSet(MemoryContext context *** 551,556 **** --- 615,626 ---- /* * Free a previously created WaitEventSet. + * + * Note: preferably, this shouldn't have to free any resources that could be + * inherited across an exec(). If it did, we'd likely leak those resources in + * many scenarios. For the epoll() case, we attempt to ensure that by setting + * FD_CLOEXEC when the FD is created. For the Windows case, we assume that + * the handles involved are non-inheritable. */ void FreeWaitEventSet(WaitEventSet *set) *************** FreeWaitEventSet(WaitEventSet *set) *** 597,603 **** * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a backend-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are --- 667,673 ---- * used to modify previously added wait events using ModifyWaitEvent(). * * In the WL_LATCH_SET case the latch must be owned by the current process, ! * i.e. it must be a process-local latch initialized with InitLatch, or a * shared latch associated with the current process by calling OwnLatch. * * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are