Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

Michael Paquier Thu, 08 Oct 2015 21:48:00 -0700

On Thu, Oct 8, 2015 at 3:09 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 11:52 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paqu...@gmail.com> writes:
>>> On Sat, Sep 26, 2015 at 9:12 AM, Tom Lane wrote:
>>>> So the attached modified patch adjusts the PID-match logic and some
>>>> comments, but is otherwise what I posted before.  I believe that this
>>>> might actually work on Windows, but I have no way to test it.  Someone
>>>> please try that?  (Don't forget to test the service-start path, too.)
>>
>>> If "pg_ctl start" is invoked directly from a command prompt, it works.
>>> Now, if I run "pg_ctl start" within a script (in an msysgit
>>> environment for example), pg_ctl fails, complaining that
>>> postmaster.pid already exists. The TAP tests get broken as well,
>>
>> Reading that again, I think you mean that "pg_ctl start" works but you
>> didn't try "pg_ctl start -w" manually?  Because it's "-w" that's at
>> issue here, and the failing test cases are using that.
>
> Yes, sorry. I fat-fingered the command from the prompt and forgot the
> -w switch. test_postmaster_connection just behaves incorrectly, and
> exists immediately with PQPING_NO_RESPONSE, aka "Stopped waiting,
> could not start server, blah".
>
>> I think there is still room to salvage something without fully rewriting
>> the postmaster invocation logic to avoid using CMD, because it's still
>> true that checking whether the CMD process is still there should be as
>> good as checking the postmaster proper.  We just can't use kill() for it.
>> I'd be inclined to get rid of the use of kill() on Unix as well; as Noah
>> mentioned upthread, if we're using fork/exec we might as well pay
>> attention to waitpid's results instead.  On Windows, I imagine that the
>> thing to do is have start_postmaster() save aside the handle for the CMD
>> process, and then in test_postmaster_connection(), check the handle state
>> to see if the process is still running (I assume there's a way to do
>> that).
>
> That would be WaitForSingleObject(handle, ms_timeout) ==
> WAIT_OBJECT_0, no? The handle should be picked up from
> CreateRestrictedProcess, and I think that CloseHandle should not be
> called on pi.hProcess if we are going to wait for it afterwards.
> Reference:
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms687032%28v=vs.85%29.aspx


I had a look at that, and attached is an updated patch showing the
concept of using an HANDLE that pg_ctl waits for. I agree that saving
an HANDLE the way this patch does using a static variable is a bit
ugly especially knowing that service registration uses
test_postmaster_connection as well with directly a postmaster. The
good thing is that tapcheck runs smoothly, with one single failure
though: the second call to pg_ctl start -w may succeed instead of
failing if kicked within an interval of 3 seconds after the first one,
based on the tests on my Windows VM. My guess is that this is caused
by the fact that we monitor the shell process and not the postmaster
itself. Thoughts?
Regards,
-- 
Michael

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index ba189e7..81e0631 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -28,6 +28,7 @@
 #include <time.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/wait.h>
 #include <unistd.h>
 
 #ifdef HAVE_SYS_RESOURCE_H
@@ -110,6 +111,7 @@ static SERVICE_STATUS status;
 static SERVICE_STATUS_HANDLE hStatus = (SERVICE_STATUS_HANDLE) 0;
 static HANDLE shutdownHandles[2];
 static pid_t postmasterPID = -1;
+static HANDLE commandProcess = INVALID_HANDLE_VALUE;
 
 #define shutdownEvent	  shutdownHandles[0]
 #define postmasterProcess shutdownHandles[1]
@@ -153,10 +155,10 @@ static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo,
 static pgpid_t get_pgpid(bool is_status_request);
 static char **readfile(const char *path);
 static void free_readfile(char **optlines);
-static int	start_postmaster(void);
+static pgpid_t start_postmaster(void);
 static void read_post_opts(void);
 
-static PGPing test_postmaster_connection(bool);
+static PGPing test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint);
 static bool postmaster_is_alive(pid_t pid);
 
 #if defined(HAVE_GETRLIMIT) && defined(RLIMIT_CORE)
@@ -419,36 +421,70 @@ free_readfile(char **optlines)
  * start/test/stop routines
  */
 
-static int
+/*
+ * Start the postmaster and return its PID.
+ *
+ * Currently, on Windows what we return is the PID of the shell process
+ * that launched the postmaster (and, we trust, is waiting for it to exit).
+ * So the PID is usable for "is the postmaster still running" checks,
+ * but cannot be compared directly to postmaster.pid.
+ */
+static pgpid_t
 start_postmaster(void)
 {
 	char		cmd[MAXPGPATH];
 
 #ifndef WIN32
+	pgpid_t		pm_pid;
+
+	/* Flush stdio channels just before fork, to avoid double-output problems */
+	fflush(stdout);
+	fflush(stderr);
+
+	pm_pid = fork();
+	if (pm_pid < 0)
+	{
+		/* fork failed */
+		write_stderr(_("%s: could not start server: %s\n"),
+					 progname, strerror(errno));
+		exit(1);
+	}
+	if (pm_pid > 0)
+	{
+		/* fork succeeded, in parent */
+		return pm_pid;
+	}
+
+	/* fork succeeded, in child */
 
 	/*
 	 * Since there might be quotes to handle here, it is easier simply to pass
-	 * everything to a shell to process them.
-	 *
-	 * XXX it would be better to fork and exec so that we would know the child
-	 * postmaster's PID directly; then test_postmaster_connection could use
-	 * the PID without having to rely on reading it back from the pidfile.
+	 * everything to a shell to process them.  Use exec so that the postmaster
+	 * has the same PID as the current child process.
 	 */
 	if (log_file != NULL)
-		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1 &",
+		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
 				 exec_path, pgdata_opt, post_opts,
 				 DEVNULL, log_file);
 	else
-		snprintf(cmd, MAXPGPATH, "\"%s\" %s%s < \"%s\" 2>&1 &",
+		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
 				 exec_path, pgdata_opt, post_opts, DEVNULL);
 
-	return system(cmd);
+	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
+
+	/* exec failed */
+	write_stderr(_("%s: could not start server: %s\n"),
+				 progname, strerror(errno));
+	exit(1);
+
+	return 0;					/* keep dumb compilers quiet */
+
 #else							/* WIN32 */
 
 	/*
-	 * On win32 we don't use system(). So we don't need to use & (which would
-	 * be START /B on win32). However, we still call the shell (CMD.EXE) with
-	 * it to handle redirection etc.
+	 * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
+	 * handle redirection etc.  Unfortunately CMD.EXE lacks any equivalent of
+	 * "exec", so we don't get to find out the postmaster's PID immediately.
 	 */
 	PROCESS_INFORMATION pi;
 
@@ -460,10 +496,15 @@ start_postmaster(void)
 				 exec_path, pgdata_opt, post_opts, DEVNULL);
 
 	if (!CreateRestrictedProcess(cmd, &pi, false))
-		return GetLastError();
-	CloseHandle(pi.hProcess);
+	{
+		write_stderr(_("%s: could not start server: error code %lu\n"),
+					 progname, (unsigned long) GetLastError());
+		exit(1);
+	}
+	/* Don't close pi.hProcess here, we wait for it afterwards */
+	commandProcess = pi.hProcess;
 	CloseHandle(pi.hThread);
-	return 0;
+	return pi.dwProcessId;		/* Shell's PID, not postmaster's! */
 #endif   /* WIN32 */
 }
 
@@ -472,15 +513,17 @@ start_postmaster(void)
 /*
  * Find the pgport and try a connection
  *
+ * On Unix, pm_pid is the PID of the just-launched postmaster.  On Windows,
+ * it's the PID of an ancestor shell process, so we can't check the contents
+ * of postmaster.pid quite as carefully.
+ *
  * Note that the checkpoint parameter enables a Windows service control
  * manager checkpoint, it's got nothing to do with database checkpoints!!
  */
 static PGPing
-test_postmaster_connection(bool do_checkpoint)
+test_postmaster_connection(pgpid_t pm_pid, bool do_checkpoint)
 {
 	PGPing		ret = PQPING_NO_RESPONSE;
-	bool		found_stale_pidfile = false;
-	pgpid_t		pm_pid = 0;
 	char		connstr[MAXPGPATH * 2 + 256];
 	int			i;
 
@@ -535,29 +578,27 @@ test_postmaster_connection(bool do_checkpoint)
 						 optlines[5] != NULL)
 				{
 					/* File is complete enough for us, parse it */
-					long		pmpid;
+					pgpid_t		pmpid;
 					time_t		pmstart;
 
 					/*
-					 * Make sanity checks.  If it's for a standalone backend
-					 * (negative PID), or the recorded start time is before
-					 * pg_ctl started, then either we are looking at the wrong
-					 * data directory, or this is a pre-existing pidfile that
-					 * hasn't (yet?) been overwritten by our child postmaster.
-					 * Allow 2 seconds slop for possible cross-process clock
-					 * skew.
+					 * Make sanity checks.  If it's for the wrong PID, or the
+					 * recorded start time is before pg_ctl started, then
+					 * either we are looking at the wrong data directory, or
+					 * this is a pre-existing pidfile that hasn't (yet?) been
+					 * overwritten by our child postmaster.  Allow 2 seconds
+					 * slop for possible cross-process clock skew.
 					 */
 					pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 					pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
-					if (pmpid <= 0 || pmstart < start_time - 2)
-					{
-						/*
-						 * Set flag to report stale pidfile if it doesn't get
-						 * overwritten before we give up waiting.
-						 */
-						found_stale_pidfile = true;
-					}
-					else
+					if (pmstart >= start_time - 2 &&
+#ifndef WIN32
+						pmpid == pm_pid
+#else
+					/* Windows can only reject standalone-backend PIDs */
+						pmpid > 0
+#endif
+						)
 					{
 						/*
 						 * OK, seems to be a valid pidfile from our child.
@@ -567,9 +608,6 @@ test_postmaster_connection(bool do_checkpoint)
 						char	   *hostaddr;
 						char		host_str[MAXPGPATH];
 
-						found_stale_pidfile = false;
-						pm_pid = (pgpid_t) pmpid;
-
 						/*
 						 * Extract port number and host string to use. Prefer
 						 * using Unix socket if available.
@@ -635,42 +673,34 @@ test_postmaster_connection(bool do_checkpoint)
 		}
 
 		/*
-		 * The postmaster should create postmaster.pid very soon after being
-		 * started.  If it's not there after we've waited 5 or more seconds,
-		 * assume startup failed and give up waiting.  (Note this covers both
-		 * cases where the pidfile was never created, and where it was created
-		 * and then removed during postmaster exit.)  Also, if there *is* a
-		 * file there but it appears stale, issue a suitable warning and give
-		 * up waiting.
+		 * Reap child process if it died; if it remains in zombie state then
+		 * our kill-based test will think it's still running.
 		 */
-		if (i >= 5)
+#ifndef WIN32
 		{
-			struct stat statbuf;
+			int			exitstatus;
 
-			if (stat(pid_file, &statbuf) != 0)
-			{
-				if (errno != ENOENT)
-					write_stderr(_("\n%s: could not stat file \"%s\": %s\n"),
-								 progname, pid_file, strerror(errno));
-				return PQPING_NO_RESPONSE;
-			}
-
-			if (found_stale_pidfile)
-			{
-				write_stderr(_("\n%s: this data directory appears to be running a pre-existing postmaster\n"),
-							 progname);
-				return PQPING_NO_RESPONSE;
-			}
+			(void) waitpid((pid_t) pm_pid, &exitstatus, WNOHANG);
 		}
+#endif
 
 		/*
-		 * If we've been able to identify the child postmaster's PID, check
-		 * the process is still alive.  This covers cases where the postmaster
-		 * successfully created the pidfile but then crashed without removing
-		 * it.
+		 * Check whether the child postmaster process is still alive.  This
+		 * lets us exit early if the postmaster fails during startup.
+		 *
+		 * On Windows, we are checking the postmaster's parent shell, but
+		 * that's fine for this purpose.
 		 */
-		if (pm_pid > 0 && !postmaster_is_alive((pid_t) pm_pid))
+#ifdef WIN32
+		if (commandProcess != INVALID_HANDLE_VALUE)
+		{
+			if (WaitForSingleObject(commandProcess, 0) == WAIT_OBJECT_0)
+				return PQPING_NO_RESPONSE;
+		}
+#else
+		if (!postmaster_is_alive((pid_t) pm_pid))
 			return PQPING_NO_RESPONSE;
+#endif
 
 		/* No response, or startup still in process; wait */
 #if defined(WIN32)
@@ -836,7 +866,7 @@ static void
 do_start(void)
 {
 	pgpid_t		old_pid = 0;
-	int			exitcode;
+	pgpid_t		pm_pid;
 
 	if (ctl_command != RESTART_COMMAND)
 	{
@@ -876,19 +906,13 @@ do_start(void)
 	}
 #endif
 
-	exitcode = start_postmaster();
-	if (exitcode != 0)
-	{
-		write_stderr(_("%s: could not start server: exit code was %d\n"),
-					 progname, exitcode);
-		exit(1);
-	}
+	pm_pid = start_postmaster();
 
 	if (do_wait)
 	{
 		print_msg(_("waiting for server to start..."));
 
-		switch (test_postmaster_connection(false))
+		switch (test_postmaster_connection(pm_pid, false))
 		{
 			case PQPING_OK:
 				print_msg(_(" done\n"));
@@ -914,6 +938,10 @@ do_start(void)
 	}
 	else
 		print_msg(_("server starting\n"));
+
+#ifdef WIN32
+	CloseHandle(commandProcess);
+#endif
 }
 
 
@@ -1585,7 +1613,7 @@ pgwin32_ServiceMain(DWORD argc, LPTSTR *argv)
 	if (do_wait)
 	{
 		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
-		if (test_postmaster_connection(true) != PQPING_OK)
+		if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
 		{
 			write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
 			pgwin32_SetServiceStatus(SERVICE_STOPPED);
@@ -1606,8 +1634,7 @@ pgwin32_ServiceMain(DWORD argc, LPTSTR *argv)
 			{
 				/*
 				 * status.dwCheckPoint can be incremented by
-				 * test_postmaster_connection(true), so it might not start
-				 * from 0.
+				 * test_postmaster_connection(), so it might not start from 0.
 				 */
 				int			maxShutdownCheckPoint = status.dwCheckPoint + 12;;
 
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 177a16f..5d15813 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -34,8 +34,9 @@ else
 close CONF;
 command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
 	'pg_ctl start -w');
-command_ok([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
-	'second pg_ctl start succeeds');
+sleep 3 if ($windows_os);
+command_fails([ 'pg_ctl', 'start', '-D', "$tempdir/data", '-w' ],
+	'second pg_ctl start -w fails');
 command_ok([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
 	'pg_ctl stop -w');
 command_fails([ 'pg_ctl', 'stop', '-D', "$tempdir/data", '-w', '-m', 'fast' ],
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 02533eb..0835b14 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -286,6 +286,7 @@ sub command_fails
 {
 	my ($cmd, $test_name) = @_;
 	my $result = run_log($cmd);
+	print("# Result: $result\n");
 	ok(!$result, $test_name);
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_ctl/pg_rewind tests vs. slow AIX buildfarm members

Reply via email to