Hello!

We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 times.

There are two different problems with pg_ctl:
1 - share access to postmaster.pid;

2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new cmd.exe.
waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file because it is being used by another process.
 stopped waiting
pg_ctl: could not start server
Examine the log output.

To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function to correctly handle share locks. On Windows systems we cannot handle ERROR_DELETE_PENDING because GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

To avoid the second error we added the wait for cmd.exe finish in the do_stop() and do_restart() functions.

Git patch is in the attachment.

On 2018-03-12 18:55, Badrul Chowdhury wrote:
Hi Tom,

This is a great catch. I am looking into it: I will start by
reproducing the error as you suggested.

Thanks,
Badrul

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
Sent: Saturday, March 10, 2018 2:48 PM
To: pgsql-hackers@lists.postgresql.org
Subject: Intermittent pg_ctl failures on Windows

The buildfarm's Windows members occasionally show weird pg_ctl
failures, for instance this recent case:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0

### Restarting node "master"
# Running: pg_ctl -D
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
-l
G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
restart waiting for server to shut down.... done server stopped
waiting for server to start....The process cannot access the file
because it is being used by another process.
 stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out!  system pg_ctl failed

or this one:

https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0

### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
-m fast stop waiting for server to shut down....pg_ctl: could not open
PID file
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
Permission denied Bail out!  system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus
interference, but it suddenly occurred to me that there might be a
consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
for server start/stop, it is trying to read postmaster.pid
more-or-less concurrently with the postmaster writing to that file.
On Unix that's not much of a problem, but I believe that on Windows
you have to specifically open the file with sharing enabled, or you
get error messages like these.
The postmaster should be enabling sharing, because port.h redirects
open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
flags.
But it only does that #ifndef FRONTEND.  So pg_ctl is just using naked
open(), which could explain these failures.

If this theory is accurate, it should be pretty easy to replicate the
problem if you modify the postmaster to hold postmaster.pid open
longer when rewriting it, e.g. stick fractional-second sleeps into
CreateLockFile and AddToDataDirLockFile.

I'm not in a position to investigate this in detail nor test a fix,
but I think somebody should.

                        regards, tom lane

--

regars, Roman Zharkov
From 7db244f6224d772a0c0fb26e6dfc38b2bad7a93e Mon Sep 17 00:00:00 2001
From: Roman Zharkov <r.zhar...@postgrespro.ru>
Date: Fri, 12 Jul 2019 13:01:58 +0700
Subject: [PATCH] [refer #PGPRO-2846] Avoid share locks on log_file between old
 and new cmd.exe when the server restarts. pg_ctl now opens the postmaster.pid
 file using pgwin32_open() function to correctly handle share locks. On
 Windows systems we cannot handle ERROR_DELETE_PENDING because GetLastError()
 returns ERROR_ACCESS_DENIED instead. So we rename the lock files before
 delete them. Discussion:
 https://www.postgresql.org/message-id/flat/16922.1520722108%40sss.pgh.pa.us

---
 src/backend/utils/init/miscinit.c |  18 ++++-
 src/bin/pg_ctl/pg_ctl.c           | 135 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ce929d8806..32a3e9520f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -840,9 +840,21 @@ UnlinkLockFiles(int status, Datum arg)
 
 	foreach(l, lock_files)
 	{
-		char	   *curfile = (char *) lfirst(l);
-
-		unlink(curfile);
+		/*
+		 * On Windows systems we cannot handle ERROR_DELETE_PENDING
+		 * because GetLastError() returns ERROR_ACCESS_DENIED instead.
+		 * So rename the file first.
+		 */
+		char	   *tmpfile;
+		char	   *curfile = (char *)lfirst(l);
+		tmpfile = psprintf("%s.deleted", curfile);
+		if (rename(curfile, tmpfile) == 0) {
+			unlink(tmpfile);
+		}
+		else {
+			unlink(curfile);
+		}
+		pfree(tmpfile);
 		/* Should we complain if the unlink fails? */
 	}
 	/* Since we're about to exit, no need to reclaim storage */
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index ed2396aa6c..6b6705e0f2 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -31,6 +31,8 @@
 
 #ifdef WIN32					/* on Unix, we don't need libpq */
 #include "pqexpbuffer.h"
+#include "windows.h"
+#include "tlhelp32.h"
 #endif
 
 /* PID can be negative for standalone backend */
@@ -75,6 +77,11 @@ typedef enum
 
 static bool do_wait = true;
 static int	wait_seconds = DEFAULT_WAIT;
+#ifdef WIN32
+static int	windows_shell_wait_seconds = 1;
+#define		open(a,b,c) pgwin32_open(a,b,c) /* It is not necessary to include whole port.h */
+#define		fopen(a,b) pgwin32_fopen(a,b)
+#endif
 static bool wait_seconds_arg = false;
 static bool silent_mode = false;
 static ShutdownMode shutdown_mode = FAST_MODE;
@@ -110,6 +117,8 @@ static pid_t postmasterPID = -1;
 
 #define shutdownEvent	  shutdownHandles[0]
 #define postmasterProcess shutdownHandles[1]
+static pgpid_t	pgwin32_getppid(pgpid_t pid);
+static bool		pgwin32_parent_shell_is_alive(pid_t pid);
 #endif
 
 
@@ -867,6 +876,12 @@ do_stop(void)
 
 	pid = get_pgpid(false);
 
+#ifdef WIN32
+	/* On Windows systems we must additionaly check status of the processes after server stops */
+	pgpid_t		postmasters_shell_pid;
+	postmasters_shell_pid = pgwin32_getppid(pid);
+#endif
+
 	if (pid == 0)				/* no pid file */
 	{
 		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
@@ -934,6 +949,32 @@ do_stop(void)
 							   "waiting for session-initiated disconnection.\n"));
 			exit(1);
 		}
+
+#ifdef WIN32
+		/*
+		* On Windows systems postmaster's shell may still run after postmaster.pid file has deleted.
+		* This may lead to share locks on log_file between old and new cmd.exe when the server restarts.
+		* The share locks do not allow the new cmd.exe to run.
+		*/
+		if (postmasters_shell_pid > 0 && pgwin32_parent_shell_is_alive(postmasters_shell_pid))
+		{
+			print_msg(_("\npostmaster's shell is still alive. waiting for server to shut down again..."));
+			/* wait for stop */
+			for (cnt = 0; cnt < windows_shell_wait_seconds * WAITS_PER_SEC; cnt++)
+			{
+				if (pgwin32_parent_shell_is_alive(postmasters_shell_pid))
+				{
+					if (cnt % WAITS_PER_SEC == 0)
+						print_msg(".");
+					pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+				}
+				else
+					break;
+			}
+			/* Do not fail if the parent process is still running. */
+		}
+#endif
+
 		print_msg(_(" done\n"));
 
 		print_msg(_("server stopped\n"));
@@ -954,6 +995,12 @@ do_restart(void)
 
 	pid = get_pgpid(false);
 
+#ifdef WIN32
+	/* On Windows systems we must additionaly check status of the processes after server stops */
+	pgpid_t		postmasters_shell_pid;
+	postmasters_shell_pid = pgwin32_getppid(pid);
+#endif
+
 	if (pid == 0)				/* no pid file */
 	{
 		write_stderr(_("%s: PID file \"%s\" does not exist\n"),
@@ -1026,6 +1073,31 @@ do_restart(void)
 			exit(1);
 		}
 
+#ifdef WIN32
+		/*
+		* On Windows systems postmaster's shell may still run after postmaster.pid file has deleted.
+		* This may lead to share locks on log_file between old and new cmd.exe when the server restarts.
+		* The share locks do not allow the new cmd.exe to run.
+		*/
+		if (postmasters_shell_pid > 0 && pgwin32_parent_shell_is_alive(postmasters_shell_pid))
+		{
+			print_msg(_("\npostmaster's shell is still alive. waiting for server to shut down again..."));
+			/* wait for stop */
+			for (cnt = 0; cnt < windows_shell_wait_seconds * WAITS_PER_SEC; cnt++)
+			{
+				if (pgwin32_parent_shell_is_alive(postmasters_shell_pid))
+				{
+					if (cnt % WAITS_PER_SEC == 0)
+						print_msg(".");
+					pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+				}
+				else
+					break;
+			}
+			/* Do not fail if the parent process is still running. */
+		}
+#endif
+
 		print_msg(_(" done\n"));
 		print_msg(_("server stopped\n"));
 	}
@@ -1176,6 +1248,69 @@ do_promote(void)
  *	utility routines
  */
 
+#ifdef WIN32
+static pgpid_t
+pgwin32_getppid(pgpid_t pid)
+{
+	/* Finds parent process id. Returns 0 by default. */
+	HANDLE hSnapshot;
+	PROCESSENTRY32 pe32;
+	pgpid_t ppid = 0;
+
+	hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, (DWORD)pid);
+	__try{
+		if (hSnapshot == INVALID_HANDLE_VALUE) __leave;
+
+		ZeroMemory(&pe32, sizeof(pe32));
+		pe32.dwSize = sizeof(pe32);
+		if (!Process32First(hSnapshot, &pe32)) __leave;
+
+		do{
+			if (pe32.th32ProcessID == pid){
+				ppid = (pgpid_t)pe32.th32ParentProcessID;
+				break;
+			}
+		} while (Process32Next(hSnapshot, &pe32));
+
+	}
+	__finally{
+		if (hSnapshot != INVALID_HANDLE_VALUE) CloseHandle(hSnapshot);
+	}
+	return ppid;
+}
+
+static bool
+pgwin32_parent_shell_is_alive(pid_t pid)
+{
+	{
+		/* Checks the parent cmd.exe with specified pid is running */
+		HANDLE hSnapshot;
+		PROCESSENTRY32 pe32;
+		bool res = false;
+
+		hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, (DWORD)pid);
+		__try{
+			if (hSnapshot == INVALID_HANDLE_VALUE) __leave;
+
+			ZeroMemory(&pe32, sizeof(pe32));
+			pe32.dwSize = sizeof(pe32);
+			if (!Process32First(hSnapshot, &pe32)) __leave;
+
+			do{
+				if (pe32.th32ProcessID == pid && strcmp(pe32.szExeFile, "cmd.exe") == 0){
+					res = true;
+				}
+			} while (Process32Next(hSnapshot, &pe32));
+
+		}
+		__finally{
+			if (hSnapshot != INVALID_HANDLE_VALUE) CloseHandle(hSnapshot);
+		}
+		return res;
+	}
+}
+#endif
+
 static bool
 postmaster_is_alive(pid_t pid)
 {
-- 
2.14.1.windows.1

Reply via email to