Hi,

Recently, one of our clients reported that Windows 10 sometimes (approximately once in 300 tries) hangs during OS startup while the PostgreSQL 9.3.x service is starting. My co-worker analyzed this and found that a PostgreSQL auxiliary process and the Windows logon processes are in a deadlock.
Although this problem has so far been found only with PostgreSQL 9.3.x on Windows 10 in our client's environment, the same problem may occur with other versions of PostgreSQL.

He reported this problem to the pgsql-general list, as forwarded below. He also created a patch that adds a build-time option to insert a 0.5 or 3.0 second delay after each subprocess starts. The attached patch is the same one. Our client confirmed that this patch resolves the deadlock problem.

Would it be acceptable to add this option to PostgreSQL? Any comments would be appreciated.

Regards,

Begin forwarded message:

Date: Fri, 29 Jun 2018 15:03:10 +0900
From: TAKATSUKA Haruka <haru...@sraoss.co.jp>
To: pgsql-gene...@postgresql.org
Subject: Windows 10 got stuck with PostgreSQL at starting up. Adding delay lets it avoid.

I ran into trouble with PostgreSQL 9.3.x on Windows 10. I would like to add new delay code as an official build option.

Windows 10 sometimes (approximately once in 300 tries) hung during OS startup. The logs say it happened while the PostgreSQL service was starting. When the OS stopped, some postgres auxiliary processes had been started and some had not yet been started.

The Windows dump says that some threads of a postgres auxiliary process are waiting on OS-level locks and that the logon processes' threads are also waiting on a lock.

The MS help desk said that an OS-level deadlock in PostgreSQL caused the OS freeze. I think that is a strange story. But, in fact, the hang did not happen in repeated tests once I removed PostgreSQL from the services started automatically at boot.

I tweaked PostgreSQL 9.3.x (the newest from the repository) to add a 0.5 or 3.0 second delay after each subprocess starts, and the hang went away. This test patch is attached. It is implemented only for Windows. Also, I did not use the existing pg_usleep because it contains locking code (e.g. WaitForSingleObject and Enter/LeaveCriticalSection).

Although Windows itself may have some problems, I think we should have a means of avoiding this.
Could PostgreSQL accept such delay code as a build-time option controlled by preprocessor variables?

Thanks,
Takatsuka Haruka

--
Yugo Nagata <nag...@sraoss.co.jp>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d6fc2ed..ff03ebd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -398,6 +398,30 @@ extern int optreset; /* might not be declared by system headers */
 static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
+#define USE_AFTER_AUX_FORK_SLEEP 3000
+
+#ifdef USE_AFTER_AUX_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_AUX_FORK_SLEEP()
+#else
+#define AFTER_AUX_FORK_SLEEP() do { SleepEx(USE_AFTER_AUX_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_AUX_FORK_SLEEP()
+#endif
+
+#define USE_AFTER_BACKEND_FORK_SLEEP 500
+
+#ifdef USE_AFTER_BACKEND_FORK_SLEEP
+#ifndef WIN32
+#define AFTER_BACKEND_FORK_SLEEP()
+#else
+#define AFTER_BACKEND_FORK_SLEEP() do { SleepEx(USE_AFTER_BACKEND_FORK_SLEEP, FALSE); } while(0)
+#endif
+#else
+#define AFTER_BACKEND_FORK_SLEEP()
+#endif
+
 /*
  * postmaster.c - function prototypes
  */
@@ -1709,6 +1733,7 @@ ServerLoop(void)
 				 */
 				StreamClose(port->sock);
 				ConnFree(port);
+				AFTER_BACKEND_FORK_SLEEP();
 			}
 		}
 	}
@@ -2801,11 +2826,20 @@ reaper(SIGNAL_ARGS)
 			 * situation, some of them may be alive already.
 			 */
 			if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
+			{
 				AutoVacPID = StartAutoVacLauncher();
+				AFTER_AUX_FORK_SLEEP();
+			}
 			if (XLogArchivingActive() && PgArchPID == 0)
+			{
 				PgArchPID = pgarch_start();
+				AFTER_AUX_FORK_SLEEP();
+			}
 			if (PgStatPID == 0)
+			{
 				PgStatPID = pgstat_start();
+				AFTER_AUX_FORK_SLEEP();
+			}

 			/* some workers may be scheduled to start now */
 			maybe_start_bgworker();
@@ -5259,6 +5293,7 @@ StartChildProcess(AuxProcType type)
 	/*
 	 * in parent, successful fork
 	 */
+	AFTER_AUX_FORK_SLEEP();
 	return pid;
 }
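
For discussion, one way the option could be made opt-in is sketched below. As posted, the patch defines USE_AFTER_AUX_FORK_SLEEP unconditionally, so the #ifdef around it is always true. This sketch assumes the value is instead supplied by the build (for example, -DUSE_AFTER_AUX_FORK_SLEEP=3000) and compiles to a no-op otherwise. AFTER_AUX_FORK_SLEEP_MS and start_child_then_pause() are hypothetical names used only for illustration; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>

/*
 * Assumption: the delay in milliseconds comes from the build, e.g.
 *   -DUSE_AFTER_AUX_FORK_SLEEP=3000
 * and is not #defined unconditionally in the source file.
 */
#if (defined(WIN32) || defined(_WIN32)) && defined(USE_AFTER_AUX_FORK_SLEEP)
#include <windows.h>
#define AFTER_AUX_FORK_SLEEP_MS USE_AFTER_AUX_FORK_SLEEP
/* SleepEx with bAlertable = FALSE, as in the posted patch. */
#define AFTER_AUX_FORK_SLEEP() \
	do { SleepEx(AFTER_AUX_FORK_SLEEP_MS, FALSE); } while (0)
#else
/* Option not requested, or not Windows: expand to a no-op statement. */
#define AFTER_AUX_FORK_SLEEP_MS 0
#define AFTER_AUX_FORK_SLEEP() ((void) 0)
#endif

/*
 * Hypothetical stand-in for a call site such as the end of
 * StartChildProcess(): pause (if enabled) and hand back the child's pid.
 */
static int
start_child_then_pause(int pid)
{
	assert(pid > 0);			/* parent side of a successful fork */
	AFTER_AUX_FORK_SLEEP();
	return pid;
}
```

Expanding the disabled case to ((void) 0) instead of nothing keeps a bare "AFTER_AUX_FORK_SLEEP();" a valid statement in every configuration, including as the sole body of an if.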