MauMau escribió:
> From: "Alvaro Herrera" <alvhe...@2ndquadrant.com>

> >Actually, in further testing I noticed that the fast-path you introduced
> >in BackendCleanup (or was it HandleChildCrash?) in the immediate
> >shutdown case caused postmaster to fail to clean up properly after
> >sending the SIGKILL signal, so I had to remove that as well.  Was there
> >any reason for that?
> 
> You are talking about the code below at the beginning of
> HandleChildCrash(), aren't you?
> 
> /* Do nothing if the child terminated due to immediate shutdown */
> if (Shutdown == ImmediateShutdown)
>  return;
> 
> If my memory is correct, this was necessary; without this,
> HandleChildCrash() or LogChildExit() left extra messages in the
> server log.
> 
> LOG:  %s (PID %d) exited with exit code %d
> LOG:  terminating any other active server processes
> 
> These messages are output because postmaster sends SIGQUIT to its
> children and wait for them to terminate.  The children exit with
> non-zero status when they receive SIGQUIT.

Yeah, I see that --- after removing that early exit, there are unwanted
messages.  And in fact there are some signals sent that weren't
previously sent.  Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?


Noah Misch escribió:
> On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:

> > the clients at the immediate shutdown.  The code gets much simpler.  In  
> > addition, it may be better that we similarly send SIGKILL in backend 
> > crash (FatalError) case, eliminate the use of SIGQUIT and remove 
> > quickdie() and other SIGQUIT handlers.
> 
> My take is that the client message has some value, and we shouldn't just
> discard it to simplify the code slightly.  Finishing the shutdown quickly has
> value, of course.  The relative importance of those things should guide the
> choice of a timeout under method #2, but I don't see a rigorous way to draw
> the line.  I feel five seconds is, if anything, too high.

There's obviously a lot of disagreement on 5 seconds being too high or
too low.  We have just followed SysV init's precedent of waiting 5
seconds by default between sending signals TERM and QUIT during a
shutdown.  I will note that during a normal shutdown, services are
entitled to do much more work than just signal all their children to
exit immediately; and yet I don't find much evidence that this period is
inordinately short.  I don't feel strongly that it couldn't be shorter,
though.

> What about using deadlock_timeout?  It constitutes a rough barometer on the
> system's tolerance for failure detection latency, and we already overload it
> by having it guide log_lock_waits.  The default of 1s makes sense to me for
> this purpose, too.  We can always add a separate immediate_shutdown_timeout if
> there's demand, but I don't expect such demand.  (If we did add such a GUC,
> setting it to zero could be the way to request method 1.  If some folks
> strongly desire method 1, that's probably the way to offer it.)

I dunno.  Having this be configurable seems overkill to me.  But perhaps
that's the way to satisfy most people: some people could set it very
high so that they could have postmaster wait longer if they believe
their server is going to be overloaded; people wishing immediate SIGKILL
could get that too, as you describe.

I think this should be a separate patch, however.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
***************
*** 1362,1372 **** echo -1000 > /proc/self/oom_score_adj
       <listitem>
        <para>
        This is the <firstterm>Immediate Shutdown</firstterm> mode.
!       The master <command>postgres</command> process will send a
!       <systemitem>SIGQUIT</systemitem> to all child processes and exit
!       immediately, without properly shutting itself down. The child processes
!       likewise exit immediately upon receiving
!       <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
        replaying the WAL log) upon next start-up. This is recommended
        only in emergencies.
        </para>
--- 1362,1372 ----
       <listitem>
        <para>
        This is the <firstterm>Immediate Shutdown</firstterm> mode.
!       The server will send <systemitem>SIGQUIT</systemitem> to all child
!       processes and wait for them to terminate.  Those that don't terminate
!       within 5 seconds, will be sent <systemitem>SIGKILL</systemitem> by the
!       master <command>postgres</command> process, which will then terminate
!       without further waiting.  This will lead to recovery (by
        replaying the WAL log) upon next start-up. This is recommended
        only in emergencies.
        </para>
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 275,280 **** static pid_t StartupPID = 0,
--- 275,281 ----
  #define			NoShutdown		0
  #define			SmartShutdown	1
  #define			FastShutdown	2
+ #define			ImmediateShutdown	3
  
  static int	Shutdown = NoShutdown;
  
***************
*** 345,350 **** typedef enum
--- 346,355 ----
  
  static PMState pmState = PM_INIT;
  
+ /* Start time of abort processing at immediate shutdown or child crash */
+ static time_t AbortStartTime;
+ #define SIGKILL_CHILDREN_AFTER_SECS		5
+ 
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
***************
*** 421,426 **** static void RandomSalt(char *md5Salt);
--- 426,432 ----
  static void signal_child(pid_t pid, int signal);
  static bool SignalSomeChildren(int signal, int targets);
  static bool SignalUnconnectedWorkers(int signal);
+ static void TerminateChildren(int signal);
  
  #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
  
***************
*** 1427,1434 **** DetermineSleepTime(struct timeval * timeout)
  	if (Shutdown > NoShutdown ||
  		(!StartWorkerNeeded && !HaveCrashedWorker))
  	{
! 		timeout->tv_sec = 60;
! 		timeout->tv_usec = 0;
  		return;
  	}
  
--- 1433,1448 ----
  	if (Shutdown > NoShutdown ||
  		(!StartWorkerNeeded && !HaveCrashedWorker))
  	{
! 		if (AbortStartTime > 0)
! 		{
! 			timeout->tv_sec = SIGKILL_CHILDREN_AFTER_SECS;
! 			timeout->tv_usec = 0;
! 		}
! 		else
! 		{
! 			timeout->tv_sec = 60;
! 			timeout->tv_usec = 0;
! 		}
  		return;
  	}
  
***************
*** 1660,1665 **** ServerLoop(void)
--- 1674,1701 ----
  			TouchSocketLockFiles();
  			last_touch_time = now;
  		}
+ 
+ 		/*
+ 		 * If we already sent SIGQUIT to children and they are slow to shut
+ 		 * down, it's time to send them SIGKILL.  This doesn't happen normally,
+ 		 * but under certain conditions backends can get stuck while shutting
+ 		 * down.  This is a last measure to get them unwedged.
+ 		 *
+ 		 * Note we also do this during recovery from a process crash.
+ 		 */
+ 		if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
+ 			now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
+ 		{
+ 			/* We were gentle with them before. Not anymore */
+ 			TerminateChildren(SIGKILL);
+ 
+ 			/*
+ 			 * Additionally, unless we're recovering from a process crash, it's
+ 			 * now the time for postmaster to abandon ship.
+ 			 */
+ 			if (!FatalError)
+ 				ExitPostmaster(1);
+ 		}
  	}
  }
  
***************
*** 2455,2484 **** pmdie(SIGNAL_ARGS)
  			/*
  			 * Immediate Shutdown:
  			 *
! 			 * abort all children with SIGQUIT and exit without attempt to
! 			 * properly shut down data base system.
  			 */
  			ereport(LOG,
  					(errmsg("received immediate shutdown request")));
! 			SignalChildren(SIGQUIT);
! 			if (StartupPID != 0)
! 				signal_child(StartupPID, SIGQUIT);
! 			if (BgWriterPID != 0)
! 				signal_child(BgWriterPID, SIGQUIT);
! 			if (CheckpointerPID != 0)
! 				signal_child(CheckpointerPID, SIGQUIT);
! 			if (WalWriterPID != 0)
! 				signal_child(WalWriterPID, SIGQUIT);
! 			if (WalReceiverPID != 0)
! 				signal_child(WalReceiverPID, SIGQUIT);
! 			if (AutoVacPID != 0)
! 				signal_child(AutoVacPID, SIGQUIT);
! 			if (PgArchPID != 0)
! 				signal_child(PgArchPID, SIGQUIT);
! 			if (PgStatPID != 0)
! 				signal_child(PgStatPID, SIGQUIT);
! 			SignalUnconnectedWorkers(SIGQUIT);
! 			ExitPostmaster(0);
  			break;
  	}
  
--- 2491,2517 ----
  			/*
  			 * Immediate Shutdown:
  			 *
! 			 * abort all children with SIGQUIT, wait for them to exit,
! 			 * terminate remaining ones with SIGKILL, then exit without
! 			 * attempt to properly shut down the data base system.
  			 */
+ 			if (Shutdown >= ImmediateShutdown)
+ 				break;
+ 			Shutdown = ImmediateShutdown;
  			ereport(LOG,
  					(errmsg("received immediate shutdown request")));
! 
! 			TerminateChildren(SIGQUIT);
! 			pmState = PM_WAIT_BACKENDS;
! 
! 			/* set stopwatch for them to die */
! 			AbortStartTime = time(NULL);
! 
! 			/*
! 			 * Now wait for backends to exit.  If there are none,
! 			 * PostmasterStateMachine will take the next step.
! 			 */
! 			PostmasterStateMachine();
  			break;
  	}
  
***************
*** 3180,3185 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 3213,3224 ----
  		pmState == PM_WAIT_READONLY ||
  		pmState == PM_SHUTDOWN)
  		pmState = PM_WAIT_BACKENDS;
+ 
+ 	/*
+ 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
+ 	 * for us to kill them without mercy.
+ 	 */
+ 	AbortStartTime = time(NULL);
  }
  
  /*
***************
*** 3316,3322 **** PostmasterStateMachine(void)
  			WalWriterPID == 0 &&
  			AutoVacPID == 0)
  		{
! 			if (FatalError)
  			{
  				/*
  				 * Start waiting for dead_end children to die.	This state
--- 3355,3361 ----
  			WalWriterPID == 0 &&
  			AutoVacPID == 0)
  		{
! 			if (Shutdown >= ImmediateShutdown || FatalError)
  			{
  				/*
  				 * Start waiting for dead_end children to die.	This state
***************
*** 3326,3332 **** PostmasterStateMachine(void)
  
  				/*
  				 * We already SIGQUIT'd the archiver and stats processes, if
! 				 * any, when we entered FatalError state.
  				 */
  			}
  			else
--- 3365,3372 ----
  
  				/*
  				 * We already SIGQUIT'd the archiver and stats processes, if
! 				 * any, when we started immediate shutdown or entered
! 				 * FatalError state.
  				 */
  			}
  			else
***************
*** 3511,3516 **** signal_child(pid_t pid, int signal)
--- 3551,3557 ----
  		case SIGTERM:
  		case SIGQUIT:
  		case SIGSTOP:
+ 		case SIGKILL:
  			if (kill(-pid, signal) < 0)
  				elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) (-pid), signal);
  			break;
***************
*** 3598,3603 **** SignalSomeChildren(int signal, int target)
--- 3639,3671 ----
  }
  
  /*
+  * Send a termination signal to children.  This considers all of our children
+  * processes, except syslogger and dead_end backends.
+  */
+ static void
+ TerminateChildren(int signal)
+ {
+ 	SignalChildren(signal);
+ 	if (StartupPID != 0)
+ 		signal_child(StartupPID, signal);
+ 	if (BgWriterPID != 0)
+ 		signal_child(BgWriterPID, signal);
+ 	if (CheckpointerPID != 0)
+ 		signal_child(CheckpointerPID, signal);
+ 	if (WalWriterPID != 0)
+ 		signal_child(WalWriterPID, signal);
+ 	if (WalReceiverPID != 0)
+ 		signal_child(WalReceiverPID, signal);
+ 	if (AutoVacPID != 0)
+ 		signal_child(AutoVacPID, signal);
+ 	if (PgArchPID != 0)
+ 		signal_child(PgArchPID, signal);
+ 	if (PgStatPID != 0)
+ 		signal_child(PgStatPID, signal);
+ 	SignalUnconnectedWorkers(signal);
+ }
+ 
+ /*
   * BackendStartup -- start backend process
   *
   * returns: STATUS_ERROR if the fork failed, STATUS_OK otherwise.
*** a/src/port/kill.c
--- b/src/port/kill.c
***************
*** 38,43 **** pgkill(int pid, int sig)
--- 38,63 ----
  		errno = EINVAL;
  		return -1;
  	}
+ 
+ 	/* special case for SIGKILL: just ask the system to terminate the target */
+ 	if (sig == SIGKILL)
+ 	{
+ 		HANDLE prochandle;
+ 
+ 		if ((prochandle = OpenProcess(PROCESS_TERMINATE, FALSE, (DWORD) pid)) == NULL)
+ 		{
+ 			errno = ESRCH;
+ 			return -1;
+ 		}
+ 		if (!TerminateProcess(prochandle, 255))
+ 		{
+ 			_dosmaperr(GetLastError());
+ 			CloseHandle(prochandle);
+ 			return -1;
+ 		}
+ 		CloseHandle(prochandle);
+ 		return 0;
+ 	}
  	snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u", pid);
  
  	if (CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, 1000))
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to