Re: FW: [PATCH] fix child reclaim timing

Noah Arliss Tue, 17 Aug 2004 09:24:00 -0700

I took a look at the worker and prefork MPMs, and it seems that
make_child registers the just_die function as the SIGTERM handler for
child processes. just_die does try to do a graceful shutdown by calling
clean_child_exit, but what happens if clean_child_exit gets called
twice?
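
To make the concern concrete, here is a minimal sketch (Python, not httpd
code) of one way an exit routine can tolerate being entered twice. The
names ChildExit and run are hypothetical, and nothing here claims that
clean_child_exit has or needs such a guard; that is the open question.

```python
class ChildExit:
    """Hypothetical stand-in for a child's exit path (not httpd's code)."""

    def __init__(self, cleanups):
        self._cleanups = list(cleanups)  # callables to run once at exit
        self._started = False            # set on the first entry
        self.runs = 0                    # how many times cleanup really ran

    def run(self):
        # A second entry (for example from a repeated SIGTERM) returns
        # immediately instead of running the cleanups again.
        if self._started:
            return False
        self._started = True
        for cleanup in self._cleanups:
            cleanup()
        self.runs += 1
        return True
```

With a guard like this, a second call is a no-op; without one, each cleanup
would run again on resources that may already be gone.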


Hopefully I'm just reading the code wrong and don't fully understand
the MPMs yet. As for the timing you've described below, if SIGTERM is
really safe to send multiple times to a child process, I like it much
better as well.

-Noah

 -----Original Message-----
 From: Jeff Trawick [mailto:[EMAIL PROTECTED]
 Sent: Friday, August 13, 2004 6:26 PM
 To: [EMAIL PROTECTED]
 Subject: Re: [PATCH] fix child reclaim timing
 
 On Fri, 13 Aug 2004 16:48:42 -0400, Arliss, Noah <[EMAIL PROTECTED]>
 wrote:
 > I'd like to comment further... Not only is a disturbing message sent to the
 > error log, but a SIGTERM is also sent to the child process. If I understand
 > correctly the SIGTERM will likely interrupt any properly implemented child
 > process shutdown and the child process will exit ungracefully.
 
for worker MPM, at least:
 
child processes have a SIGTERM handler that simply sets a flag and
returns to whatever was happening before; it will be the main thread
of a child that receives a message via another mechanism which tells
it to wake up and decide to exit
 
the SIGTERM isn't expected to interrupt any important processing going
on in the child (be it worker threads or child exit hook)
 
SIGTERM is sent multiple times to work around any signal loss or other
glitch (not sure when this is effective in reality); I don't see how
it is harmful to any code that must run
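
a rough sketch of the flag-setting pattern described above, in Python
rather than the MPM's C; ShutdownFlag, handle, install and serve are
made-up names, and the real worker MPM wakes its main thread through a
separate mechanism that is not modeled here:

```python
import signal


class ShutdownFlag:
    """Hypothetical holder for the 'please exit' state of one child."""

    def __init__(self):
        self.requested = False
        self.signals_seen = 0

    def handle(self, signum, frame):
        # Only record the request and return to whatever was interrupted.
        self.requested = True
        self.signals_seen += 1

    def install(self):
        # Register for SIGTERM (main thread only); returns the old handler.
        return signal.signal(signal.SIGTERM, self.handle)


def serve(flag, work_items):
    # The main loop checks the flag between units of work and decides
    # for itself when to stop; the handler never tears anything down.
    done = []
    for item in work_items:
        if flag.requested:
            break
        done.append(item)
    return done
```

because the handler only sets a flag, a second or third SIGTERM changes
nothing beyond the counter, which is why repeating it neither helps nor
hurts in this model.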
 
the SIGKILL is what yanks the rug out from under the child and any
child exit hooks; the web server simply must exit in a reasonable
timeframe if the administrator tells it to, stuck code or not
 
> If it's acceptable to wait longer then the kill call should also be postponed
> to give modules a chance to clean up gracefully. If any module has complex IPC
> or Mutexes in use, graceful shutdown is important especially if
> MaxRequestsPerChild is in use on a server with heavy load.

yes, the SIGKILL is the measure of last resort; shouldn't be sent for
a while after we start shutting down

here is a current example:
 
(I don't actually know when shutdown started; I should add a debug msg
there; but it is a very short time before this uninteresting mess
starts)
[Mon Jun 14 09:15:11 2004] [warn] child process 3906 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3907 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3924 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3925 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3926 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3906 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3907 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3924 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3925 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:12 2004] [warn] child process 3926 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:13 2004] [warn] child process 3906 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:13 2004] [warn] child process 3907 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:13 2004] [warn] child process 3924 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:13 2004] [warn] child process 3925 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:13 2004] [warn] child process 3926 still did not exit, sending a SIGTERM
[Mon Jun 14 09:15:14 2004] [error] [client 127.0.0.1] request failed: error reading the headers
[Mon Jun 14 09:15:15 2004] [info] (9)Bad file number: core_output_filter: writing data to the network
[Mon Jun 14 09:15:17 2004] [error] child process 3906 still did not exit, sending a SIGKILL
[Mon Jun 14 09:15:34 2004] [info] removed PID file /export/home/trawick/inst/20/logs/httpd.pid (pid=3903)
[Mon Jun 14 09:15:34 2004] [notice] caught SIGTERM, shutting down
 
if SIGTERM simply sets a flag and returns, what is the use of repeating
the SIGTERM over and over?  for worker MPM it doesn't help or hurt
AFAICT; worker does something else to wake up its children prior to
calling the code Joe has a patch for
 
this sounds a bit more sane to me for timing, as long as we can exit
as soon as all children have exited:
 
shutdown + 0:
  send SIGTERM
shutdown + 4:
  for each child still remaining, bitch to error log and send SIGTERM again
shutdown + 8:
  for each child still remaining, bitch to error log and send SIGTERM again
shutdown + 12:
  for each child still remaining, bitch to error log and send SIGKILL
shutdown + 16:
  for each child still remaining, bitch to error log, send SIGKILL again,
  and exit anyway
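
the schedule above can be simulated in a few lines of Python; this is
only a sketch of the proposal, not the patch. the offsets are assumed
to be seconds, STEPS and simulate_reclaim are hypothetical names, and
child exit times are supplied by the caller instead of being observed
with waitpid:

```python
# (offset from shutdown start, signal to send); offsets assumed to be seconds.
# Every step after the first would also write a line to the error log.
STEPS = [
    (0, "SIGTERM"),
    (4, "SIGTERM"),
    (8, "SIGTERM"),
    (12, "SIGKILL"),
    (16, "SIGKILL"),
]


def simulate_reclaim(exit_at):
    """exit_at maps pid -> offset at which that child exits (None = never).

    Returns (events, parent_exit): the (offset, pid, signal) tuples that
    would be sent, and the offset at which the parent stops waiting.
    """
    events = []
    for offset, sig in STEPS:
        remaining = sorted(
            pid for pid, t in exit_at.items() if t is None or t > offset
        )
        if not remaining:
            # All children are gone, so the parent can exit as soon as
            # the last one did, without waiting for the next step.
            return events, max(exit_at.values(), default=0)
        for pid in remaining:
            events.append((offset, pid, sig))
    # Some child never went away: give up after the final SIGKILL.
    return events, STEPS[-1][0]
```

for example, simulate_reclaim({1: 2, 2: 6}) sends SIGTERM to both
children at 0, to child 2 again at 4, and lets the parent exit at 6
without ever reaching the SIGKILL steps.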
 
if somebody suspects that sending SIGTERM every second is going to
help some MPM+platform, that would be great to know
