
Re: shutdown and linux poll()

2006-02-20 Thread Chris Darroch

Hi --

   I've crafted what seems to me like a reasonably minimal set of
patches to deal with the issue I described in this thread:

http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=113986864730305&w=2

   The crux of the problem is that on Linux, when using httpd with
the worker MPM (and probably the event MPM too), hard restarts and
shutdowns often end up sending SIGKILL to httpd child processes
because those processes are waiting for their worker threads to
finish polling on Keep-Alive connections.

   Apparently, on most OSes, if one thread closes a socket descriptor
then other threads polling on it immediately get a return value.
This certainly seems to be the case on Solaris.  But on Linux,
worker threads polling on their sockets in apr_wait_for_io_or_timeout()
don't get an error return value until the full (usually 15 second)
Keep-Alive timeout period is up.  The main httpd process deems that
too long, and issues SIGKILL to the child processes.

   For me personally, the consequence is that all my nice cleanup
handlers registered against the memory pool that's passed during the
child_init stage never get called.  This is particularly painful
if one is hoping to, for example, cleanly shut down DB connections
that one has opened with mod_dbd/apr_dbd.  In the case of mod_dbd,
it opens its reslist of apr_dbd connections against the pool
passed in the child_init stage, which with the worker MPM is its
pchild pool.  When SIGKILL is applied, the apr_pool_destroy(pchild)
call is often not reached, so DB disconnections don't occur;
even if I'm trying to shut down httpd in a hurry, I don't really
want that to happen if at all possible.

   Without further ado, then, my initial patches.  These are
Unix-only at the moment; I have little experience with other OSes.
If anyone wants to propose something better, and/or suggest
changes, that would be superb.  In the meantime, since these
work for me, I'll start applying them against APR and httpd for
my own use.

   First, the APR patches (against trunk):

===
--- include/apr_network_io.h.orig   2006-02-20 16:20:44.841609000 -0500
+++ include/apr_network_io.h        2006-02-20 16:24:19.99359 -0500
@@ -99,6 +99,7 @@
 * until data is available.
 * @see apr_socket_accept_filter
 */
+#define APR_INTERRUPT_WAIT  65536 /**< Return from IO wait on interrupt */
 /** @} */


--- network_io/unix/sockopt.c.orig  2006-02-17 11:24:13.058691778 -0500
+++ network_io/unix/sockopt.c   2006-02-17 11:28:08.910410867 -0500
@@ -318,6 +318,9 @@
 return APR_ENOTIMPL;
 #endif
 break;
+case APR_INTERRUPT_WAIT:
+apr_set_option(sock, APR_INTERRUPT_WAIT, on);
+break;
 default:
 return APR_EINVAL;
 }

--- support/unix/waitio.c.orig  2005-07-09 03:07:17.0 -0400
+++ support/unix/waitio.c   2006-02-17 11:23:42.620856949 -0500
@@ -49,7 +49,8 @@

 do {
 rc = poll(&pfd, 1, timeout);
-} while (rc == -1 && errno == EINTR);
+} while (rc == -1 && errno == EINTR &&
+ (f || !apr_is_option_set(s, APR_INTERRUPT_WAIT)));
 if (rc == 0) {
 return APR_TIMEUP;
 }

===
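
   The waitio.c change above boils down to one extra condition on the
EINTR retry loop.  The following standalone sketch shows the same idea
without APR: wait_for_readable(), the wait_result codes and the
interrupt_wait argument are hypothetical names standing in for
apr_wait_for_io_or_timeout(), apr_status_t values and the proposed
APR_INTERRUPT_WAIT socket option.  It is a simplification, not the
patch itself (for instance, it ignores the file case that the "f ||"
test covers).

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; APR returns apr_status_t. */
enum wait_result { WAIT_READY, WAIT_TIMEUP, WAIT_INTERRUPTED, WAIT_ERROR };

/*
 * Wait until fd is readable or timeout_ms elapses.
 * If interrupt_wait is zero, a signal (EINTR) simply restarts the poll(),
 * as in the unpatched loop.  If it is nonzero, EINTR ends the wait so the
 * caller can notice that a shutdown is in progress.
 */
enum wait_result wait_for_readable(int fd, int timeout_ms, int interrupt_wait)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR && !interrupt_wait);

    if (rc == 0)
        return WAIT_TIMEUP;
    if (rc > 0)
        return WAIT_READY;
    return (errno == EINTR) ? WAIT_INTERRUPTED : WAIT_ERROR;
}

/* Usage check: an empty pipe times out, a pipe holding data is ready. */
int demo_wait(void)
{
    int p[2];
    int ok;

    if (pipe(p) != 0)
        return 0;
    ok = wait_for_readable(p[0], 50, 1) == WAIT_TIMEUP;
    ok = ok && write(p[1], "x", 1) == 1;
    ok = ok && wait_for_readable(p[0], 50, 1) == WAIT_READY;
    close(p[0]);
    close(p[1]);
    return ok;
}
```

   Keeping the flag per socket, as the patch does with an option bit,
means only the sockets touched by close_worker_sockets() change
behaviour; every other caller keeps the retry-on-EINTR semantics.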

   Second, the httpd patches (also against trunk):

===
--- server/mpm/worker/worker.c.orig 2006-02-20 16:26:55.302701000 -0500
+++ server/mpm/worker/worker.c  2006-02-20 16:46:44.764980568 -0500
@@ -213,6 +213,19 @@
  */
 #define LISTENER_SIGNAL SIGHUP

+/* The WORKER_SIGNAL signal will be sent from the main thread to the
+ * worker threads after APR_INTERRUPT_WAIT is set true on their sockets.
+ * This ensures that on systems (i.e., Linux) where closing the worker
+ * socket doesn't awake the worker thread when it is polling on the socket
+ * (especially in apr_wait_for_io_or_timeout() when handling
+ * Keep-Alive connections), close_worker_sockets() and join_workers()
+ * still function in a timely manner and allow ungraceful shutdowns to
+ * proceed to completion.  Otherwise join_workers() doesn't return
+ * before the main process decides the child process is non-responsive
+ * and sends a SIGKILL.
+ */
+#define WORKER_SIGNAL   AP_SIG_GRACEFUL
+
 /* An array of socket descriptors in use by each thread used to
  * perform a non-graceful (forced) shutdown of the server. */
 static apr_socket_t **worker_sockets;
@@ -222,6 +235,7 @@
 int i;
 for (i = 0; i < ap_threads_per_child; i++) {
 if (worker_sockets[i]) {
+apr_socket_opt_set(worker_sockets[i], APR_INTERRUPT_WAIT, 1);
 apr_socket_close(worker_sockets[i]);
 worker_sockets[i] = NULL;
 }
@@ -822,6 +836,11 @@
 ap_scoreboard_image->servers[process_slot][thread_slot].generation = ap_my_generation;
 ap_update_child_status_from_indexes(process_slot, thread_slot, SERVER_STAR

Re: shutdown and linux poll()

2006-02-14 Thread Chris Darroch

Hi --

>>Does anyone have any advice?  Does this seem like a problem
>> to be addressed?  I tried to think through how one could signal
>> the poll()ing worker threads with pthread_kill(), but it seems
>> to me that not only would you have to have a signal handler
>> in the worker threads (not hard), you'd somehow have to break
>> out of whatever APR wrappers are abstracting the poll() once
>> the handler set its flag or whatever and returned -- the APR
>> functions can't just loop on EINTR anymore.  (Is it
>> socket_bucket_read() in the socket bucket code and then
>> apr_socket_recv()?  I can't quite tell yet.)  Anyway, it seemed
>> complex and likely to break the abstraction across OSes.
>> 
>>Still, I imagine I'm not the only one who would really like
>> those worker threads to cleanly exit so everything else does ...
>> after all, they're not doing anything critical, just waiting
>> for the Keep-Alive timeout to expire, after which they notice
>> their socket is borked and exit.

Paul Querna wrote:

> To clarify, are you sure it's not using EPoll instead of Poll?

   The culprit is the poll() inside apr_wait_for_io_or_timeout(),
which is indeed being called from within apr_socket_recv().  The
stack is, basically:

apr_wait_for_io_or_timeout()
apr_socket_recv()
socket_bucket_read()
apr_bucket_read()
ap_rgetline_core()
ap_rgetline()
read_request_line()
ap_read_request()
ap_process_http_connection()

   Here's the tail of my strace, after I hacked on waitio.c to
spit out a write() just before and after polling:

11:47:21.757774 write(15, "about to poll with timeout 15000\n", 33) = 33
11:47:21.757877 close(15)   = 0
11:47:21.757943 munmap(0xb7fff000, 4096) = 0
11:47:21.758016 poll([{fd=14, events=POLLIN, revents=POLLNVAL}], 1,
15000) = 1
11:47:33.261025 +++ killed by SIGKILL +++

   I'd really love to hear opinions on this.  Would anyone like
a patch to make ap_reclaim_child_processes() wait first for the
maximum configured Keep-Alive period?

   If that's too hacky, then what's the consensus -- ignore the
issue, or try to invent a way for the worker (and event?) MPMs
to signal their worker threads?  It would seem to me that,
other than major surgery on APR, the ideal would be for the
signal handler to perform this snippet from the tail of worker_thread():

ap_update_child_status_from_indexes(process_slot, thread_slot,
(dying) ? SERVER_DEAD : SERVER_GRACEFUL, (request_rec *) NULL);

apr_thread_exit(thd, APR_SUCCESS);

or at a bare minimum, the apr_thread_exit().  But I'm not sure
offhand if having signal handlers perform thread exits is possible;
I feel like it's verboten.

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B

Re: shutdown and linux poll()

2006-02-13 Thread Chris Darroch

Paul:

>>This may be an old topic of conversation, in which case I apologize.
>> I Googled and searched marc.theaimsgroup.com and Apache Bugzilla but
>> didn't see anything, so here I am with a question.
>>
>>In brief, on Linux, when doing an ungraceful stop of httpd, any
>>  worker threads that are poll()ing on Keep-Alive connections don't get
>> awoken by close_worker_sockets() and that can lead to the process
>> getting the SIGKILL signal without ever getting the chance to run
>> apr_pool_destroy(pchild) in clean_child_exit().  This seems to
>> relate to this particular choice by the Linux and/or glibc folks:
>>
>> http://bugme.osdl.org/show_bug.cgi?id=546

> To clarify, are you sure it's not using EPoll instead of Poll?

   Well, I'll probe more deeply tomorrow, and while I'm no expert
on this stuff, I don't think so.  Here are the last two lines from
an strace on one of the worker threads:

21:39:30.955670 poll([{fd=13, events=POLLIN, revents=POLLNVAL}], 1,
15000) = 1
21:39:42.257615 +++ killed by SIGKILL +++

   That's the poll() on descriptor 13, for 15 keep-alive seconds,
during which the main process decides to do the SIGKILL.  Here,
I think, is the accept() that opens fd 13:

21:38:51.017764 accept(3, {sa_family=AF_INET, sin_port=htons(63612),
  sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, [16]) = 13

and while I do see some epoll stuff, it's on another descriptor:

21:38:43.012242 epoll_create(1) = 12

   Now, the caveat here is that I'm learning as I go; sockets
are not really my strong point.  But it's fairly easy to reproduce
this behaviour with a stock Apache 2.0 or 2.2 on a RedHat system;
I've tried both.  I can certainly provide more details if requested;
let me know!  Thanks,

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B

Re: shutdown and linux poll()

2006-02-13 Thread Paul Querna


To clarify, are you sure it's not using EPoll instead of Poll?


Chris Darroch wrote:

Hi --

   This may be an old topic of conversation, in which case I apologize.
I Googled and searched marc.theaimsgroup.com and Apache Bugzilla but
didn't see anything, so here I am with a question.

   In brief, on Linux, when doing an ungraceful stop of httpd, any
 worker threads that are poll()ing on Keep-Alive connections don't get
awoken by close_worker_sockets() and that can lead to the process
getting the SIGKILL signal without ever getting the chance to run
apr_pool_destroy(pchild) in clean_child_exit().  This seems to
relate to this particular choice by the Linux and/or glibc folks:

http://bugme.osdl.org/show_bug.cgi?id=546


   The backstory goes like this: I spent a chunk of last week trying
to figure out why my module wasn't shutting down properly.  First I
found some places in my code where I'd failed to anticipate the order
in which memory pool cleanup functions would be called, especially
those registered by apr_thread_cond_create().

   However, after fixing that, I found that when connections were still
in the 15 second timeout for Keep-Alives, a child process could get the
SIGKILL before it finished cleaning up.  (I'm using httpd 2.2.0 with the
worker MPM on Linux 2.6.9 [RHEL 4] with APR 1.2.2.)  The worker threads
are poll()ing and, if I'm reading my strace files correctly, they don't
get an EBADF until after the timeout completes.  That means that
join_workers() is waiting for those threads to exit, so child_main()
can't finish up and call clean_child_exit() and thus apr_pool_destroy()
on the pchild memory pool.

   This is a bit of a problem for me because I really need
join_workers() to finish up and the cleanups I've registered
against pchild in my module's child_init handler to be run if
at all possible.

   It was while researching all this that I stumbled on the amazing
new graceful-stop feature and submitted #38621, which I see has
already been merged ... thank you!

   However, if I need to do an ungraceful stop of the server --
either manually or because the GracefulShutdownTimeout has
expired without a chance to gracefully stop -- I'd still like my
cleanups to run.


   My solution at the moment is a pure hack -- I threw in
apr_sleep(apr_time_from_sec(15)) right before
ap_reclaim_child_processes(1) in ap_mpm_run() in worker.c.
That way it lets all the Keep-Alive timeouts expire before
applying the SIGTERM/SIGKILL hammer.  But that doesn't seem
ideal, and moreover, doesn't take into account the fact that
KeepAliveTimeouts > 15 seconds may have been assigned.  Even
if I expand my hack to wait for the maximum possible Keep-Alive
timeout, it's still clearly a hack.


   Does anyone have any advice?  Does this seem like a problem
to be addressed?  I tried to think through how one could signal
the poll()ing worker threads with pthread_kill(), but it seems
to me that not only would you have to have a signal handler
in the worker threads (not hard), you'd somehow have to break
out of whatever APR wrappers are abstracting the poll() once
the handler set its flag or whatever and returned -- the APR
functions can't just loop on EINTR anymore.  (Is it
socket_bucket_read() in the socket bucket code and then
apr_socket_recv()?  I can't quite tell yet.)  Anyway, it seemed
complex and likely to break the abstraction across OSes.

   Still, I imagine I'm not the only one who would really like
those worker threads to cleanly exit so everything else does ...
after all, they're not doing anything critical, just waiting
for the Keep-Alive timeout to expire, after which they notice
their socket is borked and exit.

   FWIW, I tested httpd 2.2.0 with the worker MPM on a Solaris
2.9 box and it does indeed do what the Linux "bug" report says;
poll() returns immediately if another thread closes the socket
and thus the whole httpd server exits right away.

   Thoughts, advice?  Any comments appreciated.

Chris.
