Thank you so much for the reply. Here are a couple of examples, as I'm not completely sure if my symptoms match, though the pstacks do look very similar to my untrained eye:
Here is a two day-old child:

27743: /usr/local/apache2/bin/httpd -k start
----------------- lwp# 1 / thread# 1 --------------------
 ff00a42c lwp_wait (3, ffbff804)
 ff001e88 _thrp_join (3, 0, ffbff86c, 1, ff0b2780, ffbff804) + 38
 ff214544 apr_thread_join (ffbff8ec, 32eea8, 7, 0, dc328, b15e0) + c
 0008c43c join_workers (0, fe3aa8, 8bfcc, 32ec30, 0, 1) + ec
 0008c790 child_main (2, 8b31c, 0, feee2a40, ff0b2840, ff0b2780) + 270
 0008c970 make_child (c7800, 2, 0, c8800, c7000, c8400) + 128
 0008d1b4 ap_mpm_run (fe4100f8, e, 0, 1, 27, 1) + 754
 000343c0 main (d6218, d8190, ffbffc54, c7800, c7800, 0) + 79c
 00033754 _start (0, 0, 0, 0, 0, 0) + 5c
----------------- lwp# 3 / thread# 3 --------------------
 ff0058d4 lwp_park (0, 0, 0)
 fefff6e8 cond_wait_queue (32ecc8, 32ec98, 0, 0, 0, 0) + 4c
 fefffd30 cond_wait (32ecc8, 32ec98, 0, 0, fe460a40, 0) + 10
 fefffd6c pthread_cond_wait (32ecc8, 32ec98, 0, 0, 32ec98, 0) + 8
 0008e674 ap_queue_pop (32ec78, fe30bf1c, fe30bf18, 4, 0, 32ee40) + 64
 0008be1c worker_thread (32eea8, 2, fe460a40, c8400, c8400, 0) + 10c
 ff21440c dummy_worker (32eea8, 0, 0, fe460a40, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 4 / thread# 4 --------------------
 ff0058d4 lwp_park (0, 0, 0)
 fefff6e8 cond_wait_queue (32ecc8, 32ec98, 0, 0, 0, 0) + 4c
 fefffd30 cond_wait (32ecc8, 32ec98, 0, 0, fe461240, 11692d8) + 10
 fefffd6c pthread_cond_wait (32ecc8, 32ec98, 0, 0, 32ec98, 0) + 8
 0008e674 ap_queue_pop (32ec78, fe20bf1c, fe20bf18, 0, 0, 32ee40) + 64
 0008be1c worker_thread (32eec8, 2, fe461240, c8400, c8400, 4) + 10c
 ff21440c dummy_worker (32eec8, 0, 0, fe461240, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)

...and several more in lwp_park.
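For comparing several of these dumps, a rough tally of what each thread is doing can help. The following is only a sketch in Python, and it makes some assumptions: summarize_pstack is a made-up helper name, the pstack output is assumed to be saved to a text file with one frame per line (as pstack prints it), and a thread is labelled by its innermost frame, or as "zombie" when pstack marks it that way.

```python
import re
import sys
from collections import Counter

# Thread header as printed by pstack, e.g.
# "----------------- lwp# 3 / thread# 3 --------------------"
HEADER = re.compile(r"^\s*-+\s*lwp#\s*\d+\s*/\s*thread#\s*\d+\s*-+\s*$")
# Stack frame: hex address, function name, "(", e.g. "ff0058d4 lwp_park (0, 0, 0)"
FRAME = re.compile(r"^\s*[0-9a-f]+\s+([A-Za-z_]\w*)\s*\(")


def summarize_pstack(text):
    """Count threads by innermost frame; zombie threads go under 'zombie'."""
    sections = []   # one list of lines per thread
    current = None  # lines of the thread section being read
    for line in text.splitlines():
        if HEADER.match(line):
            current = []
            sections.append(current)
        elif current is not None and line.strip():
            current.append(line)

    counts = Counter()
    for lines in sections:
        if any("zombie" in entry for entry in lines):
            counts["zombie"] += 1
            continue
        for entry in lines:
            match = FRAME.match(entry)
            if match:
                # The first frame listed is the innermost one.
                counts[match.group(1)] += 1
                break
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is illustrative): python3 pstack_summary.py child.txt
    with open(sys.argv[1]) as handle:
        for name, number in summarize_pstack(handle.read()).most_common():
            print(number, name)
```

A child whose threads are nearly all in lwp_park or nearly all zombies, with thread 1 sitting in lwp_wait, would stand out quickly in that summary. It says nothing about why the threads ended up there.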
And here's another one that's a day old, but looks different (including lots of jk references):

7934: /usr/local/apache2/bin/httpd -k start
----------------- lwp# 1 / thread# 1 --------------------
 ff00a42c lwp_wait (6, ffbff80c)
 ff001e88 _thrp_join (6, 0, ffbff874, 1, ff0b2780, ffbff80c) + 38
 ff214544 apr_thread_join (ffbff8f4, 28e228, 2, 0, 1, b1600) + c
 0008c43c join_workers (c, 3c5f38, 8bfcc, 28df50, 0, 1) + ec
 0008c790 child_main (0, 8b31c, 0, feee2a40, ff0b2840, ff0b2780) + 270
 0008c970 make_child (c7800, 0, 0, c8800, c7000, c8400) + 128
 0008d1b4 ap_mpm_run (fe4100f8, e, 0, 1, 26, 1) + 754
 000343c0 main (d6218, d8190, ffbffc5c, c7800, c7800, 0) + 79c
 00033754 _start (0, 0, 0, 0, 0, 0) + 5c
----------------- lwp# 6 / thread# 6 --------------------
 ff00a14c read (15, fe00a908, 4)
 fe4a87dc jk_tcp_socket_recvfull (15, fe00a908, 4, 2e4bf8, 510, 4ec) + 74
 fe4c3088 ajp_connection_tcp_get_message (35f130, 35f168, 2e4bf8, 361188, 2000, 2064) + 44
 fe4c5588 ajp_get_reply (361168, fe00bb50, 2e4bf8, 35f130, fe00aa70, 2028) + 9c
 fe4c9304 ajp_service (361168, fe00bb50, 2e4bf8, fe00ab38, 1, c00) + 22b8
 fe4a1234 jk_handler (23c, 35e740, 3f4390, 1, 13, 3544c8) + 9e4
 00047534 ap_run_handler (3f40a0, 0, 11, 3e7028, 3f5a08, 0) + 3c
 000479c0 ap_invoke_handler (3f40a0, 9d000, 3f40a0, 0, fe410028, 0) + c0
 00073aa4 ap_process_request (3f40a0, 3, 4, 3f40a0, c8420, 21d8d8) + 160
 00070b34 ap_process_http_connection (3d52e8, 3d5038, 3d5038, 3, c8420, 211980) + 10c
 0004dce8 ap_run_process_connection (3d52e8, 3d5038, 3d5038, 3, 3d52e0, 3d7068) + 3c
 0008bf1c worker_thread (28e228, 0, fe462240, c8400, c8400, c) + 20c
 ff21440c dummy_worker (28e228, 0, 0, fe462240, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 7 / thread# 7 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 8 / thread# 8 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 9 / thread# 9 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 10 / thread# 10 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 11 / thread# 11 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 12 / thread# 12 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 13 / thread# 13 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **
----------------- lwp# 14 / thread# 14 --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
 ** zombie (exited, not detached, not yet joined) **

...and so on...

If anyone has the time to confirm my case is a match I'd be very grateful but this patch looks promising!

Thank you VERY MUCH!

-Chris

-----Original Message-----
From: Rainer Jung [mailto:rainer.j...@kippdata.de]
Sent: Saturday, June 22, 2013 12:31 PM
To: users@tomcat.apache.org
Subject: Re: Abandoned apache children with mod_jk

On 21.06.2013 19:47, Chris Boyce wrote:
> Hello,
>
> I'm running apache 2.2.24 (worker MPM) with mod_jk 1.2.37 under Solaris 11,
> compiled as follows (from config.log):
>
> --with-included-apr --with-mpm=worker --enable-so --enable-rewrite
> --enable-headers --enable-proxy --enable-proxy-http --enable-expires
> --enable-nonportable-atomics=yes --disable-include --disable-autoindex
> --disable-imap --disable-userdir CC=/usr/sfw/bin/gcc
>
> We are running Tomcat 7.0.32.
>
> Since moving to Solaris 11 I'm noticing over time that apache children are
> getting left in an idle state (and usually not showing up on the scoreboard
> at all) when doing graceful restarts. If I do a hard restart, the error_log
> notes that the process had to be forcibly killed:
>
> [Wed May 15 11:41:24 2013] [warn] child process 10057 still did not exit, sending a SIGTERM
> [Wed May 15 11:41:26 2013] [error] child process 10057 still did not exit, sending a SIGKILL
>
> If I let apache go unchecked, it will eventually stop passing traffic
> completely and a hard restart is required. Example ps output looks like this:
>
> nobody 24429 20925 0 11:43:59 ? 0:02 /usr/local/apache2/bin/httpd -k start
> nobody  9750 20925 0 23:59:02 ? 0:00 /usr/local/apache2/bin/httpd -k start
> nobody 20925  2440 0   May 15 ? 3:07 /usr/local/apache2/bin/httpd -k start
> nobody 24689 20925 0 11:47:52 ? 0:00 /usr/local/apache2/bin/httpd -k start
> nobody 24628 20925 0 11:46:18 ? 0:01 /usr/local/apache2/bin/httpd -k start
> nobody 24428 20925 0 11:43:39 ? 0:02 /usr/local/apache2/bin/httpd -k start
>
> Note PID 9750 is lingering, doing nothing according to pfiles and truss, and
> its timestamp coincides with the last graceful restart (log rotation). Two
> main differences between this web server and ones that are working include:
>
> a) This is Solaris 11 (vs. Solaris 10)
> b) I have hardened apache by putting it in a Solaris 11 zone, and I'm
> starting apache as the "nobody" user with the net_privaddr privilege so it
> can function as the parent process. It talks to Tomcat on another zone and
> everything works great (other than the problem described here).
>
> Apache has permission to write to /logs, and /logs/apache2 is where I set
> these:
>
> JkLogFile /logs/apache2/mod_jk.log
> JkShmFile /logs/apache2/jk-runtime-status
>
> And this:
>
> PidFile /logs/apache2/run/httpd.pid
>
>
> Can anyone think of a reason why children are not being recycled or getting
> stranded like this over successive graceful restarts? We do use multiple
> listeners, so I don't know if I'm dealing with a locking/mutex/serialization
> type of issue. I'm not a C programmer. There seems to be little info out
> there for Solaris platforms that's recent.
>
> I'd be happy to post more info if needed. I appreciate your time.

What does "pstack" show for such an abandoned child? Maybe another occurrence of
https://issues.apache.org/bugzilla/show_bug.cgi?id=49504.

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org