Thank you so much for the reply.  Here are a couple of examples, as I'm not 
completely sure if my symptoms match, though the pstacks do look very similar 
to my untrained eye:


Here is a two-day-old child:

27743:  /usr/local/apache2/bin/httpd -k start
-----------------  lwp# 1 / thread# 1  --------------------
 ff00a42c lwp_wait (3, ffbff804)
 ff001e88 _thrp_join (3, 0, ffbff86c, 1, ff0b2780, ffbff804) + 38
 ff214544 apr_thread_join (ffbff8ec, 32eea8, 7, 0, dc328, b15e0) + c
 0008c43c join_workers (0, fe3aa8, 8bfcc, 32ec30, 0, 1) + ec
 0008c790 child_main (2, 8b31c, 0, feee2a40, ff0b2840, ff0b2780) + 270
 0008c970 make_child (c7800, 2, 0, c8800, c7000, c8400) + 128
 0008d1b4 ap_mpm_run (fe4100f8, e, 0, 1, 27, 1) + 754
 000343c0 main     (d6218, d8190, ffbffc54, c7800, c7800, 0) + 79c
 00033754 _start   (0, 0, 0, 0, 0, 0) + 5c
-----------------  lwp# 3 / thread# 3  --------------------
 ff0058d4 lwp_park (0, 0, 0)
 fefff6e8 cond_wait_queue (32ecc8, 32ec98, 0, 0, 0, 0) + 4c
 fefffd30 cond_wait (32ecc8, 32ec98, 0, 0, fe460a40, 0) + 10
 fefffd6c pthread_cond_wait (32ecc8, 32ec98, 0, 0, 32ec98, 0) + 8
 0008e674 ap_queue_pop (32ec78, fe30bf1c, fe30bf18, 4, 0, 32ee40) + 64
 0008be1c worker_thread (32eea8, 2, fe460a40, c8400, c8400, 0) + 10c
 ff21440c dummy_worker (32eea8, 0, 0, fe460a40, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 4 / thread# 4  --------------------
 ff0058d4 lwp_park (0, 0, 0)
 fefff6e8 cond_wait_queue (32ecc8, 32ec98, 0, 0, 0, 0) + 4c
 fefffd30 cond_wait (32ecc8, 32ec98, 0, 0, fe461240, 11692d8) + 10
 fefffd6c pthread_cond_wait (32ecc8, 32ec98, 0, 0, 32ec98, 0) + 8
 0008e674 ap_queue_pop (32ec78, fe20bf1c, fe20bf18, 0, 0, 32ee40) + 64
 0008be1c worker_thread (32eec8, 2, fe461240, c8400, c8400, 4) + 10c
 ff21440c dummy_worker (32eec8, 0, 0, fe461240, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)

...and several more in lwp_park.



And here's another one that's a day old, but looks different (including lots of 
jk references):

7934:   /usr/local/apache2/bin/httpd -k start
-----------------  lwp# 1 / thread# 1  --------------------
 ff00a42c lwp_wait (6, ffbff80c)
 ff001e88 _thrp_join (6, 0, ffbff874, 1, ff0b2780, ffbff80c) + 38
 ff214544 apr_thread_join (ffbff8f4, 28e228, 2, 0, 1, b1600) + c
 0008c43c join_workers (c, 3c5f38, 8bfcc, 28df50, 0, 1) + ec
 0008c790 child_main (0, 8b31c, 0, feee2a40, ff0b2840, ff0b2780) + 270
 0008c970 make_child (c7800, 0, 0, c8800, c7000, c8400) + 128
 0008d1b4 ap_mpm_run (fe4100f8, e, 0, 1, 26, 1) + 754
 000343c0 main     (d6218, d8190, ffbffc5c, c7800, c7800, 0) + 79c
 00033754 _start   (0, 0, 0, 0, 0, 0) + 5c
-----------------  lwp# 6 / thread# 6  --------------------
 ff00a14c read     (15, fe00a908, 4)
 fe4a87dc jk_tcp_socket_recvfull (15, fe00a908, 4, 2e4bf8, 510, 4ec) + 74
 fe4c3088 ajp_connection_tcp_get_message (35f130, 35f168, 2e4bf8, 361188, 2000, 2064) + 44
 fe4c5588 ajp_get_reply (361168, fe00bb50, 2e4bf8, 35f130, fe00aa70, 2028) + 9c
 fe4c9304 ajp_service (361168, fe00bb50, 2e4bf8, fe00ab38, 1, c00) + 22b8
 fe4a1234 jk_handler (23c, 35e740, 3f4390, 1, 13, 3544c8) + 9e4
 00047534 ap_run_handler (3f40a0, 0, 11, 3e7028, 3f5a08, 0) + 3c
 000479c0 ap_invoke_handler (3f40a0, 9d000, 3f40a0, 0, fe410028, 0) + c0
 00073aa4 ap_process_request (3f40a0, 3, 4, 3f40a0, c8420, 21d8d8) + 160
 00070b34 ap_process_http_connection (3d52e8, 3d5038, 3d5038, 3, c8420, 211980) + 10c
 0004dce8 ap_run_process_connection (3d52e8, 3d5038, 3d5038, 3, 3d52e0, 3d7068) + 3c
 0008bf1c worker_thread (28e228, 0, fe462240, c8400, c8400, c) + 20c
 ff21440c dummy_worker (28e228, 0, 0, fe462240, ff214400, 1) + c
 ff005850 _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 7 / thread# 7  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 8 / thread# 8  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 9 / thread# 9  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 10 / thread# 10  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 11 / thread# 11  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 12 / thread# 12  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 13 / thread# 13  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **
-----------------  lwp# 14 / thread# 14  --------------------
 ff214400 dummy_worker(), exit value = 0x00000000
        ** zombie (exited, not detached, not yet joined) **

...and so on...
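
For anyone comparing their own pstack output against the two dumps above, here is a minimal Python sketch that tallies what each thread is doing. It is only an illustration: the function names (`classify`, `summarize`) and the labels are my own, and the matching is keyed to the frame names that appear in these particular dumps (`ap_queue_pop`, `jk_tcp_socket_recvfull`, `apr_thread_join`, and the "zombie" marker), so it may need adjusting for other builds.

```python
import re
from collections import Counter

# Header line that pstack prints before each thread's stack,
# e.g. "-----------------  lwp# 6 / thread# 6  --------------------"
LWP_HEADER = re.compile(r"^-+\s+lwp#\s*(\d+)\s*/\s*thread#\s*(\d+)\s+-+$")


def classify(frames):
    """Return a coarse, hypothetical label for one thread's stack lines."""
    text = " ".join(frames)
    if "zombie" in text:
        return "zombie"                      # exited but not yet joined
    if "jk_tcp_socket_recvfull" in text:
        return "blocked in mod_jk read"      # waiting on the AJP backend
    if "lwp_park" in text and "ap_queue_pop" in text:
        return "idle worker"                 # parked, waiting for a connection
    if "apr_thread_join" in text:
        return "joining workers"             # main thread waiting for workers
    return "other"


def summarize(pstack_text):
    """Count threads per label in the text of a single pstack dump."""
    counts = Counter()
    frames = None  # None until the first lwp header has been seen
    for line in pstack_text.splitlines():
        stripped = line.strip()
        if LWP_HEADER.match(stripped):
            if frames is not None:
                counts[classify(frames)] += 1
            frames = []
        elif frames is not None and stripped:
            frames.append(stripped)
    if frames is not None:
        counts[classify(frames)] += 1
    return counts
```

Calling `summarize()` on the saved text of a dump returns a `Counter` keyed by label. A child whose main thread is joining workers while one worker sits in a mod_jk read and the rest are zombies would match the second dump above.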



If anyone has the time to confirm that my case is a match, I'd be very 
grateful. Either way, this patch looks promising!

Thank you VERY MUCH!


-Chris





-----Original Message-----
From: Rainer Jung [mailto:rainer.j...@kippdata.de] 
Sent: Saturday, June 22, 2013 12:31 PM
To: users@tomcat.apache.org
Subject: Re: Abandoned apache children with mod_jk

On 21.06.2013 19:47, Chris Boyce wrote:
> Hello,
> 
> I'm running apache 2.2.24 (worker MPM) with mod_jk 1.2.37 under Solaris 11, 
> compiled as follows (from config.log):
> 
> --with-included-apr --with-mpm=worker --enable-so --enable-rewrite 
> --enable-headers --enable-proxy --enable-proxy-http --enable-expires 
> --enable-nonportable-atomics=yes --disable-include --disable-autoindex 
> --disable-imap --disable-userdir CC=/usr/sfw/bin/gcc
> 
> We are running Tomcat 7.0.32.
> 
> Since moving to Solaris 11 I'm noticing over time that apache children are 
> getting left in an idle state (and usually not showing up on the scoreboard 
> at all) when doing graceful restarts.  If I do a hard restart, the error_log 
> notes that the process had to be forcibly killed:
> 
> [Wed May 15 11:41:24 2013] [warn] child process 10057 still did not exit, sending a SIGTERM
> [Wed May 15 11:41:26 2013] [error] child process 10057 still did not exit, sending a SIGKILL
> 
> If I let apache go unchecked, it will eventually stop passing traffic 
> completely and a hard restart is required.  Example ps output looks like this:
> 
> nobody 24429 20925   0 11:43:59 ?           0:02 /usr/local/apache2/bin/httpd -k start
> nobody  9750 20925   0 23:59:02 ?           0:00 /usr/local/apache2/bin/httpd -k start
> nobody 20925  2440   0   May 15 ?           3:07 /usr/local/apache2/bin/httpd -k start
> nobody 24689 20925   0 11:47:52 ?           0:00 /usr/local/apache2/bin/httpd -k start
> nobody 24628 20925   0 11:46:18 ?           0:01 /usr/local/apache2/bin/httpd -k start
> nobody 24428 20925   0 11:43:39 ?           0:02 /usr/local/apache2/bin/httpd -k start
> 
> Note PID 9750 is lingering, doing nothing according to pfiles and truss, and 
> its timestamp coincides with the last graceful restart (log rotation).  Two 
> main differences between this web server and ones that are working include:
> 
> a) This is Solaris 11 (vs. Solaris 10)
> b) I have hardened apache by putting it in a Solaris 11 zone, and I'm 
> starting apache as the "nobody" user with the net_privaddr privilege so it 
> can function as the parent process.  It talks to Tomcat on another zone and 
> everything works great (other than the problem described here).
> 
> Apache has permission to write to /logs, and /logs/apache2 is where I set 
> these:
> 
> JkLogFile /logs/apache2/mod_jk.log
> JkShmFile /logs/apache2/jk-runtime-status
> 
> And this:
> PidFile /logs/apache2/run/httpd.pid
> 
> 
> Can anyone think of a reason why children are not being recycled or getting 
> stranded like this over successive graceful restarts?  We do use multiple 
> listeners, so I don't know if I'm dealing with a locking/mutex/serialization 
> type of issue.  I'm not a C programmer.  There seems to be little info out 
> there for Solaris platforms that's recent.  
> 
> I'd be happy to post more info if needed.  I appreciate your time.

What does "pstack" show for such an abandoned child?

Maybe another occurrence of
https://issues.apache.org/bugzilla/show_bug.cgi?id=49504.

Regards,

Rainer


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org

Reply via email to