Thank you for the help. You spend a great deal of time helping folks on here (as evidenced by the absurd number of times you have answered questions like this :))...and it is appreciated.
I can't believe I missed the millisecond thing. :) I'll check the log and the dump for the individual problems and post to this thread with the results (or more questions).

LES

Rainer Jung-3 wrote:
>
> On 24.05.2010 23:36, LES wrote:
>>
>> I am having some trouble keeping a mod_jk setup stable. At this point, I
>> feel like I am too far into trial-and-error mode and would like some help
>> figuring out how to identify the problem.
>>
>> My current setup involves two Linux (RHEL 5) servers, each running two
>> tomcat instances (6.0.20). A third RHEL 5 box is running apache (2.2.3)
>> with mod_jk (1.2.28). I am using terracotta to "cluster" the tomcat
>> sessions.
>>
>> The problem that I am having is that under small load (and unfortunately,
>> intermittently), I get random nodes that produce errors. Typically these
>> errors indicate that mod_jk can no longer contact tomcat (see excerpts
>> below). In most cases, the user request just hangs (never returns).
>> So, it also appears that the errors are not causing a session failover,
>> though I need to confirm that again after my recent round of changes. In
>> most cases, the nodes that are in error recover on their own. However,
>> during the failure event, I get a bunch of unhappy users. I am hoping to
>> find a way to make the nodes more stable and then address the failover
>> aspect.
>>
>> I have tried different mod_jk parameters and think I have settled on a
>> decent set of them. I have all of the garbage collection information
>> being logged and do not seem to have any GC events that take longer
>> than the request timeout. I am gathering JVM and OS stats and do not see
>> a hardware constraint (memory, CPU, I/O). So, I am at a bit of a loss on
>> where to look.
>>
>> I am pasting in all of the relevant files/excerpts that I can think of. I
>> appreciate any advice on what additional data to gather to shed light on
>> this problem (outright solutions are welcome too :)).
>>
>> Please let me know if there is any other information that would be
>> helpful.
>>
>> Thanx,
>> LES
>>
>>
>> ************* workers.properties **************
>> # Define 1 real worker using ajp13
>> worker.list=lb,jkstatus,cas
>> # Set properties for worker1 (ajp13)
>> worker.template.type=ajp13
>> worker.template.retries=4
>> worker.template.lbfactor=1
>> worker.template.reply_timeout=300000
>> worker.template.max_reply_timeouts=4
>> worker.template.connection_pool_timeout=60
>> worker.template.ping_mode=A
>> #worker.template.socket_timeout=10
>
> This is in milliseconds, I guess you want 10000:
>
>> worker.template.socket_connect_timeout=10
>>
>> worker.tomcat01-instance1.reference=worker.template
>> worker.tomcat01-instance1.host=tomcat01.barnhardt.local
>> worker.tomcat01-instance1.port=8009
>>
>> worker.tomcat01-instance2.reference=worker.template
>> worker.tomcat01-instance2.host=tomcat01.barnhardt.local
>> worker.tomcat01-instance2.port=18009
>>
>> worker.tomcat02-instance1.reference=worker.template
>> worker.tomcat02-instance1.host=tomcat02.barnhardt.local
>> worker.tomcat02-instance1.port=8009
>>
>> worker.tomcat02-instance2.reference=worker.template
>> worker.tomcat02-instance2.host=tomcat02.barnhardt.local
>> worker.tomcat02-instance2.port=18009
>>
>> worker.cas.type=ajp13
>> worker.cas.host=localhost
>> worker.cas.port=8009
>> worker.cas.lbfactor=1
>> worker.cas.connection_pool_timeout=600
>> worker.cas.socket_keepalive=1
>
> I don't like the raw socket_timeout, but well ...
>
>> worker.cas.socket_timeout=60
>>
>> # Set properties for lb which use the other workers
>> worker.lb.type=lb
>> #worker.lb.method=B
>> worker.lb.sticky_session=True
>> worker.lb.balance_workers=tomcat01-instance1,tomcat01-instance2,tomcat02-instance1,tomcat02-instance2
>>
>> # Define a 'jkstatus' worker using status
>> worker.jkstatus.type=status
>> ***********************************************
>>
>>
>> ****** Errors from log *******
>>
>> //////This particular error(info) seems to happen constantly - is it a
>> normal operational thing?
>
> Yes, it is not an error, it is an "info" message. It simply says that
> all connections from your apache process to tomcat were closed and a
> fresh one had to be opened.
>
>> [Mon May 24 10:22:56 2010] [26131:4045374208] [info]
>> ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance2) all
>> endpoints are disconnected, detected by connect check (1), cping (0),
>> send (0)
>> [Mon May 24 11:55:21 2010] [2711:4045374208] [info]
>> ajp_send_request::jk_ajp_common.c (1496): (tomcat02-instance1) all
>> endpoints are disconnected, detected by connect check (1), cping (0),
>> send (0)
>> [Mon May 24 13:08:25 2010] [27439:4045374208] [info]
>> ajp_send_request::jk_ajp_common.c (1496): (tomcat01-instance1) all
>> endpoints are disconnected, detected by connect check (1), cping (0),
>> send (0)
>
> So I'd say something gets stuck in your tomcat (likely: your webapp)
> and mod_jk detects that by use of the reply timeout.
> Since you have a 5 minute reply timeout, chances are good that you can
> find those requests and the cause of their hanging or excessively long
> response time by use of
>
> - a tomcat access log with an improved pattern containing "%D" and, if
>   your Tomcat is recent enough, also "%I"
> - and regular thread dumps
>
>> ////This error happens intermittently and seems to cause some of the
>> cluster problems I mentioned above
>> [Mon May 24 07:19:21 2010] [27432:4045374208] [error]
>> ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with
>> waiting reply from tomcat. Tomcat is down, stopped or network problems
>> (errno=110)
>> [Mon May 24 07:19:23 2010] [27432:4045374208] [info]
>> ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request
>> to tomcat failed (recoverable), because of reply timeout (attempt=1)
>> [Mon May 24 07:24:23 2010] [27432:4045374208] [error]
>> ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with
>> waiting reply from tomcat. Tomcat is down, stopped or network problems
>> (errno=110)
>> [Mon May 24 07:24:25 2010] [27432:4045374208] [info]
>> ajp_service::jk_ajp_common.c (2447): (tomcat01-instance2) sending request
>> to tomcat failed (recoverable), because of reply timeout (attempt=2)
>
> I guess the next one is due to the socket_connect_timeout set to 10
> milliseconds instead of 10 seconds:
>
>> ////I get this error occasionally, too
>> [Sun May 23 03:48:51 2010] [15814:4045374208] [info]
>> jk_open_socket::jk_connect.c (594): connect to 192.168.60.157:8009 failed
>> (errno=115)
>> [Sun May 23 03:48:51 2010] [15814:4045374208] [info]
>> ajp_connect_to_endpoint::jk_ajp_common.c (922): Failed opening socket to
>> (192.168.60.157:8009) (errno=115)
>> [Sun May 23 03:48:51 2010] [15814:4045374208] [error]
>> ajp_send_request::jk_ajp_common.c (1507): (tomcat02-instance1) connecting
>> to backend failed.
>> Tomcat is probably not started or is listening on the wrong port
>> (errno=115)
>
> Error number 104 (errno=104) is "Connection reset by peer" on RHEL 5:
>
>> ////Third time is a charm...another error for the hat trick
>> [Sat May 22 21:41:17 2010] [13933:4045374208] [info]
>> ajp_connection_tcp_get_message::jk_ajp_common.c (1150):
>> (tomcat01-instance1) can't receive the response header message from
>> tomcat, network problems or tomcat (192.168.60.156:8009) is down
>> (errno=104)
>
> Regards,
>
> Rainer
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org
>
>

--
View this message in context: http://old.nabble.com/mod_jk-stability-issues-tp28662097p28667920.html
Sent from the Tomcat - User mailing list archive at Nabble.com.
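
For the log check mentioned at the top of this message, a rough Python sketch like the one below might help tally mod_jk log entries by level, function, worker and errno, so the intermittent failures can be lined up against the thread dumps. It assumes each entry sits on a single line in the actual mod_jk log file (the excerpts above are wrapped by the mail client) and that the format matches the quoted lines. The names `tally`, `report`, `SAMPLE` and `LINUX_ERRNO` are made up for this illustration, and the errno names are the Linux values only.

```python
import re
from collections import Counter

# One mod_jk log entry, in the format of the excerpts quoted above:
# [timestamp] [pid:tid] [level] function::file.c (line): message
ENTRY = re.compile(
    r"^\[[^\]]+\] \[\d+:\d+\] \[(?P<level>\w+)\] "
    r"(?P<func>\w+)::[\w.]+ \(\d+\): (?P<msg>.*)$"
)
# Worker name in parentheses at the start of the message, if present.
WORKER = re.compile(r"^\((?P<worker>[\w.-]+)\) ")
# Trailing "(errno=NNN)" marker, if present.
ERRNO = re.compile(r"\(errno=(?P<errno>\d+)\)")

# Linux errno values seen in the excerpts (assumption: Linux numbering).
LINUX_ERRNO = {104: "ECONNRESET", 110: "ETIMEDOUT", 115: "EINPROGRESS"}


def tally(lines):
    """Count entries per (level, function, worker, errno); skip other lines."""
    counts = Counter()
    for raw in lines:
        entry = ENTRY.match(raw.strip())
        if entry is None:
            continue
        msg = entry.group("msg")
        worker = WORKER.match(msg)
        err = ERRNO.search(msg)
        counts[(
            entry.group("level"),
            entry.group("func"),
            worker.group("worker") if worker else None,
            int(err.group("errno")) if err else None,
        )] += 1
    return counts


def report(counts):
    """Format the tally as text rows, most frequent first."""
    rows = []
    for (level, func, worker, err), n in counts.most_common():
        if err is None:
            err_text = "no errno"
        else:
            err_text = f"errno={err} ({LINUX_ERRNO.get(err, 'unknown')})"
        rows.append(f"{n:6d}  {level:<5}  {worker or '-':<20}  {func}  {err_text}")
    return rows


# Entries copied from the excerpts above, joined back onto one line each.
SAMPLE = [
    "[Mon May 24 07:19:21 2010] [27432:4045374208] [error] ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)",
    "[Mon May 24 07:24:23 2010] [27432:4045374208] [error] ajp_get_reply::jk_ajp_common.c (1926): (tomcat01-instance2) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)",
    "[Sun May 23 03:48:51 2010] [15814:4045374208] [info] jk_open_socket::jk_connect.c (594): connect to 192.168.60.157:8009 failed (errno=115)",
]

for row in report(tally(SAMPLE)):
    print(row)
```

To run it against a real file, something like `tally(open(path))` should work, where `path` is wherever the mod_jk log lives on the apache box. Grouping by worker first is a deliberate choice here: it should show quickly whether the reply timeouts cluster on one tomcat instance or are spread across all four.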