Re: mod_jk Problems - - worker went to error state and dont recover

Rainer Jung Wed, 20 Feb 2008 06:57:24 -0800

[EMAIL PROTECTED] wrote:

See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on behalf of 
a User


Hello all, after long unsuccessful research I hope someone can
give me a hint on the following problems.

Our Apache-mod_jk-Tomcat infrastructure was running without problems
for about one year; then, two months ago, mod_jk errors started to occur.
We upgraded the mod_jk version and made improvements in the
worker.properties - the problems changed and became less frequent, but
they still appear sometimes.

It seems that the mod_jk workers lose the connection to their
Tomcat backend servers - there are messages in the mod_jk log files which
point in this direction. Normally this seems not to be a big problem -
but under certain conditions (which?) the worker goes into an error state
and cannot recover by itself - this must be done manually.

Problem 1: The Tomcats are reachable - it is unknown why the workers think the
server is dead.
Problem 2: I have no idea why the worker goes into an error state and cannot
recover.


2 is a consequence of 1

Problem 3: I am missing explanations of the logged messages - I read the messages but
cannot match them to the situation - when does a worker post these messages?


1 is a consequence of these messages

[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
[Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.


The second line tells us that your configured reply_timeout fired.

You set it to 120000 (2 minutes), so there are requests taking longer than 2 minutes on the backend before the first response packet comes back from the backend.

With your configuration mod_jk then doesn't wait any longer for the reply *and puts the backend into error mode*.

Up until version 1.2.25, if you use a reply_timeout, you need to set it to a high number which justifies the reasoning "if it takes that long, then something is wrong with the backend".

Reality shows: there is no such number. Often there are a few requests that take unacceptably long on the backend *although* the backend is still working.

So in 1.2.25 we added max_reply_timeouts. With this set in addition to reply_timeout, mod_jk will abort waiting for a reply after reply_timeout, but allow some timeouts before actually deciding to put the backend into error.

Unfortunately the implementation of max_reply_timeouts in 1.2.25 was wrong, so you need to go to 1.2.26 to get it working right.


See:

http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
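
To make the difference between reply_timeout alone and reply_timeout plus max_reply_timeouts concrete, here is a deliberately simplified toy model in Python. It is not mod_jk's code. The class and method names are invented for illustration, and the halving of the counter during periodic maintenance is an assumption based on how the workers.properties reference describes the feature; please check the documentation of your version.

```python
class BalancerMember:
    """Toy model of one balanced worker; NOT mod_jk's real implementation."""

    def __init__(self, max_reply_timeouts):
        self.max_reply_timeouts = max_reply_timeouts
        self.timeouts = 0        # reply timeouts counted so far
        self.in_error = False

    def on_reply_timeout(self):
        # The request that timed out fails either way. The only question is
        # whether the whole member is additionally put into error state.
        self.timeouts += 1
        if self.timeouts > self.max_reply_timeouts:
            self.in_error = True

    def maintain(self):
        # Assumed decay step, run periodically by the balancer: halve the
        # counter so that old timeouts are gradually forgotten.
        self.timeouts //= 2


strict = BalancerMember(max_reply_timeouts=0)    # behaves like reply_timeout alone
tolerant = BalancerMember(max_reply_timeouts=6)  # value from the posted config
for member in (strict, tolerant):
    member.on_reply_timeout()
print(strict.in_error, tolerant.in_error)  # prints: True False
```

In this model a single slow request is enough to put the strict member into error, while the tolerant member only goes into error once more than six timeouts have accumulated between decay steps.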

Caution: this does *not* explain why the backends are not automatically recovered after a minute of error condition. Maybe you have times where you get too many of those reply_timeouts (see log file), and although we recover after a minute, the backend almost immediately goes back into error status.
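
One way to check for such bursts is to count the reply timeout lines per worker and per minute. The following is a minimal sketch that assumes the log lines look exactly like the ones quoted above; the function name and the regular expression are made up for this example and may need adjusting for other mod_jk versions or log formats.

```python
import re
from collections import Counter

# Matches the "Timeout with waiting reply" line quoted above and captures
# the timestamp up to the minute, the year, and the worker name in parentheses.
TIMEOUT_LINE = re.compile(
    r"^\[\w{3} (\w{3} +\d+ \d\d:\d\d):\d\d\.\d+ (\d{4})\] "
    r".*ajp_get_reply::.*\((\w+)\) Timeout with waiting reply"
)


def count_reply_timeouts(lines):
    """Return a Counter keyed by (worker, 'Mon DD HH:MM YYYY')."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_LINE.search(line)
        if match:
            minute, year, worker = match.groups()
            counts[(worker, f"{minute} {year}")] += 1
    return counts


SAMPLE = [
    "[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] "
    "jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi",
    "[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] "
    "ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting "
    "reply from tomcat. Tomcat is down, stopped or network problems (errno=110)",
]

for (worker, minute), n in sorted(count_reply_timeouts(SAMPLE).items()):
    print(worker, minute, n)
```

To run it against a real file, pass an open file object instead of SAMPLE. Many timeouts for the same worker within a minute or two would fit the picture of a backend that falls back into error right after recovery.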

-> Which timeout - how does mod_jk conclude that Tomcat is down? Where can I find
details on errno=110 ?...


reply_timeout, see above and also

http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html

errno: a standard Unix feature. The numbers are platform dependent. I would assume in your case


ETIMEDOUT       110     /* Connection timed out */

so no wonder, that's exactly what we expect (and it doesn't tell us the reason, i.e. what's wrong on the *backend* taking that long for a response).
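
If you want to look up an errno value yourself, a quick way that does not require reading system headers is Python's standard errno and os modules. This is only a small sketch; the helper name is made up, and the result depends on the platform where you run it, so run it on the machine that wrote the log (here the Linux web server).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


# On a typical Linux system 110 should map to ETIMEDOUT; other platforms
# use different numbers, so the output may differ elsewhere.
print(describe_errno(110))
```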

-> receiving reply from tomcat failed with out recovery in send loop attempt=0  
- ? with out recovery in send loop - means?

It means that your configuration doesn't allow us to send the request to another backend. recovery_options 7 includes: if mod_jk was able to send the request to a backend, do not try to send it to another backend in case of an error during the response handling. Even if you allowed sending to another backend, it would not help with *not* putting the worker into error state. More likely you would put all workers into error state, because all of them might run into the same timeout, one after the other.
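
recovery_options is a bit mask, so the value 7 is the sum of 1, 2 and 4. The sketch below decodes a value into its parts. The bit meanings are paraphrased from the workers.properties reference as best remembered, so treat them as an assumption and verify them against the documentation for your mod_jk version; the function name is made up.

```python
# Bit meanings paraphrased from the workers.properties reference
# (an assumption to verify; not every bit exists in every version).
RECOVERY_BITS = {
    1: "do not recover if Tomcat failed after receiving the request",
    2: "do not recover if Tomcat failed after sending headers to the client",
    4: "close the connection to Tomcat if writing the answer to the client fails",
    8: "always recover HEAD requests",
    16: "always recover GET requests",
}


def decode_recovery_options(value):
    """List the descriptions of all bits set in a recovery_options value."""
    return [text for bit, text in RECOVERY_BITS.items() if value & bit]


for line in decode_recovery_options(7):
    print("-", line)
```

With 7, bits 1, 2 and 4 are set, which matches the "with out recovery" wording in the log line.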

-> unrecoverable error 504 - details to this error ?


That's simply how we return the situation back to the client (browser).


Ok - I turned the logging level to debug - the course of events gets
clearer - but more questions appear as well - there are socket numbers -
which sockets - what are these numbers, e.g. "will be shutting down socket
35 for worker INETP1021" - what are the sockets good for? - how many are
there per worker? Can I configure them?

That should not be the problem here. For Apache httpd, if you do *not* configure anything, we automatically choose the number of httpd threads as the maximum number of connections. No need to change anything here.


=> Generally - how can I solve such problems? I tried to look into the
mod_jk code - searching for error codes and error messages - but could not
find any relevant information. I am studying the log files - but
cannot find out what really happens.


Post to the list. Improve our docs.

The error message contains the words "timeout" and "reply", and you have a "reply_timeout".

Long running requests are a frequent problem. If you want to get rid of them, start by adding response times to your httpd and your Tomcat access log format (%D). Then have a look at which URLs are producing long running requests, during what time of day they are happening etc. This might give you a clue about the reasons.
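
Once %D is in the access log, a small script can pull out the slow URLs. The sketch below assumes an httpd log format of common log format with %D appended as the last field, where httpd reports %D in microseconds. The function name, the threshold and the sample line are invented for illustration. For a Tomcat access log, %D is in milliseconds as far as I know, so pass units_per_second=1000 there.

```python
import re
from collections import defaultdict

# Assumed log format: %h %l %u %t "%r" %>s %b %D   (duration is the last field)
ACCESS_LINE = re.compile(r'"\S+ (\S+)[^"]*" \d{3} \S+ (\d+)$')


def slow_requests(lines, threshold_seconds=10.0, units_per_second=1_000_000):
    """Return {path: (slow_count, slowest_seconds)} for requests over the threshold."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = ACCESS_LINE.search(line.strip())
        if not match:
            continue
        path = match.group(1).split("?")[0]  # group by path, ignore query string
        seconds = int(match.group(2)) / units_per_second
        if seconds >= threshold_seconds:
            entry = stats[path]
            entry[0] += 1
            entry[1] = max(entry[1], seconds)
    return {path: (count, slowest) for path, (count, slowest) in stats.items()}


# Hypothetical sample lines; the URLs and numbers are made up.
SAMPLE = [
    '10.0.0.1 - - [20/Feb/2008:10:02:39 +0100] "GET /swl/report?id=1 HTTP/1.1" 200 5120 125000000',
    '10.0.0.2 - - [20/Feb/2008:10:02:40 +0100] "GET /swl/index.jsp HTTP/1.1" 200 912 35000',
]
print(slow_requests(SAMPLE))
```

Sorting the result by count or by slowest time, and adding the hour of day to the key, are easy extensions once the basic numbers are available.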

And if they are very frequent: do Java thread dumps of your backends and analyze them.

So - maybe someone has an idea why the worker thinks that the
corresponding Tomcat is dead, and why it will not recover by itself.

Tomcat is dead: from the point of view of mod_jk it simply means: we didn't get an answer when we expected one. Details depend on the additional log lines (could not connect, reply timeout etc.).

And I am also looking for tips on how I can help myself - and where to
find something about the error codes, messages, ... in mod_jk

thanks for your attention
Best
ahmed musa (writing from vienna)


Regards,

Rainer

Current infrastructure
We have 3 Apache web servers (2.2.6), based on CentOS release 4.3 / kernel version
2.6.9-34.
In front of the web servers there are two hardware load balancers (at two locations), but
they play no role in this story.
The web servers are hosted at our ISP.

The web servers balance the requests via mod_jk (version 1.2.25) for
approx. 10 webapps to 18 backend Tomcat servers (blade servers - because of
underlying application parts the OS is Windows 2003 Server - a long
story not worth explaining :-) ). The Tomcat servers get their data via
requests against DB2 servers/DB2 databases on the mainframe. The
Tomcat servers are in-house and are rebooted nightly because of automated
deployment processes.

Between the web servers and the Tomcat servers is a Checkpoint firewall.
All webapps are deployed on all Tomcats - only mod_jk routes the
requests to certain Tomcat instances
(on one blade server there are two identical Tomcat instances
running).

Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
the public website(s) are normal short-lived requests - not many. Most
webapps (portals) need a login and have a strong focus on business
logic - so the instances are big (many MBs in RAM), the sessions are
sticky and the session timeout is 20 minutes. But there are also fewer
requests. In addition to the user requests there are monitoring requests from our ISP.
The problems appear at servers/portals which have very few user requests.

worker.properties
worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus

worker.template.type=ajp13
worker.template.lbfactor=5
worker.template.socket_keepalive=1
worker.template.connect_timeout=7000
worker.template.prepost_timeout=5000
worker.template.reply_timeout=120000
worker.template.retries=6
worker.template.activation=Active
worker.template.recovery_options=7

worker.lbtemplate.type=lb
worker.lbtemplate.max_reply_timeouts=6
worker.lbtemplate.method=Session

#Produktions Worker
# AS-INETP101 - 106 - 6/6 GGI
worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
worker.INETP1011.port=65001
worker.INETP1011.reference=worker.template

....many more of the same

then

worker.ajp_ad.reference=worker.lbtemplate
worker.ajp_ad.balance_workers=INETP1032,INETP1062

.... many more portals

and last, jkstatus

The JKMount is very simple
JkMount /* ajp_ad    --- for the other portals mostly the same

The Portals are Virtual Hosts on the Apache.

Tomcat - server.xml
example
<Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
    <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
......
<Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
autoDeploy="false" deployOnStartup="false" xmlValidation="false"
xmlNamespaceAware="false">
        <Alias>www.slfinsol.com</Alias>
        <Alias>web1.slfinsol.com</Alias>
        ...
        <Alias>testweb.slfinsol.com</Alias>
        .....
        <Valve className="org.apache.catalina.valves.AccessLogValve"
directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
resolveHosts="false" />
        <Valve
className="at.allianz.tomcat.valve.RequestTimeValve"/>
        <Valve
className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
        <Context path="" docBase="swl" />
        <Context path="/monitor5" docBase="monitor" />
        <Context path="/swl" docBase="swl" />

</Host>


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
