See the footer of any mail on the list:

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


[EMAIL PROTECTED] wrote:
All

Apologies, this is unrelated. How do I unsubscribe from this mailing list? I thought it would be useful and small, but it's overwhelming my inbox.

Thanks in advance.

Luke Walshe
BT Operate, HGIPCC Technical Specialist
Telephone: +44 (0)1314483482, Email: [EMAIL PROTECTED]
-----Original Message-----
From: Ahmed Musa [mailto:[EMAIL PROTECTED]]
Sent: 21 February 2008 09:25
To: Tomcat Users List
Subject: Re: mod_jk Problems - - worker went to error state and dont recover

Hello Rainer,
Thanks for your information - the situation is getting clearer now.
I will read the docs again - following your links - and will make further tests, also with the improved logging.
Thanks a lot for your time.
With best regards,
ahmed

-------- Original Message --------
Date: Wed, 20 Feb 2008 18:59:01 +0100
From: Rainer Jung <[EMAIL PROTECTED]>
To: Tomcat Users List <users@tomcat.apache.org>
Subject: Re: mod_jk Problems - - worker went to error state and dont recover

Ahmed Musa wrote:
Hello,
Wow - thank you very much, Rainer, for your very quick and informative answer.
I will go to 1.2.26 and think about some "smoother" values for reply_timeout and max_reply_timeouts.
I will search for the requests that cause the problems - because I still log the response time in the way you mentioned - but I am not sure that the user requests are responsible for the situation.

One note: for Apache httpd 2.x %D is microseconds (there is no format for milliseconds), for Tomcat %D is milliseconds. As long as you are searching for the root cause, it might make sense to keep both access logs active to check for duration differences.
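For the httpd side, a minimal sketch (the format name and log path are only illustrative):

  # httpd.conf: %D logs the request duration in microseconds
  LogFormat "%h %l %u %t \"%r\" %>s %b %D" combined_timing
  CustomLog logs/access_timing_log combined_timing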

So one further question - does mod_jk itself check whether the backend is reachable, without user requests?

No. Everything only works on top of user requests.

When there are connections to the backend - are they closed after the response, or are they held open for further requests?

In general they are held open. There are parameters for how long they are held open without further requests before they get shut down, and also for how many might be kept open even when no requests are coming in. Those are the connection pool parameters, which you will find on

http://tomcat.apache.org/connectors-doc/reference/workers.html

Tomcat also has a connectionTimeout on the connector, which will shut down a connection from the Tomcat side if it is idle for too long.

If you don't want to reuse connections at all, there's also a setting (a JkOption in Apache).
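A minimal sketch of these settings, assuming the template worker from your workers.properties (the numbers are only examples, not recommendations):

  # workers.properties: close pooled connections idle for more than 600 seconds,
  # but keep at least 10 connections open even when idle
  worker.template.connection_pool_timeout=600
  worker.template.connection_pool_minsize=10

  # httpd.conf: do not reuse backend connections at all (usually not needed)
  JkOptions +DisableReuse

and on the Tomcat side (server.xml, value in milliseconds, roughly matching the pool timeout):

  <Connector port="65001" protocol="AJP/1.3" maxThreads="300"
             connectionTimeout="600000" />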

Is it possible that the Checkpoint firewall in between is responsible for the connectivity problem?

It can cut a connection that's idle for too long. Since you have cping/cpong active via connect_timeout and prepost_timeout, you should get a cping error message if the connection was dropped by the firewall during idle times and mod_jk tries to use it again. The reply timeout in the error log indicates that the backend isn't answering. Of course, if it takes *very* long to answer, it might be that the firewall dropped the connection in between, but then the root cause would still be the long response time of the backend.

Another point is the "not recovering" of the worker. Yes, you are
right
- in this situation i have many reply_timeouts - but these happens in
a
period of time - for example 30 minutes - but the worker is still dead
even
then when there are no more reply_timeouts. It remains dead.
It was necessary to restart it manually via jkstatus.
I assume you are using stickiness, so when a session started on a node, it will stay there. So when a worker is in error for a long time, all new sessions will start on other nodes. If the worker is ready for recovery, it needs a request that doesn't carry a session in order to get probed with that request.
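For reference, a minimal sketch of how that stickiness is wired up (using your existing lb template; the route name here is only an example and must match the balanced worker name):

  # workers.properties: the lb worker routes by the jvmRoute suffix in JSESSIONID
  worker.lbtemplate.sticky_session=True

  <!-- server.xml: jvmRoute should match the worker name used in balance_workers -->
  <Engine name="Catalina" jvmRoute="INETP1011" defaultHost="default">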

In jkstatus, the status of an error worker should switch to REC when mod_jk decides that it could send a non-sticky request there (to probe), then to PRB while this request is on the node, and finally either to OK or back to ERR depending on the result of the request.

You can log the number of errors (and accesses) that happened on the node in the httpd access log. If you think that the node simply stays in error for a long time, then the error count (and access count) should stay constant. I would expect that they do not.

Have a look at how LogFormat in Apache httpd works, and then add some of the variables documented in

http://tomcat.apache.org/connectors-doc/reference/apache.html

like:

JK_LB_LAST_NAME
JK_LB_LAST_ACCESSED
JK_LB_LAST_ERRORS
JK_LB_LAST_BUSY
JK_LB_LAST_STATE

using the syntax %{JK_LB_LAST_STATE}n etc.
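For example, a minimal sketch (the format name and log path are only illustrative):

  LogFormat "%h %t \"%r\" %>s %{JK_LB_LAST_NAME}n %{JK_LB_LAST_STATE}n %{JK_LB_LAST_ERRORS}n %{JK_LB_LAST_ACCESSED}n %{JK_LB_LAST_BUSY}n" jk_lb_status
  CustomLog logs/jk_lb_log jk_lb_status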

Another point is the learning curve - I read the docs - the info on the Apache website; I don't find any other ones - are there others? - and they don't go into depth. If you read the spec and watch the logs, it is - for me - very hard to match things up. Also, mod_jk has many possibilities to probe whether there is a connection to the backend - I understand them, but checking them against reality in an error situation is very hard. By matching I mean "which part of the communication sequence failed, why, and which error message does it cause".
But I will try - and also study the mailing list.

It's hard for us too (sometimes).

Thank you for your time - tomorrow we will have the new version and will see what happens.
Best,
ahmed

Regards,

Rainer

-------- Original Message --------
Date: Wed, 20 Feb 2008 15:56:42 +0100
From: Rainer Jung <[EMAIL PROTECTED]>
To: Tomcat Users List <users@tomcat.apache.org>
Subject: Re: mod_jk Problems - - worker went to error state and dont recover
[EMAIL PROTECTED] wrote:
See thread at: http://www.techienuggets.com/Detail?tx=25608
Posted on behalf of a user

Hello all, after long and unsuccessful research I hope someone can give me a hint about the following problems.

Our Apache/mod_jk/Tomcat infrastructure was running without problems for about one year - then, about two months ago, mod_jk errors started to occur.
We upgraded the mod_jk version and made improvements in the worker.properties - the problems changed and became less frequent, but sometimes they still appear.

It seems that the mod_jk workers lose the connection to their Tomcat backend servers - there are messages in the mod_jk log files which point in this direction. Normally this does not seem to be a big problem - but under certain conditions (which?) the worker goes to an error state and cannot recover by itself - this must be done manually.

Problem 1: The Tomcats are reachable - it is unknown why the workers think the server is dead.
Problem 2: I have no idea why the worker goes to an error state and cannot recover.

2 is a consequence of 1

Problem 3: I am missing explanations of the logged messages - I read the messages, but cannot match them to the situation - when does a worker emit these messages?

1 is a consequence of these messages

[Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi

[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)

[Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed without recovery in send loop attempt=0

[Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.

The second line tells us that your configured reply_timeout fired. You set it to 120000 (2 minutes), so there are requests taking longer than 2 minutes on the backend before the first response packet comes back from the backend.

With your configuration, mod_jk then doesn't wait any longer for the reply *and puts the backend into error mode*.

Up until version 1.2.25, if you use a reply_timeout, you need to set it to a high number which justifies the reasoning "if it takes that long, then something is wrong with the backend".

Reality shows: there is no such number. Often there are a few requests that take unacceptably long on the backend *although* the backend is still working.

So in 1.2.25 we added max_reply_timeouts. With this set in addition to reply_timeout, mod_jk will abort waiting for a reply after reply_timeout, but allow some timeouts before actually deciding to put the backend into error.

Unfortunately the implementation of max_reply_timeouts in 1.2.25 was wrong, so you need to go to 1.2.26 to get it working right.

See:

http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
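For reference, the pair looks like this in your workers.properties (reply_timeout applies to the ajp13 members, max_reply_timeouts to the lb worker):

  # give up waiting for a single reply after 2 minutes ...
  worker.template.reply_timeout=120000
  # ... but tolerate up to 6 reply timeouts before putting the member into error state
  worker.lbtemplate.max_reply_timeouts=6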

Caution: this does *not* explain why the backends are not automatically recovered after a minute in error condition. Maybe you have times where you get too many of those reply_timeouts (see log file), and although we recover after a minute, the backend almost immediately goes back into error status.

-> Which timeout - why does mod_jk think Tomcat is down? Where can I find details on errno=110?

reply_timeout, see above and also

http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html

errno: a standard Unix feature. The numbers are platform dependent. I would assume in your case

ETIMEDOUT       110     /* Connection timed out */

So no wonder - that's exactly what we expect (and it doesn't tell us the reason, i.e. what's wrong on the *backend* that it takes so long to respond).

-> receiving reply from tomcat failed without recovery in send loop attempt=0 - what does "without recovery in send loop" mean?

That your configuration doesn't allow us to send the request to another backend. recovery_options=7 includes: if mod_jk was able to send the request to a backend, do not try to send it to another backend in case of an error during the response handling. Even if you allowed sending to another backend, it would not help with *not* putting the worker into error state. More likely, you would put all workers into error state, because all of them might run into the same timeout, one after the other.

-> unrecoverable error 504 - details on this error?

That's simply how we return the situation back to the client (browser).
OK - I turned the logging level to debug - the course of events gets clearer - but more questions also appear. There are socket numbers - which sockets, and what are these numbers, e.g. "will be shutting down socket 35 for worker INETP1021"? What are the sockets good for? How many are there per worker? Can I configure them?

Should not be the problem here. For Apache httpd, if you do *not* configure anything, we automatically choose the number of httpd threads as the maximum number of connections. No need to change anything here.
=> Generally - how can I solve such problems? I tried to look into the mod_jk code, searching for error codes and error messages, but cannot find the relevant information. I am studying the log files, but don't find out what really happens.

Post to the list. Improve our docs.

The error message contains the words "timeout" and "reply", and you have a "reply_timeout".

Long running requests are a frequent problem. If you want to get rid of them, start by adding response times to your httpd and your Tomcat access log format (%D). Then have a look at which URLs are producing long running requests, during what time of day they are happening, etc. This might give you a clue about the reasons.
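On the Tomcat side, a minimal sketch based on your existing AccessLogValve (the prefix is only illustrative; %D is the processing time in milliseconds):

  <Valve className="org.apache.catalina.valves.AccessLogValve"
         directory="logs" prefix="swl_timing_log." suffix=".txt"
         pattern="%h %l %u %t &quot;%r&quot; %s %b %D"
         resolveHosts="false" />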

And if they are very frequent: do Java thread dumps of your backends and analyze them.

So - maybe someone has an idea why the worker thinks that the corresponding Tomcat is dead, and why it will not recover by itself!

"Tomcat is dead": from the point of view of mod_jk it simply means we didn't get an answer when we expected one. Details depend on the additional log lines (could not connect, reply timeout, etc.).

And I am also searching for tips on how I can help myself - and where to find something about the error codes, messages, etc. in mod_jk.

Thanks for your attention.
Best,
ahmed musa (writing from Vienna)

Regards,

Rainer

Current infrastructure:
We have 3 Apache web servers (2.2.6), based on CentOS release 4.3 / kernel version 2.6.9-34.
In front of the web servers there are two HW load balancers (two locations), but they have no role in this story.
The web servers are hosted at our ISP.
The web servers balance the requests via mod_jk (version 1.2.25) for approx. 10 webapps to 18 backend Tomcat servers (blade servers - because of underlying application parts the OS is Windows 2003 Server - a long story not worth explaining :-) ). The Tomcat servers get their data via requests against DB2 servers/DB2 databases on the mainframe. The Tomcat servers are in-house and are rebooted nightly because of automated deployment processes.

Between the web servers and the Tomcat servers there is a Checkpoint firewall.
All webapps are deployed on all Tomcats - only mod_jk manages the routing of requests to certain Tomcat instances.
(On one blade server there are two identical Tomcat instances running.)

Versions: Tomcat 5.5.17_11, JDK 1.5.0_11-b03. The requests against the public website(s) are normal short-lived requests - not many. Most webapps (portals) need a login and have a strong focus on business logic - so the instances are big (many MBs in RAM), the sessions are sticky and the session timeout is 20 minutes. But there are also fewer requests. On top of the user requests, monitoring requests from our ISP are added.
The problems appear on servers/portals with very few user requests.
worker.properties
worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus

worker.template.type=ajp13
worker.template.lbfactor=5
worker.template.socket_keepalive=1
worker.template.connect_timeout=7000
worker.template.prepost_timeout=5000
worker.template.reply_timeout=120000
worker.template.retries=6
worker.template.activation=Active
worker.template.recovery_options=7

worker.lbtemplate.type=lb
worker.lbtemplate.max_reply_timeouts=6
worker.lbtemplate.method=Session

# Production workers
# AS-INETP101 - 106 - 6/6 GGI
worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
worker.INETP1011.port=65001
worker.INETP1011.reference=worker.template

....many more of the same

then

worker.ajp_ad.reference=worker.lbtemplate
worker.ajp_ad.balance_workers=INETP1032,INETP1062

.... many more portals

and finally jkstatus

The JkMount is very simple:
JkMount /* ajp_ad    (for the other portals mostly the same)

The Portals are Virtual Hosts on the Apache.

Tomcat - server.xml example:

<Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
<Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
......
<Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
      autoDeploy="false" deployOnStartup="false" xmlValidation="false"
      xmlNamespaceAware="false">
    <Alias>www.slfinsol.com</Alias>
    <Alias>web1.slfinsol.com</Alias>
    ...
    <Alias>testweb.slfinsol.com</Alias>
    .....
    <Valve className="org.apache.catalina.valves.AccessLogValve"
           directory="logs" prefix="swl_access_log." suffix=".txt"
           pattern="common" resolveHosts="false" />
    <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
    <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
    <Context path="" docBase="swl" />
    <Context path="/monitor5" docBase="monitor" />
    <Context path="/swl" docBase="swl" />
</Host>
---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
