OK, we figured it out.  It's a case of too many timeout settings and not
having a real DevOps person on hand.  There was an obvious error message in
Apache's mod_jk.log that I failed to correlate with the problem because I
misread the timestamp on one of the many log entries.  Doh!

[info] ajp_connection_tcp_get_message::jk_ajp_common.c (1150): (node5)
can't receive the response header message from tomcat, network problems or
tomcat (10.xx.xx.xx:8009) is down (errno=11)
[error] ajp_get_reply::jk_ajp_common.c (1962): (node5) Tomcat is down or
refused connection. No response has been sent to the client (yet)
[info] ajp_service::jk_ajp_common.c (2447): (node5) sending request to
tomcat failed (recoverable),  (attempt=1)

There is, I now see, a socket_timeout and a socket_connect_timeout that do
not show up in the jkmanager status page, in addition to
connection_pool_timeout, connect_timeout,  prepost_timeout, and
reply_timeout which are listed by jkmanager.

We had socket_timeout set to 10 seconds and I didn't know it. Our
transactions only take longer than 10 seconds a few times a week, so that's
why we weren't seeing it that often.

We'll fix by setting:

    socket_timeout=90
    socket_connection_timeout=5000
    retry_options=25

unless someone has a better idea.

Thanks!
=Jeremy=


On Wed, Nov 16, 2011 at 5:29 PM, Jeremy <asfbugzi...@nuru.net> wrote:

> OK, I know this seems crazy, but I've looked long and hard and cannot
> explain this as other than a Tomcat bug.  I'm more than willing to dig up
> extra information where I can, but this is a very rare problem seen in
> production and not reproducible at will.  Also, because this is in
> production, there is very little I can change, and even if I did change it,
> it would take a month or more to verify that it is really a fix.  So what
> I'm really looking for is a detailed understanding of the problem so we can
> detect it when it happens and clean up the after-effects.
>
> What we are seeing is one POST request being handled twice by our web app.
>  It looks as though somehow the request is duplicated by Tomcat.  Here's
> our setup:
>
> Apache 2.2.14 using APR 1.3.8 and mod_jk 1.2.28 running on Ubuntu 10.04.3
> as front end, load balancing.
> Tomcat 6.0.29 using built-in AJP/1.3 connector running on Ubuntu 9.04 with
> 64-bit Sun Java 1.5.0_19.  Tomcat is not using APR.
> Tomcat mod_jk connector has keepAliveTimeout=5000 and
> connectionTimeout=60000 (60 seconds).
>
> Here is a simplified timeline of what we see in various logs.  I'm going
> to leave some fields out and change others to protect confidentiality and
> make it easier to follow, but I'll try to keep from changing anything
> important.  Timestamps are just minutes and seconds.
>
>
> 51:48 POST request received by Apache over SSL.  We see the request logged
> in Apache's access.log with this timestamp, but the log entry is actually
> written at 52:08 (or seems to be, since that is the time stamp of the next
> entry in the access.log, though the entries immediately before this one
> have timestamps of 52:02).
>
> 51:48 log4j application logs show processing of the request by the
> application in thread TP-Processor18.   Processing continues with updates
> to database and external systems until:
>
> 51:55 last application log entry from TP-Processor18 shows normal
> completion of processing, which should then return from handleRequest().
>  The database and three external systems the app communicates with all show
> normal processing.
>
> 52:00 log4j application log entries from TP-Processor6 show it beginning
> to handle the request.  Processing continues normally, just as it did in
> TP-Processor18 until:
>
> 52:08 last application log entry from TP-Processor6.  Again external
> systems show normal processing.
>
> 52:08 Tomcat access log shows entry for post request stating it was
> handled by TP-Processor6 in 7283ms
>
> 52:08 mod_jk.log: loadbalancer www.site.com 19.400742
>
> 53:06 Tomcat access log shows entry for post request stating it was
> handled by TP-Processor18 in 77440ms
>
>
> How is it possible that Tomcat has 2 threads handling the same request?
> How can we detect this as opposed to the user legitimately sending the
> same request twice?
>
> Thanks for your help!
> =Jeremy=
>
>
>
>

Reply via email to