OK, we figured it out. It's a case of too many timeout settings and not having a real DevOps person on hand. There was an obvious error message in Apache's mod_jk.log that I failed to correlate with the problem because I misread the timestamp on one of the many log entries. Doh!
[info] ajp_connection_tcp_get_message::jk_ajp_common.c (1150): (node5) can't receive the response header message from tomcat, network problems or tomcat (10.xx.xx.xx:8009) is down (errno=11) [error] ajp_get_reply::jk_ajp_common.c (1962): (node5) Tomcat is down or refused connection. No response has been sent to the client (yet) [info] ajp_service::jk_ajp_common.c (2447): (node5) sending request to tomcat failed (recoverable), (attempt=1) There is, I now see, a socket_timeout and a socket_connect_timeout that do not show up in the jkmanager status page, in addition to connection_pool_timeout, connect_timeout, prepost_timeout, and reply_timeout which are listed by jkmanager. We had socket_timeout set to 10 seconds and I didn't know it. Our transactions only take longer than 10 seconds a few times a week, so that's why we weren't seeing it that often. We'll fix by setting: socket_timeout=90 socket_connection_timeout=5000 retry_options=25 unless someone has a better idea. Thanks! =Jeremy= On Wed, Nov 16, 2011 at 5:29 PM, Jeremy <asfbugzi...@nuru.net> wrote: > OK, I know this seems crazy, but I've looked long and hard and cannot > explain this as other than a Tomcat bug. I'm more than willing to dig up > extra information where I can, but this is a very rare problem seen in > production and not reproducible at will. Also, because this is in > production, there is very little I can change, and even if I did change it, > it would take a month or more to verify that it is really a fix. So what > I'm really looking for is a detailed understanding of the problem so we can > detect it when it happens and clean up the after-effects. > > What we are seeing is one POST request being handled twice by our web app. > It looks as though somehow the request is duplicated by Tomcat. Here's > our setup: > > Apache 2.2.14 using APR 1.3.8 and mod_jk 1.2.28 running on Ubuntu 10.04.3 > as front end, load balancing. > Tomcat 6.0.29 using built-in AJP/1.3 connector running on Ubuntu 9.04 with > 64-bit Sun Java 1.5.0_19. Tomcat is not using APR. > Tomcat mod_jk connector has keepAliveTimeout=5000 and > connectionTimeout=60000 (60 seconds). > > Here is a simplified timeline of what we see in various logs. I'm going > to leave some fields out and change others to protect confidentiality and > make it easier to follow, but I'll try to keep from changing anything > important. Timestamps are just minutes and seconds. > > > 51:48 POST request received by Apache over SSL. We see the request logged > in Apache's access.log with this timestamp, but the log entry is actually > written at 52:08 (or seems to be, since that is the time stamp of the next > entry in the access.log, though the entries immediately before this one > have timestamps of 52:02). > > 51:48 log4j application logs show processing of the request by the > application in thread TP-Processor18. Processing continues with updates > to database and external systems until: > > 51:55 last application log entry from TP-Processor18 shows normal > completion of processing, which should then return from handleRequest(). > The database and three external systems the app communicates with all show > normal processing. > > 52:00 log4j application log entries from TP-Processor6 show it beginning > to handle the request. Processing continues normally, just as it did in > TP-Processor18 until: > > 52:08 last application log entry from TP-Processor6. Again external > systems show normal processing. > > 52:08 Tomcat access log shows entry for post request stating it was > handled by TP-Processor6 in 7283ms > > 52:08 mod_jk.log: loadbalancer www.site.com 19.400742 > > 53:06 Tomcat access log shows entry for post request stating it was > handled by TP-Processor18 in 77440ms > > > How is it possible that Tomcat has 2 threads handling the same request? > How can we detect this as opposed to the user legitimately sending the > same request twice? > > Thanks for your help! > =Jeremy= > > > >