Mark, thanks for taking a crack at this.  I'm not that familiar with how
mod_jk and AJP work under the covers, but based on my level of
understanding, your scenario comes close but does not exactly match the
logging we see.  My detailed comments below.

On Wed, Nov 16, 2011 at 11:53 PM, Mark Thomas <ma...@apache.org> wrote:

> On 17/11/2011 01:29, Jeremy wrote:
> > OK, I know this seems crazy, but I've looked long and hard and cannot
> > explain this as other than a Tomcat bug.  I'm more than willing to dig up
> > extra information where I can, but this is a very rare problem seen in
> > production and not reproducible at will.  Also, because this is in
> > production, there is very little I can change, and even if I did change
> it,
> > it would take a month or more to verify that it is really a fix.  So what
> > I'm really looking for is a detailed understanding of the problem so we
> can
> > detect it when it happens and clean up the after-effects.
> >
> > What we are seeing is one POST request being handled twice by our web
> app.
> >  It looks as though somehow the request is duplicated by Tomcat.  Here's
> > our setup:
> >
> > Apache 2.2.14 using APR 1.3.8 and mod_jk 1.2.28 running on Ubuntu 10.04.3
> > as front end, load balancing.
> > Tomcat 6.0.29 using built-in AJP/1.3 connector running on Ubuntu 9.04
> with
> > 64-bit Sun Java 1.5.0_19.  Tomcat is not using APR.
> > Tomcat mod_jk connector has keepAliveTimeout=5000 and
> > connectionTimeout=60000 (60 seconds).
> >
> > Here is a simplified timeline of what we see in various logs.  I'm going
> to
> > leave some fields out and change others to protect confidentiality and
> make
> > it easier to follow, but I'll try to keep from changing anything
> important.
> >  Timestamps are just minutes and seconds.
> >
> >
> > 51:48 POST request received by Apache over SSL.  We see the request
> logged
> > in Apache's access.log with this timestamp, but the log entry is actually
> > written at 52:08 (or seems to be, since that is the time stamp of the
> next
> > entry in the access.log, though the entries immediately before this one
> > have timestamps of 52:02).
> >
> > 51:48 log4j application logs show processing of the request by the
> > application in thread TP-Processor18.   Processing continues with updates
> > to database and external systems until:
> >
> > 51:55 last application log entry from TP-Processor18 shows normal
> > completion of processing, which should then return from handleRequest().
> >  The database and three external systems the app communicates with all
> show
> > normal processing.
> >
> > 51:55 log4j application log entries from TP-Processor6 show it beginning
> to
> > handle the request.  Processing continues normally, just as it did in
> > TP-Processor18 until:
> >
> > 52:08 last application log entry from TP-Processor6.  Again external
> > systems show normal processing.
> >
> > 52:08 Tomcat access log shows entry for post request stating it was
> handled
> > by TP-Processor6 in 7283ms
> >
> > 52:08 mod_jk.log: loadbalancer www.site.com 19.400742
> >
> > 53:06 Tomcat access log shows entry for post request stating it was
> handled
> > by TP-Processor18 in 77440ms
> >
> >
> > How is it possible that Tomcat has 2 threads handling the same request?
>
> Because I think there are two requests here, rather than one.
>
> My guess as to what is happening is as follows:
> - httpd sends request to Tomcat
> - Tomcat processes it
> - network glitch or similar means httpd never receives the response
>

I'm with you up to here.  However, if the response were sent, then I'd
expect to see  TP-Processor18 log its response in the Tomcat access log at
51:55.  Not some 71 seconds later,
Also we have mod_jk set up with a reply timeout of 0, which means a retry
would be triggered by something other than a timeout.  We do have mod_jk
retry set to 2 on he Apache servers; not sure how that is logged or what
protections are in place against creating this problem.

- httpd resends request to Tomcat
> - Tomcat processes it
> - httpd receives the response
> - httpd sends response to user
>
> Examining the mod_jk logs on httpd should show this although I don't
> know what level of logging you'll need to see it.
>

Me neither.  We're using cping/cpong to test the connectors before sending
requests.   Those do sometimes fail, but I think that keeps the request
from being sent in the first place.  The mod_jk log just shows the one
request taking 19 seconds, doesn't show anything about retrying.

I'm particularly curious about how TP-Processor6 manages to take over just
as TP-Processor18 should be sending its response and how TP-Processor18's
response ends up getting logged over a minute later.

>
> The only hitch with the above is that I'd expect httpd to take longer to
> realise that there is a problem before re-trying the request.
>
>
Yep, me too.


> > How can we detect this as opposed to the user legitimately sending the
> same
> > request twice?
>
> Use mod_unique_id with mod_rewrite to add the unique ID as a url
> parameter and then include this in Tomcat's access log and/or write a
> filter that checks for duplicates.
>
>
Checking for duplicates is very problematic when the rest of the system is
not performing as expected.  If the first request failed and is being
automatically retried as a recover step, then we want to process the
duplicate.  Or maybe the first request is still in process and has not yet
persisted the record that it has handled this ID before the second thread
starts processing it and checks to see if has already been handled.  The
current implementation has several checks to prevent duplicate processing
and they are all failing in this instance, which is part of why I'm asking
for help on this forum.  Even though the logs suggest the first request was
handled normally, it appears the database changes made by the first request
processor are not visible to the second request processor at the time it
starts processing the request.  I have to dig into that some more, but of
course it's secondary to why we're getting these duplicate requests
internally generated by Tomcat.


> Thanks for your help!
>
> The situation you describe could also be triggered by a Tomcat bug. You
> might want to try protocol="org.apache.coyote.ajp.AjpProtocol" on the
> connector to switch the the Coyote AJP BIO implementation.
>
>
We're currently using protocol="AJP/1.3" and do not have APR installed, so
I do not things changing to  protocol="org.apache.coyote.ajp.AjpProtocol"
would actually change anything.

Thanks,
=Jeremy=

Reply via email to