On 17/11/2011 08:48, Jeremy wrote: > Mark, thanks for taking a crack at this. I'm not that familiar with how > mod_jk and AJP work under the covers, but based on my level of > understanding, your scenario comes close but does not exactly match the > logging we see. My detailed comments below. > > On Wed, Nov 16, 2011 at 11:53 PM, Mark Thomas <ma...@apache.org> wrote: > >> On 17/11/2011 01:29, Jeremy wrote: >>> OK, I know this seems crazy, but I've looked long and hard and cannot >>> explain this as other than a Tomcat bug. I'm more than willing to dig up >>> extra information where I can, but this is a very rare problem seen in >>> production and not reproducible at will. Also, because this is in >>> production, there is very little I can change, and even if I did change >> it, >>> it would take a month or more to verify that it is really a fix. So what >>> I'm really looking for is a detailed understanding of the problem so we >> can >>> detect it when it happens and clean up the after-effects. >>> >>> What we are seeing is one POST request being handled twice by our web >> app. >>> It looks as though somehow the request is duplicated by Tomcat. Here's >>> our setup: >>> >>> Apache 2.2.14 using APR 1.3.8 and mod_jk 1.2.28 running on Ubuntu 10.04.3 >>> as front end, load balancing. >>> Tomcat 6.0.29 using built-in AJP/1.3 connector running on Ubuntu 9.04 >> with >>> 64-bit Sun Java 1.5.0_19. Tomcat is not using APR. >>> Tomcat mod_jk connector has keepAliveTimeout=5000 and >>> connectionTimeout=60000 (60 seconds). >>> >>> Here is a simplified timeline of what we see in various logs. I'm going >> to >>> leave some fields out and change others to protect confidentiality and >> make >>> it easier to follow, but I'll try to keep from changing anything >> important. >>> Timestamps are just minutes and seconds. >>> >>> >>> 51:48 POST request received by Apache over SSL. We see the request >> logged >>> in Apache's access.log with this timestamp, but the log entry is actually >>> written at 52:08 (or seems to be, since that is the time stamp of the >> next >>> entry in the access.log, though the entries immediately before this one >>> have timestamps of 52:02). >>> >>> 51:48 log4j application logs show processing of the request by the >>> application in thread TP-Processor18. Processing continues with updates >>> to database and external systems until: >>> >>> 51:55 last application log entry from TP-Processor18 shows normal >>> completion of processing, which should then return from handleRequest(). >>> The database and three external systems the app communicates with all >> show >>> normal processing. >>> >>> 51:55 log4j application log entries from TP-Processor6 show it beginning >> to >>> handle the request. Processing continues normally, just as it did in >>> TP-Processor18 until: >>> >>> 52:08 last application log entry from TP-Processor6. Again external >>> systems show normal processing. >>> >>> 52:08 Tomcat access log shows entry for post request stating it was >> handled >>> by TP-Processor6 in 7283ms >>> >>> 52:08 mod_jk.log: loadbalancer www.site.com 19.400742 >>> >>> 53:06 Tomcat access log shows entry for post request stating it was >> handled >>> by TP-Processor18 in 77440ms >>> >>> >>> How is it possible that Tomcat has 2 threads handling the same request? >> >> Because I think there are two requests here, rather than one. >> >> My guess as to what is happening is as follows: >> - httpd sends request to Tomcat >> - Tomcat processes it >> - network glitch or similar means httpd never receives the response >> > > I'm with you up to here. However, if the response were sent, then I'd > expect to see TP-Processor18 log its response in the Tomcat access log at > 51:55. Not some 71 seconds later,
It depends. If the response wasn't fully sent you could be seeing a retry timeout of some form here. > Also we have mod_jk set up with a reply timeout of 0, which means a retry > would be triggered by something other than a timeout. Ah. OK. That probably rules out that option then. > We do have mod_jk > retry set to 2 on he Apache servers; not sure how that is logged or what > protections are in place against creating this problem. Someone more familiar with the mod_jk code will hopefully suggest an appropriate log level to get some insight into this. > > - httpd resends request to Tomcat >> - Tomcat processes it >> - httpd receives the response >> - httpd sends response to user >> >> Examining the mod_jk logs on httpd should show this although I don't >> know what level of logging you'll need to see it. >> > > Me neither. We're using cping/cpong to test the connectors before sending > requests. Those do sometimes fail, but I think that keeps the request > from being sent in the first place. The mod_jk log just shows the one > request taking 19 seconds, doesn't show anything about retrying. > > I'm particularly curious about how TP-Processor6 manages to take over just > as TP-Processor18 should be sending its response and how TP-Processor18's > response ends up getting logged over a minute later. > >> >> The only hitch with the above is that I'd expect httpd to take longer to >> realise that there is a problem before re-trying the request. >> >> > Yep, me too. > > >>> How can we detect this as opposed to the user legitimately sending the >> same >>> request twice? >> >> Use mod_unique_id with mod_rewrite to add the unique ID as a url >> parameter and then include this in Tomcat's access log and/or write a >> filter that checks for duplicates. >> >> > Checking for duplicates is very problematic when the rest of the system is > not performing as expected. If the first request failed and is being > automatically retried as a recover step, then we want to process the > duplicate. Or maybe the first request is still in process and has not yet > persisted the record that it has handled this ID before the second thread > starts processing it and checks to see if has already been handled. The > current implementation has several checks to prevent duplicate processing > and they are all failing in this instance, which is part of why I'm asking > for help on this forum. Even though the logs suggest the first request was > handled normally, it appears the database changes made by the first request > processor are not visible to the second request processor at the time it > starts processing the request. I have to dig into that some more, but of > course it's secondary to why we're getting these duplicate requests > internally generated by Tomcat. > > >> Thanks for your help! >> >> The situation you describe could also be triggered by a Tomcat bug. You >> might want to try protocol="org.apache.coyote.ajp.AjpProtocol" on the >> connector to switch the the Coyote AJP BIO implementation. >> >> > We're currently using protocol="AJP/1.3" and do not have APR installed, so > I do not things changing to protocol="org.apache.coyote.ajp.AjpProtocol" > would actually change anything. Yes, it will. You are currently using: org.apache.jk.server.JkCoyoteHandler I'm suggesting a switch to: org.apache.coyote.ajp.AjpProtocol not the APR/native AJP connector which is: org.apache.coyote.ajp.AjpAprProtocol An upgrade to the next 6.0.35 when released is also worth considering. There have been some fixes to the connectors that might help (e.g. the fixes for CVE-2011-3190). Something to look for in your logs is what happens with the processors in the request before the one that looks odd. Is there a pattern of, for example, failed requests there? Note that there is not a 1 to 1 mapping between request processor objects and processor threads. That makes this tricky to be 100% sure but you might see something. HTH, Mark --------------------------------------------------------------------- To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org For additional commands, e-mail: users-h...@tomcat.apache.org