Hello,
Using Tomcat 8.0.23 and Tomcat 8.0.30 with Java 1.7.0_25 on CentOS 5.11,
we are getting a stuck thread issue when our server is under high(er) load.
It seems to happen when one of our non-container threads invokes
AsyncContext.complete() while the AsyncStateMachine class' state member
is set to STARTING instead of STARTED. This results in a call to
Object.wait() in the pauseNonContainerThread() method, and the thread
never seems to recover from this.
I tried a few things in our own code to get around this:
1. If HttpServletRequest.isAsyncStarted() returns true, invoke
   AsyncContext.complete(); if it returns false, sleep for an
   arbitrary time and try again. Even when
   HttpServletRequest.isAsyncStarted() returned true before I invoked
   AsyncContext.complete(), we still occasionally get the described
   problem. I'm not sure whether HttpServletRequest.isAsyncStarted()
   correlates with the AsyncStateMachine class' state member.
2. I tried a similar approach using the AsyncListener class, but again
   without consistent success.
3. I tried enforcing an arbitrary delay before invoking
   AsyncContext.complete(), even when HttpServletRequest.isAsyncStarted()
   returned true, but again without consistent success.
I also tried a change in Tomcat's source code to get around this:
1. Instead of invoking Object.wait(), I invoked Object.wait(long timeout)
   and checked whether the state member ever changes from STARTING to
   STARTED in this situation. This didn't help either; the state member
   seems to be stuck in STARTING.
    private synchronized void pauseNonContainerThread() {
        while (!ContainerThreadMarker.isContainerThread() &&
                state.getPauseNonContainerThread()) {
            try {
                System.out.println("*** pauseNonContainerThread *** (before wait :: state: '" + state + "')");
                wait(500); // changed from wait()
                System.out.println("*** pauseNonContainerThread *** (after wait)");
            } catch (InterruptedException e) {
                // TODO Log this?
            }
        }
    }
I can replicate the issue, but it takes quite a few load test runs
before it shows up. It also seems to be harder to replicate on
Tomcat 8.0.30 than on Tomcat 8.0.23. I don't have an isolated
test case (yet) that I can upload somewhere. However, I can provide logs
and thread dumps if needed.
In addition, looking at the Tomcat source code, I couldn't figure out
what is supposed to invoke notify() to wake up the non-container thread
once Tomcat has put it into wait(). Since Tomcat is the one that
invokes wait(), I would expect Tomcat to also be responsible for
invoking notify() once the right conditions are met. I'm not saying
this is the problem, though; I'm just trying to understand what should
happen when you get into this state.
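To make the question concrete, here is a minimal sketch in plain Java of the wait/notify pattern being described. It is only a simplified model and not Tomcat's AsyncStateMachine: the class PauseSketch, its State enum, and the methods pauseWhileStarting() and markStarted() are hypothetical names made up for illustration. It shows a thread parked in a guarded wait() loop, which can only leave the loop after some other thread changes the state and calls notify()/notifyAll() on the same monitor.

```java
/** Simplified model of a guarded wait. This is NOT Tomcat's AsyncStateMachine. */
public class PauseSketch {

    // Hypothetical two-state model; the real state machine has more states.
    enum State { STARTING, STARTED }

    private State state = State.STARTING;

    /** Called by the "non-container" thread: blocks while the state is STARTING. */
    public synchronized void pauseWhileStarting() throws InterruptedException {
        while (state == State.STARTING) {
            // Releases the monitor and parks the calling thread until another
            // thread calls notify()/notifyAll() on this object.
            wait();
        }
    }

    /** Called by the "container" thread: changes the state and wakes all waiters. */
    public synchronized void markStarted() {
        state = State.STARTED;
        notifyAll();
    }

    /** Runs one waiter and one notifier; returns true if the waiter was released. */
    public static boolean demo() {
        final PauseSketch sketch = new PauseSketch();
        Thread worker = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    sketch.pauseWhileStarting();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();
        try {
            Thread.sleep(100);      // give the worker time to reach wait()
            sketch.markStarted();   // without this call the worker waits forever
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker released: " + demo());
    }
}
```

In this model, if markStarted() is never called, the worker stays parked indefinitely, and replacing wait() with wait(500) only makes it re-check the unchanged state every 500 ms. That matches the behaviour described above, but whether the same thing is happening inside Tomcat is an assumption, not something this sketch proves.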
Regards,
Jeroen...