Hi Team,
My team is facing an unresponsive-server issue under the following circumstances:

1. Environment:
 Apache Tomcat: 8.5.15, JDK: 1.8.0_121

2. Tomcat configuration:
 APR enabled: protocol="org.apache.coyote.http11.Http11AprProtocol"
 maxThreads of the Executor set to: maxThreads="1200"

3. The web server was under massive load.
 All requests were HTTP/1.1 requests and were marked with a "Connection:
close" HTTP header.
 At this point, the web server showed some latency in its responses, but it
was still running.

4. Then some "keep-alive" requests started coming in.
 Those requests were marked with a "Connection: keep-alive" HTTP header.

5. The subsequent non-keep-alive requests got no response for a long time.
 We took a thread dump with jstack at that point and saw that the Acceptor
thread was WAITING:
 "http-apr-8080-Acceptor@5830" daemon prio=5 tid=0x27 nid=NA waiting
  java.lang.Thread.State: WAITING
 at sun.misc.Unsafe.park(Unsafe.java:-1)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at
org.apache.tomcat.util.threads.LimitLatch.countUpOrAwait(LimitLatch.java:117)
 at
org.apache.tomcat.util.net.AbstractEndpoint.countUpOrAwaitConnection(AbstractEndpoint.java:1292)
 at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:83)
 at java.lang.Thread.run(Thread.java:748)

 But the Poller thread was RUNNABLE:
 "http-apr-8080-Poller@5809" daemon prio=5 tid=0x25 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
 at org.apache.tomcat.jni.Poll.poll(Poll.java:-1)
 at org.apache.tomcat.util.net.AprEndpoint$Poller.run(AprEndpoint.java:1488)
 at java.lang.Thread.run(Thread.java:748)

 Meanwhile, all the Executor threads were WAITING like this:
 "http-apr-8080-exec-21@6078" daemon prio=5 tid=0x3f nid=NA waiting
  java.lang.Thread.State: WAITING
 at sun.misc.Unsafe.park(Unsafe.java:-1)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
 at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:89)
 at org.apache.tomcat.util.threads.TaskQueue.poll(TaskQueue.java:33)
 at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 at java.lang.Thread.run(Thread.java:748)

We dug into the source code and found some clues:

1. The Acceptor thread is parked by the LimitLatch. This means the LimitLatch
is not being released, which in turn means that the previous connections are
never closed in our scenario. (AprEndpoint.java#L955)

2. The close logic takes place in the Poller thread.
(AprEndpoint.java#L1624)

3. If a polling call takes a long time, the Poller thread is effectively
blocked (it still shows as RUNNABLE, but it can be stuck inside the native
poll method), and the destroySocket calls are delayed.
(AprEndpoint.java#L1680)

4. The Acceptor processes new connections directly instead of registering
them with the poller (AprEndpoint.java#L2268). So in this case pollerSpace[i]
always equals actualPollerSize (AprEndpoint.java#L1679) and "nextPollerTime"
keeps growing. When some "keep-alive" requests arrive, the Handler
implementation processes those connections and registers each of them back
with the poller (AbstractProtocol.java#L933). Now pollerSpace has changed,
and the Poller polls the sockets with the huge accumulated "nextPollerTime"
value, so the Poller thread can be blocked for a very long time. (A small
simulation of this arithmetic follows this list.)
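
For illustration, here is a tiny self-contained simulation of the arithmetic
we believe is happening. This is NOT the Tomcat code; the 100000-microsecond
base timeout and the iteration count are assumptions based on our reading of
the 8.5.15 AprEndpoint source:

    // Toy simulation of how "nextPollerTime" can accumulate while the poll
    // sets stay empty, as we believe happens in AprEndpoint$Poller.run().
    public class NextPollerTimeSim {
        public static void main(String[] args) {
            long pollerTime = 100_000L;   // base poll timeout in microseconds (assumed default)
            long nextPollerTime = pollerTime;
            int skippedPolls = 3_000;     // illustrative number of loop passes in which
                                          // pollerSpace[i] == actualPollerSize
            for (int i = 0; i < skippedPolls; i++) {
                // Empty poll set: the native poll is skipped, and the timeout
                // intended for the next non-empty poll set keeps growing.
                nextPollerTime += pollerTime;
            }
            // Once a keep-alive socket is finally registered, Poll.poll() is
            // called with this accumulated timeout and the Poller blocks.
            System.out.printf("next Poll.poll() could block for up to ~%d seconds%n",
                    nextPollerTime / 1_000_000L);
        }
    }

With these made-up numbers the next poll call could block for roughly five
minutes; the real growth rate depends on how fast the Poller loop spins.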

To prove this, we set up a similar environment to reproduce the issue:

1. Use the official Tomcat Docker image (tomcat:8.5.15) to run a simple HTTP
server.

2. Change the Connector config (maxConnections is kept small so the test can
easily reach the connection limit):
    <Connector
    port="8080"
    protocol="org.apache.coyote.http11.Http11AprProtocol"
    connectionTimeout="20000"
    redirectPort="8443"
    maxConnections="20" />

3. Keep sending lots of "non-keep-alive" requests for several minutes:
 $ wrk2 -t8 -c32 -d10m -R128 -s ./closed.lua http://127.0.0.1:8080/hello?latency=10

4. Send a single "keep-alive" request and do not close the connection on the
client side (see the client sketch after this list).

5. After that, send another "non-keep-alive" request. No response comes back
within a reasonable time (we waited 30 seconds).
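
For steps 4 and 5, a minimal client sketch along these lines can be used. The
host, port and latency parameter match the wrk2 command above; the class name,
the 2-second pause and the 30-second read timeout are just illustrative:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.net.SocketTimeoutException;
    import java.nio.charset.StandardCharsets;

    public class ReproClient {
        public static void main(String[] args) throws Exception {
            // Step 4: open a keep-alive connection and leave it open.
            Socket keepAlive = new Socket("127.0.0.1", 8080);
            sendRequest(keepAlive, "keep-alive");
            Thread.sleep(2_000); // give Tomcat time to put the socket back on the poller

            // Step 5: a plain non-keep-alive request should now hang.
            try (Socket oneShot = new Socket("127.0.0.1", 8080)) {
                oneShot.setSoTimeout(30_000);      // stop waiting after 30 seconds
                sendRequest(oneShot, "close");
                InputStream in = oneShot.getInputStream();
                int firstByte = in.read();         // blocks when the issue occurs
                System.out.println(firstByte == -1 ? "connection closed" : "got a response");
            } catch (SocketTimeoutException e) {
                System.out.println("no response within 30 seconds - hang reproduced");
            } finally {
                keepAlive.close();
            }
        }

        private static void sendRequest(Socket socket, String connection) throws Exception {
            OutputStream out = socket.getOutputStream();
            String request = "GET /hello?latency=10 HTTP/1.1\r\n"
                    + "Host: 127.0.0.1:8080\r\n"
                    + "Connection: " + connection + "\r\n"
                    + "\r\n";
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
        }
    }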

A workaround:
By setting deferAccept="false" in the Connector configuration, we can force
the Acceptor to register new connections with the
poller (AprEndpoint.java#L2250), so "nextPollerTime" no longer grows out of
control.
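
For example, the test Connector above with the workaround applied:
    <Connector
    port="8080"
    protocol="org.apache.coyote.http11.Http11AprProtocol"
    connectionTimeout="20000"
    redirectPort="8443"
    deferAccept="false"
    maxConnections="20" />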

So, is this a real issue in Tomcat?



Best Regards,

Chang Liu
