Frantz Mazoyer created THRIFT-3313:
--------------------------------------

             Summary: Thrift java server hogs 100% CPU and clients are stuck
                 Key: THRIFT-3313
                 URL: https://issues.apache.org/jira/browse/THRIFT-3313
             Project: Thrift
          Issue Type: Bug
          Components: Java - Library
    Affects Versions: 0.9.2, 0.9.1, 0.9, 0.8, 0.7
         Environment: Storm 0.9.5 (nimbus)
            Reporter: Frantz Mazoyer

Testing environment is Storm 0.9.5 / thrift java 0.7.

Test scenario:
Deploy a storm topology in a loop. When the nimbus cleanup timeout is reached, the thrift server throws an error:
"Exception while invoking ..." ... TException

Test result:
The thrift java server goes to 100% CPU in an infinite loop.

jstack:
{code}
"Thread-5" prio=10 tid=0x00007fb134aab800 nid=0x6767 runnable [0x00007fb129c9b000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	...
	at org.apache.thrift7.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:284)
{code}

strace:
{code}
epoll_wait(70, {{EPOLLIN, {u32=866, u64=866}}, {EPOLLIN, {u32=876, u64=876}}}, 4096, 4294967295) = 2
{code}

Investigation and tests show that:
Any Exception thrown during the processor execution bypasses the call to
{code}
responseReady()
{code}
so the statement
{code}
readBufferBytesAllocated.addAndGet(-buffer_.array().length);
{code}
never runs and the counter is not decremented by the size of the request buffer.

After enough failed requests, the counter comes close to the maximum value MAX_READ_BUFFER_BYTES. Every subsequent request is then delayed forever, because the following test in
{code}
read()
{code}
is always true:
{code}
if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES)
{code}

In the end, the server thread loops in select(), which wakes up immediately for read() because the content of the socket was never drained. The thread loops forever between select() and the read() method above, which keeps the server thread at 100% CPU. Moreover, all client requests are stuck forever.
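
To make the accounting problem easier to follow, here is a minimal standalone sketch of it. This is a toy model, not the Thrift source: the class ReadBufferAccounting, the methods tryReserve, invokeLeaky, invokeGuarded and admitted, and the 1024-byte limit are all made up for illustration. Only the names readBufferBytesAllocated and MAX_READ_BUFFER_BYTES and the shape of the admission test come from the description above. It compares a release that is skipped when the processor throws with one possible guard, a release placed in a finally block.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of the read-buffer accounting described above; not Thrift's actual code. */
class ReadBufferAccounting {
    // Hypothetical small limit so the effect shows up after a few requests.
    static final long MAX_READ_BUFFER_BYTES = 1024;

    // Bytes currently reserved for in-flight request frames.
    final AtomicLong readBufferBytesAllocated = new AtomicLong(0);

    /** Same shape as the test quoted from read(): reserve only if the frame fits. */
    boolean tryReserve(int frameSize) {
        if (readBufferBytesAllocated.get() + frameSize > MAX_READ_BUFFER_BYTES) {
            return false; // frame is deferred; in the real server the socket stays readable
        }
        readBufferBytesAllocated.addAndGet(frameSize);
        return true;
    }

    /** Leaky variant: the release is skipped whenever the processor throws. */
    void invokeLeaky(int frameSize, Runnable processor) {
        processor.run();
        readBufferBytesAllocated.addAndGet(-frameSize);
    }

    /** Guarded variant: the release runs in finally, so a failure cannot leak bytes. */
    void invokeGuarded(int frameSize, Runnable processor) {
        try {
            processor.run();
        } finally {
            readBufferBytesAllocated.addAndGet(-frameSize);
        }
    }

    /** Sends `requests` requests that all fail and returns how many were admitted. */
    static int admitted(boolean guarded, int requests, int frameSize) {
        ReadBufferAccounting server = new ReadBufferAccounting();
        Runnable failing = () -> { throw new IllegalStateException("processor failed"); };
        int count = 0;
        for (int i = 0; i < requests; i++) {
            if (!server.tryReserve(frameSize)) {
                continue; // deferred: the counter never drops, so this never recovers
            }
            count++;
            try {
                if (guarded) {
                    server.invokeGuarded(frameSize, failing);
                } else {
                    server.invokeLeaky(frameSize, failing);
                }
            } catch (IllegalStateException e) {
                // The failure is swallowed here; a server would log it and carry on.
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("leaky admitted:   " + admitted(false, 10, 256));
        System.out.println("guarded admitted: " + admitted(true, 10, 256));
    }
}
```

With ten failing 256-byte requests against the 1024-byte toy limit, the leaky variant admits four requests and defers the rest, while the guarded variant admits all ten. Whether a finally block is the right place for the release in TNonblockingServer itself is a separate question; the sketch only shows why a skipped decrement eventually blocks every new frame.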