Yeah, I saw this a lot when I wasn't closing Thrift connections... but I also saw it when the client would close prematurely and not return the transport to the Thrift transport pool.
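To illustrate the leak pattern described above: a hedged, minimal sketch (not actual Accumulo or Thrift code) of how an exception in the middle of a work unit can skip the close/return of a checked-out transport, pinning a socket open on every retry. `PooledTransport` here is a hypothetical stand-in for Thrift's `TTransport` plus the pool's return step.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TransportLeakSketch {
    static final AtomicInteger openCount = new AtomicInteger();

    // Hypothetical stand-in for a Thrift transport checked out of a pool.
    static class PooledTransport implements AutoCloseable {
        PooledTransport() { openCount.incrementAndGet(); }
        // Simulate the RPC failing mid-work-unit.
        void send() throws Exception { throw new Exception("Connection reset by peer"); }
        @Override public void close() { openCount.decrementAndGet(); } // return to pool
    }

    public static void main(String[] args) {
        // Leaky pattern: the exception skips close(), so the fd stays open.
        try {
            PooledTransport t = new PooledTransport();
            t.send();
            t.close(); // never reached when send() throws
        } catch (Exception ignored) { }
        System.out.println("open after leaky attempt: " + openCount.get()); // 1

        // Safe pattern: try-with-resources closes/returns on every path,
        // including the failure path, so retries don't accumulate sockets.
        try (PooledTransport t = new PooledTransport()) {
            t.send();
        } catch (Exception ignored) { }
        System.out.println("open after safe attempt: " + openCount.get()); // still 1 (only the earlier leak)
    }
}
```

Under retry-forever behavior, the leaky pattern grows the open-fd count by one per attempt, which is consistent with eventually hitting the max-open-files limit.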
In one case I hadn't finished with the work in a thread but kept opening Thrift connections, since it would be 'time sliced' for IO. In that case I opened too many sockets (fds)... maybe hitting max open files because a transport isn't being returned in the middle of a work unit?

On Tue, Aug 30, 2016, 6:12 PM Christopher <ctubb...@apache.org> wrote:

> Thrift is not happy on some replication ITs I've run lately. I had one test
> timeout after 40 minutes... and it never finished. The symptom is lots of
> client side messages about failure to open transport, and the server side
> messages were (and both were occurring a *lot*, indicating indefinite
> retries):
>
> 2016-08-30 19:48:13,476 [rpc.CustomNonBlockingServer$CustomFrameBuffer]
> WARN : Got an IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:142)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:539)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
>         at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:203)
>         at org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
>
> I saw one comment on a mailing list somewhere that indicated this might be
> caused by a client side handling of a custom Thrift Exception, not properly
> closing the connection.
> It's possible we're doing something badly before we retry. I think more
> investigation is needed before I file a JIRA (not even sure what to file it
> against, right now... because I'm not sure what component is even at fault).
>
> In the meantime, has anybody seen this? Does anybody have any insight into
> this? This is all on a single node, running ITs. There really shouldn't be
> any "network" problems which would cause a TCP reset from external to the
> test and Accumulo itself, since it's all localhost.
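One quick way to test the max-open-files hypothesis from earlier in the thread: watch the process's fd count against its limit while the IT runs. A Linux-only sketch; it demonstrates on the current shell (`$$`), and in a real diagnosis you'd point `pid` at the Accumulo server process instead (e.g. via `pgrep` on the process name, which you'd have to fill in).

```shell
# Count open file descriptors for a process via /proc (Linux).
# $$ (this shell) is used here only so the snippet is self-contained;
# substitute the server's pid when diagnosing for real.
pid=$$
count=$(ls "/proc/$pid/fd" | wc -l)
limit=$(ulimit -n)
echo "open fds: $count / limit: $limit"
```

If `count` climbs steadily toward `limit` across client retries, that points at a transport not being returned, rather than a genuine network-level reset.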