I am having the exact same issue. I see the connections pile up and pile up, but they never seem to come down. Any insight into this would be amazing.
Eric Plowe

On Wed, Apr 9, 2014 at 4:17 PM, graham sanderson <gra...@vast.com> wrote:

> Thanks Michael,
>
> Yup, keepalive is not the default. It is possible they are going away after
> nf_conntrack_tcp_timeout_established; will have to do more digging (it is
> hard to tell how old a connection is - there are no visible timers (thru
> netstat) on an ESTABLISHED connection)...
>
> This is actually low on my priority list; I was just spending a bit of
> time trying to track down the source of
>
> ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833
> ErrorMessage.java (line 222) Unexpected exception during request
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>     at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
>     at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>     at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>     at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
>
> errors, which are spamming our server logs quite a lot (I originally
> thought this might be caused by KEEPALIVE, which is when I realized that
> the connections weren't using keepalive and were building up). It would be
> nice if netty told us a little about the socket channel in the error
> message (maybe there is a way to do this by changing log levels, but as I
> say I haven't had time to go digging there).
>
> I will probably file a JIRA issue to add the setting (since I can't see
> any particular harm to setting keepalive).
>
> On Apr 9, 2014, at 1:34 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
>
> > On 04/09/2014 12:41 PM, graham sanderson wrote:
> >> Michael, it is not that the connections are being dropped, it is that
> >> the connections are not being dropped.
> >
> > Thanks for the clarification.
> >
> >> These server side sockets are ESTABLISHED, even though the client
> >> connection on the other side of the network device is long gone. This
> >> may well be an issue with the network device (it is valiantly trying
> >> to keep the connection alive, it seems).
> >
> > Have you tested if they *ever* time out on their own, or do they just
> > keep sticking around forever? (Maybe 432000 sec (120 hours), which is
> > the default for nf_conntrack_tcp_timeout_established?) Trying out all
> > the usage scenarios is really the way to track it down - directly on
> > switch, behind/in front of firewall, on/off the VPN.
> >
> >> That said, KEEPALIVE on the server side would not be a bad idea. At
> >> least then the OS on the server would eventually (probably after 2
> >> hours of inactivity) attempt to ping the client. At that point
> >> hopefully something interesting would happen, perhaps causing an error
> >> and destroying the server side socket (note KEEPALIVE is also good
> >> for preventing idle connections from being dropped by other network
> >> devices along the way).
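For reference, with the Netty 3.x API that appears in the stack trace above, turning on SO_KEEPALIVE for the accepted (per-client) sockets is essentially a one-line "child" option on the server bootstrap. The sketch below is a minimal standalone illustration, not Cassandra's actual transport setup - the class name, port, thread pools and empty pipeline are placeholders:

import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class KeepAliveServerSketch {
    public static void main(String[] args) {
        ServerBootstrap bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(Executors.newCachedThreadPool(),
                                                  Executors.newCachedThreadPool()));

        // "child.*" options apply to the accepted per-client sockets, which is
        // where SO_KEEPALIVE matters for noticing clients that silently vanish.
        bootstrap.setOption("child.keepAlive", true);

        bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
            public ChannelPipeline getPipeline() {
                return Channels.pipeline(/* protocol handlers would go here */);
            }
        });

        bootstrap.bind(new InetSocketAddress(9042)); // 9042 = native protocol port
    }
}

Once that option is set, it is the kernel's net.ipv4.tcp_keepalive_* settings (discussed below) that decide when an unreachable peer is actually probed and the socket torn down.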
> > Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they
> > time out after 2 hours, which is the default.
> >
> >> rpc_keepalive on the server sets keepalive on the server side
> >> sockets for thrift, and is true by default.
> >>
> >> There doesn't seem to be a setting for the native protocol.
> >>
> >> Note this isn't a huge issue for us; they can be cleaned up by a
> >> rolling restart, and this particular case is not production, but
> >> related to development/testing against alpha by people working
> >> remotely over VPN - and it may well be the VPN's fault in this case...
> >> that said (and maybe this is a dev list question), it seems like the
> >> option to set keepalive should exist.
> >
> > Yeah, but I agree you shouldn't have to restart to clean up connections
> > - that's why I think it is lower in the network stack, and that a bit
> > of troubleshooting and tuning might be helpful. That setting sounds
> > like a good Jira request - keepalive may be the default, I'm not sure. :)
> >
> > --
> > Michael
> >
> >> On Apr 9, 2014, at 12:25 PM, Michael Shuler <mich...@pbandjelly.org>
> >> wrote:
> >>
> >>> On 04/09/2014 11:39 AM, graham sanderson wrote:
> >>>> Thanks, but I would think that just sets keepalive from the
> >>>> client end; I'm talking about the server end... this is one of
> >>>> those issues where there is something (e.g. switch, firewall, VPN)
> >>>> in between the client and the server, and we get left with
> >>>> orphaned established connections to the server when the client is
> >>>> gone.
> >>>
> >>> There would be no server setting for any service, not just c*, that
> >>> would correct mis-configured connection-assassinating network gear
> >>> between the client and server. Fix the gear to allow persistent
> >>> connections.
> >>>
> >>> Digging through the various timeouts in c*.yaml didn't lead me to a
> >>> simple answer for something tunable, but I think this may be more
> >>> basic networking related. I believe it's up to the client to keep
> >>> the connection open, as Duy indicated. I don't think c* will
> >>> arbitrarily sever connections - something that disconnects the
> >>> client may happen. In that case, the TCP connection on the server
> >>> should drop to TIME_WAIT. Is this what you are seeing in `netstat
> >>> -a` on the server - a bunch of TIME_WAIT connections hanging
> >>> around? Those should eventually be recycled, but that's tunable in
> >>> the network stack, if they are being generated at a high rate.
> >>>
> >>> -- Michael
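For reference, these are the net.ipv4.tcp_keepalive_* knobs mentioned above, with their usual Linux defaults (values vary by distribution and kernel, so check with sysctl on the actual host):

net.ipv4.tcp_keepalive_time = 7200    # seconds of idle time before the first probe (the "2 hours" above)
net.ipv4.tcp_keepalive_intvl = 75     # seconds between unanswered probes
net.ipv4.tcp_keepalive_probes = 9     # unanswered probes before the socket is closed with an error

These only affect sockets that actually have SO_KEEPALIVE enabled; with the defaults above, a dead peer would be detected roughly 2 hours 11 minutes (7200 + 9 x 75 seconds) after the connection goes idle.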