Thanks Michael. Yup, keepalive is not the default. It is possible they are going away after nf_conntrack_tcp_timeout_established; I will have to do more digging (it is hard to tell how old a connection is, since there are no visible timers, via netstat at least, on an ESTABLISHED connection)…
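That said, `ss -o` (from iproute2) will show any active timer on a socket, and the conntrack tool can show how much tracking timeout remains; a rough sketch of both, assuming the default native transport port 9042 and that iproute2 and conntrack-tools are available (conntrack typically needs root):

```shell
# List established sockets on the native transport port along with any
# active timer (keepalive/retransmit); netstat doesn't expose these.
ss -o state established '( sport = :9042 )'

# If conntrack-tools is installed, show tracked TCP connections with the
# seconds remaining before nf_conntrack_tcp_timeout_established fires
# (the countdown is the numeric field after "tcp").
conntrack -L -p tcp 2>/dev/null | grep ESTABLISHED | head
```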
This is actually low on my priority list; I was just spending a bit of time trying to track down the source of these errors, which are spamming our server logs quite a lot:

ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

(I originally thought this might be caused by KEEPALIVE, which is when I realized that the connections weren't in keepalive and were building up.) It would be nice if netty would tell us a little about the socket channel in the error message (maybe there is a way to do this by changing log levels, but as I say I haven't had time to go digging there).

I will probably file a JIRA issue to add the setting, since I can't see any particular harm in setting keepalive.

On Apr 9, 2014, at 1:34 PM, Michael Shuler <mich...@pbandjelly.org> wrote:

> On 04/09/2014 12:41 PM, graham sanderson wrote:
>> Michael, it is not that the connections are being dropped, it is that
>> the connections are not being dropped.
> 
> Thanks for the clarification.
> 
>> These server side sockets are ESTABLISHED, even though the client
>> connection on the other side of the network device is long gone. This
>> may well be an issue with the network device (it is valiantly trying
>> to keep the connection alive, it seems).
> 
> Have you tested if they *ever* time out on their own, or do they just
> keep sticking around forever? (Maybe 432000 sec (120 hours), which is
> the default for nf_conntrack_tcp_timeout_established?) Trying out all
> the usage scenarios is really the way to track it down - directly on
> switch, behind/in front of firewall, on/off the VPN.
> 
>> That said, KEEPALIVE on the server side would not be a bad idea. At
>> least then the OS on the server would eventually (probably after 2
>> hours of inactivity) attempt to ping the client. At that point,
>> hopefully something interesting would happen, perhaps causing an error
>> and destroying the server side socket. (Note KEEPALIVE is also good
>> for preventing idle connections from being dropped by other network
>> devices along the way.)
> 
> Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they
> time out after 2 hours, which is the default.
> 
>> rpc_keepalive on the server sets keepalive on the server side
>> sockets for thrift, and is true by default.
>> 
>> There doesn't seem to be a setting for the native protocol.
>> 
>> Note this isn't a huge issue for us; they can be cleaned up by a
>> rolling restart, and this particular case is not production, but
>> related to development/testing against alpha by people working
>> remotely over VPN - and it may well be the VPN's fault in this case…
>> That said (and maybe this is a dev list question), it seems like the
>> option to set keepalive should exist.
> 
> Yeah, but I agree you shouldn't have to restart to clean up connections -
> that's why I think it is lower in the network stack, and that a bit of
> troubleshooting and tuning might be helpful.
> That setting sounds like a good Jira request - keepalive may be the
> default, I'm not sure. :)
> 
> -- 
> Michael
> 
>> On Apr 9, 2014, at 12:25 PM, Michael Shuler <mich...@pbandjelly.org>
>> wrote:
>> 
>>> On 04/09/2014 11:39 AM, graham sanderson wrote:
>>>> Thanks, but I would think that just sets keepalive from the
>>>> client end; I'm talking about the server end… This is one of
>>>> those issues where there is something (e.g. switch, firewall, VPN)
>>>> in between the client and the server, and we get left with
>>>> orphaned established connections to the server when the client is
>>>> gone.
>>> 
>>> There would be no server setting for any service, not just c*, that
>>> would correct mis-configured connection-assassinating network gear
>>> between the client and server. Fix the gear to allow persistent
>>> connections.
>>> 
>>> Digging through the various timeouts in c*.yaml didn't lead me to a
>>> simple answer for something tunable, but I think this may be more
>>> basic networking related. I believe it's up to the client to keep
>>> the connection open, as Duy indicated. I don't think c* will
>>> arbitrarily sever connections - something that disconnects the
>>> client may happen. In that case, the TCP connection on the server
>>> should drop to TIME_WAIT. Is this what you are seeing in `netstat
>>> -a` on the server - a bunch of TIME_WAIT connections hanging
>>> around? Those should eventually be recycled, but that's tunable in
>>> the network stack, if they are being generated at a high rate.
>>> 
>>> -- Michael
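For reference, the net.ipv4.tcp_keepalive_* tunables Michael mentions can be inspected without root; a quick sketch (Linux only, values shown in the comments are the usual kernel defaults, and the suggested overrides are illustrative, not recommendations):

```shell
# Read the current TCP keepalive tunables straight from /proc
# (equivalently: sysctl net.ipv4.tcp_keepalive_time, etc.)
for f in time intvl probes; do
    echo "tcp_keepalive_$f = $(cat /proc/sys/net/ipv4/tcp_keepalive_$f)"
done

# Typical defaults: time=7200 (2h idle before the first probe), intvl=75
# (seconds between probes), probes=9 (unanswered probes before the
# connection is declared dead). To probe sooner (requires root):
#   sysctl -w net.ipv4.tcp_keepalive_time=600
#   sysctl -w net.ipv4.tcp_keepalive_intvl=60
#   sysctl -w net.ipv4.tcp_keepalive_probes=5
# Note these only affect sockets that actually set SO_KEEPALIVE - which
# is exactly what the native protocol server sockets currently don't do.
```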