Thanks Michael,

Yup, keepalive is not the default. It is possible the connections are going away after 
nf_conntrack_tcp_timeout_established; I will have to do more digging (it is hard 
to tell how old a connection is, since netstat shows no timers on an 
ESTABLISHED connection)…

This is actually low on my priority list; I was just spending a bit of time 
trying to track down the source of 

ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

errors, which are spamming our server logs quite a lot (I originally thought 
these might be caused by KEEPALIVE, which is when I realized that the 
connections weren’t using keepalive and were building up). It would be nice if 
netty told us a little about the socket channel in the error message (maybe 
there is a way to do this by changing log levels, but as I say I haven’t had 
time to go digging there).

I will probably file a JIRA issue to add the setting (since I can’t see any 
particular harm in setting keepalive).
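For reference, the effect the setting would need to have is just SO_KEEPALIVE on the accepted server-side sockets. A minimal sketch with plain java.net (not Cassandra’s actual code, just an illustration of the socket option):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // Bind to an ephemeral port and accept one loopback connection.
        try (ServerSocket server = new ServerSocket(0)) {
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Ask the OS to probe this connection when idle (on Linux,
                // after net.ipv4.tcp_keepalive_time seconds; default 7200).
                accepted.setKeepAlive(true);
                System.out.println("keepalive=" + accepted.getKeepAlive());
            }
        }
    }
}
```

In Netty 3 (which the stack trace above is from), the equivalent server-side knob would be something like `bootstrap.setOption("child.keepAlive", true)` on the ServerBootstrap, so accepted child channels get the option automatically.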

On Apr 9, 2014, at 1:34 PM, Michael Shuler <mich...@pbandjelly.org> wrote:

> On 04/09/2014 12:41 PM, graham sanderson wrote:
>> Michael, it is not that the connections are being dropped, it is that
>> the connections are not being dropped.
> 
> Thanks for the clarification.
> 
>> These server side sockets are ESTABLISHED, even though the client
>> connection on the other side of the network device is long gone. This
>> may well be an issue with the network device (it is valiantly trying
>> to keep the connection alive it seems).
> 
> Have you tested if they *ever* time out on their own, or do they just keep 
> sticking around forever? (maybe 432000 sec (120 hours), which is the default 
> for nf_conntrack_tcp_timeout_established?) Trying out all the usage scenarios 
> is really the way to track it down - directly on switch, behind/in front of 
> firewall, on/off the VPN.
> 
>> That said KEEPALIVE on the server side would not be a bad idea. At
>> least then the OS on the server would eventually (probably after 2
>> hours of inactivity) attempt to ping the client. At that point
>> hopefully something interesting would happen perhaps causing an error
>> and destroying the server side socket (note KEEPALIVE is also good
>> for preventing idle connections from being dropped by other network
>> devices along the way)
> 
> Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they timeout 
> after 2 hours, which is the default.
> 
>> rpc_keepalive on the server sets keep alive on the server side
>> sockets for thrift, and is true by default
>> 
>> There doesn’t seem to be a setting for the native protocol
>> 
>> Note this isn’t a huge issue for us, they can be cleaned up by a
>> rolling restart, and this particular case is not production, but
>> related to development/testing against alpha by people working
>> remotely over VPN - and it may well be the VPNs fault in this case…
>> that said and maybe this is a dev list question, it seems like the
>> option to set keepalive should exist.
> 
> Yeah, but I agree you shouldn't have to restart to clean up connections - 
> that's why I think it is lower in the network stack, and that a bit of 
> troubleshooting and tuning might be helpful. That setting sounds like a good 
> Jira request - keepalive may be the default, I'm not sure. :)
> 
> -- 
> Michael
> 
>> On Apr 9, 2014, at 12:25 PM, Michael Shuler <mich...@pbandjelly.org>
>> wrote:
>> 
>>> On 04/09/2014 11:39 AM, graham sanderson wrote:
>>>> Thanks, but I would think that just sets keep alive from the
>>>> client end; I’m talking about the server end… this is one of
>>>> those issues where there is something (e.g. switch, firewall, VPN
>>>> in between the client and the server) and we get left with
>>>> orphaned established connections to the server when the client is
>>>> gone.
>>> 
>>> There would be no server setting for any service, not just c*, that
>>> would correct mis-configured connection-assassinating network gear
>>> between the client and server. Fix the gear to allow persistent
>>> connections.
>>> 
>>> Digging through the various timeouts in c*.yaml didn't lead me to a
>>> simple answer for something tunable, but I think this may be more
>>> basic networking related. I believe it's up to the client to keep
>>> the connection open as Duy indicated. I don't think c* will
>>> arbitrarily sever connections - something that disconnects the
>>> client may happen. In that case, the TCP connection on the server
>>> should drop to TIME_WAIT. Is this what you are seeing in `netstat
>>> -a` on the server - a bunch of TIME_WAIT connections hanging
>>> around? Those should eventually be recycled, but that's tunable in
>>> the network stack, if they are being generated at a high rate.
>>> 
>>> -- Michael
>> 
> 
