[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025600#comment-16025600
 ] 

Benjamin Mahler commented on MESOS-5361:
----------------------------------------

Linking in the executor-related tickets that came up because conntrack 
considers idle connections stale after 5 days.

> Consider introducing TCP KeepAlive for Libprocess sockets.
> ----------------------------------------------------------
>
>                 Key: MESOS-5361
>                 URL: https://issues.apache.org/jira/browse/MESOS-5361
>             Project: Mesos
>          Issue Type: Improvement
>          Components: libprocess
>            Reporter: Anand Mazumdar
>              Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> Enabling them might benefit master - scheduler and master - agent 
> connections, i.e. we could detect faster when either end has failed.
> Currently, if the master process goes down and for some reason the {{RST}} 
> does not reach the scheduler, the scheduler only learns about the 
> disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are of little use in a real-world 
> application:
> {code}
> . This means that the keepalive routines wait for two hours (7200 secs) 
> before sending the first keepalive probe, and then resend it every 75 
> seconds. If no ACK response is received for nine consecutive times, the 
> connection is marked as broken.
> {code}
> However, for long-running scheduler/agent instances even the defaults can 
> still be beneficial. Also, operators might start tuning the values for their 
> clusters explicitly once we start supporting keepalives.
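
For illustration only, here is a minimal sketch of what per-socket keepalive tuning looks like at the socket-option level. It is written in Python rather than libprocess C++, and it is not the libprocess implementation. The helper names and the 60/10/3 values are hypothetical; `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux per-socket overrides and are skipped where the platform does not expose them. The first helper does the arithmetic behind the quoted defaults: roughly the idle time plus one probe interval per unanswered probe.

```python
import socket


def worst_case_detection_secs(idle, interval, probes):
    """Approximate seconds before keepalive declares a silent peer dead."""
    return idle + interval * probes


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Turn on TCP keepalive for one socket (values here are hypothetical)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide defaults. These constants are
    # Linux-specific, so only set the ones this platform provides.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    # Linux defaults quoted above: 7200 s idle, 75 s interval, 9 probes.
    print(worst_case_detection_secs(7200, 75, 9))  # 7875, about 2 h 11 min
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

With the hypothetical 60/10/3 settings the same arithmetic gives about 90 seconds, which is well under the 5 day conntrack timeout mentioned in the comment above.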



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
