[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2017-05-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025600#comment-16025600
 ] 

Benjamin Mahler commented on MESOS-5361:


Linking in the executor-related tickets that came up because conntrack 
considers connections stale after 5 days.

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> Enabling them might benefit master - scheduler and master - agent 
> connections, i.e. we could detect sooner if any of them has failed.
> Currently, if the master process goes down and for some reason the {{RST}} 
> does not reach the scheduler, the scheduler only learns about the 
> disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are of little use in a real-world 
> application:
> {code}
> . This means that the keepalive routines wait for two hours (7200 secs) 
> before sending the first keepalive probe, and then resend it every 75 
> seconds. If no ACK response is received for nine consecutive times, the 
> connection is marked as broken.
> {code}
> However, for long running instances of scheduler/agent this still can be 
> beneficial. Also, operators might start tuning the values for their clusters 
> explicitly once we start supporting it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2016-05-14 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283605#comment-15283605
 ] 

haosdent commented on MESOS-5361:
-

I see. XD






[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2016-05-14 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283593#comment-15283593
 ] 

Anand Mazumdar commented on MESOS-5361:
---

- +1 
- I called them a joke because the values are of little use in a real-world 
application, not as a comment on Linux's implementation of the age-old RFC.

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> Enabling them might benefit master - scheduler and master - agent 
> connections, i.e. we could detect sooner if any of them has failed.
> Currently, if the master process goes down and for some reason the {{RST}} 
> does not reach the scheduler, the scheduler only learns about the 
> disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are a joke though:
> {code}
> . This means that the keepalive routines wait for two hours (7200 secs) 
> before sending the first keepalive probe, and then resend it every 75 
> seconds. If no ACK response is received for nine consecutive times, the 
> connection is marked as broken.
> {code}
> However, for long running instances of scheduler/agent this still can be 
> beneficial. Also, operators might start tuning the values for their clusters 
> explicitly once we start supporting it.





[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2016-05-14 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283537#comment-15283537
 ] 

haosdent commented on MESOS-5361:
-

{quote}
 Also, operators might start tuning the values for their clusters explicitly 
once we start supporting it.
{quote}

We could set lower values for {{TCP_KEEPIDLE}}, {{TCP_KEEPINTVL}} and 
{{TCP_KEEPCNT}} when creating the socket, instead of requiring operators to 
change the Linux default configuration. 
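
To illustrate the per-socket approach, here is a minimal sketch in Python rather than libprocess's C++. The function name {{enable_keepalive}} and the values 60/10/3 are hypothetical and not taken from the ticket; the {{TCP_KEEP*}} constants are platform-specific (available on Linux), so the sketch skips any that are missing.

```python
import socket


def enable_keepalive(sock, idle_secs=60, interval_secs=10, probe_count=3):
    """Turn on TCP keepalive for one socket and, where supported, tune it.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    # SO_KEEPALIVE must be enabled explicitly; keepalives are off by default.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide settings. Skip any option
    # that this platform's socket module does not expose.
    for name, value in (
        ("TCP_KEEPIDLE", idle_secs),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_secs),  # gap between unanswered probes
        ("TCP_KEEPCNT", probe_count),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Because the options are set on the socket itself, the system-wide sysctl defaults stay untouched and other applications on the host are unaffected.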

{quote}
The default TCP keep alive values on Linux are a joke though
{quote}

Actually it is not a joke: RFC 1122 requires the keep-alive interval to 
default to no less than two hours, whereas in the real world we use a few 
seconds instead...

{code}
4.2.3.6  TCP Keep-Alives

Implementors MAY include "keep-alives" in their TCP
implementations, although this practice is not universally
accepted.  If keep-alives are included, the application MUST
be able to turn them on or off for each TCP connection, and
they MUST default to off.

Keep-alive packets MUST only be sent when no data or
acknowledgement packets have been received for the
connection within an interval.  This interval MUST be
configurable and MUST default to no less than two hours.
{code}
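
For a sense of scale, the arithmetic can be sketched as follows. The first call uses the Linux defaults quoted in the issue description (7200 s idle, 75 s between probes, 9 probes); the second uses made-up tuned values chosen only for comparison. The result is a rough figure: idle time plus one probe interval per unanswered probe.

```python
def detection_time_secs(idle_secs, interval_secs, probe_count):
    """Approximate seconds from the last traffic until a dead peer is noticed."""
    return idle_secs + interval_secs * probe_count


# Linux defaults from the issue description: about 2 hours 11 minutes.
linux_default = detection_time_secs(7200, 75, 9)

# Hypothetical tuned values, for comparison only.
tuned = detection_time_secs(60, 10, 3)

print(linux_default, tuned)  # prints "7875 90"
```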



