[ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7569:
--------------------------------------

    Assignee: Benjamin Mahler

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> ------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7569
>                 URL: https://issues.apache.org/jira/browse/MESOS-7569
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> Users who have executors in their cluster without the fix to MESOS-7057 may 
> see these executors destroyed whenever the agent restarts (or is upgraded).
> This occurs when these old executors have connections that have been idle 
> for more than 5 days (the default conntrack TCP timeout). At that point, the 
> connection has timed out and is no longer tracked by conntrack. From what 
> we've seen, if the agent stays up, packets still flow between the executor 
> and the agent. However, once the agent restarts, in some cases (presence of 
> a DROP rule, or some flavors of NATing), the executor does not receive the 
> RST/FIN from the kernel and will hold a half-open TCP connection. Then, when 
> the executor responds to the reconnect message from the restarted agent, its 
> half-open TCP connection closes, and the executor will be destroyed by the 
> agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message 
> is handled.
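
The retry idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Mesos code: the names OldExecutor, handle_reconnect, reconnect_with_retries and max_attempts are hypothetical, and no real sockets or timers are involved. It assumes, as the description says, that the first reply is lost on the stale half-open connection and that a later reconnect message lets the executor establish a fresh link.

```python
class OldExecutor:
    """Toy model of an executor without the MESOS-7057 fix (hypothetical)."""

    def __init__(self):
        # Stale link left over from before the agent restart.
        self.link_half_open = True
        self.linked = False

    def handle_reconnect(self):
        """Handle one reconnect message; return True if the reply reaches the agent."""
        if self.link_half_open:
            # The reply is written to the stale socket, which fails and closes it.
            self.link_half_open = False
            return False
        # No stale socket remains, so a fresh link to the agent is established.
        self.linked = True
        return True


def reconnect_with_retries(executor, max_attempts=2):
    """Send the reconnect message up to max_attempts times.

    Returns the attempt number that succeeded, or None if every attempt failed
    (the case in which the agent would give up on the executor).
    """
    for attempt in range(1, max_attempts + 1):
        if executor.handle_reconnect():
            return attempt
    return None
```

In this model a single attempt fails for an old executor, while allowing a second attempt succeeds, which is the behavior the optional retry is meant to provide.
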
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).
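
As a rough illustration of the TCP keepalive option mentioned above, the following sketch enables keepalive probes on a socket using Python's standard socket module. It is not how MESOS-7568 is implemented; the helper name enable_keepalive and the timing values are assumptions chosen for illustration. The only requirement suggested by the description is that probes start well before the 5 day conntrack timeout. The per-socket tuning options are platform-specific, so they are applied only where the module exposes them.

```python
import socket


def enable_keepalive(sock, idle_secs=600, interval_secs=60, probes=5):
    """Turn on TCP keepalive for sock (hypothetical helper, illustrative values)."""
    # SO_KEEPALIVE makes the kernel send probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is not available on every platform, so check first.
    tuning = (
        ("TCP_KEEPIDLE", idle_secs),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_secs),  # time between probes
        ("TCP_KEEPCNT", probes),           # failed probes before the link is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With an idle time far below the conntrack timeout, the periodic probes keep the connection entry from expiring while both ends are alive.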



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
