[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-7569:
-----------------------------------
    Fix Version/s: 1.2.2

> Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
> ------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7569
>                 URL: https://issues.apache.org/jira/browse/MESOS-7569
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 may
> see these executors destroyed whenever the agent restarts (or is upgraded).
>
> This occurs when these old executors have connections that have been idle
> for more than 5 days (the default conntrack TCP timeout). At that point, the
> connection has timed out and is no longer tracked by conntrack. From what
> we've seen, if the agent stays up, packets still flow between the executor
> and the agent. However, once the agent restarts, in some cases (presence of
> a DROP rule, or some flavors of NATing), the executor does not receive the
> RST/FIN from the kernel and holds a half-open TCP connection. When the
> executor then responds to the reconnect message from the restarted agent,
> its half-open TCP connection closes, and the executor is destroyed by the
> agent.
>
> To allow users to preserve the tasks running in these "old" executors (i.e.
> those without the MESOS-7057 fix), we can add *optional* retrying of the
> reconnect message in the agent. This allows the old executor to correctly
> establish a link to the agent when the second reconnect message is handled.
>
> Longer term, heartbeating or TCP keepalives will prevent the connections
> from reaching the conntrack timeout (see MESOS-7568).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
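
The retry behaviour described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Mesos implementation: the names OldExecutor, handle_reconnect and reconnect are hypothetical, the connection is reduced to a string state, and the timing between retries is left out. It shows why a single reconnect message loses an executor that holds a half-open connection, while a second message lets it establish a fresh link.

```python
from dataclasses import dataclass


@dataclass
class OldExecutor:
    """Toy model of an executor without the MESOS-7057 fix (hypothetical)."""

    # "half-open": stale connection to the agent process that has restarted.
    link: str = "half-open"
    reregistered: bool = False

    def handle_reconnect(self) -> bool:
        """Handle one reconnect message; return True if the reply got through."""
        if self.link == "half-open":
            # The reply goes out over the stale socket, which then closes,
            # so the agent never sees the executor re-register.
            self.link = "closed"
            return False
        # No stale socket remains, so a fresh link to the agent is established.
        self.link = "established"
        self.reregistered = True
        return True


def reconnect(executor: OldExecutor, attempts: int):
    """Send up to `attempts` reconnect messages (hypothetical agent-side loop).

    Returns the attempt number that succeeded, or None if every attempt
    failed, in which case the agent would destroy the executor.
    """
    for attempt in range(1, attempts + 1):
        if executor.handle_reconnect():
            return attempt
    return None


if __name__ == "__main__":
    print("without retry:", reconnect(OldExecutor(), attempts=1))
    print("with one retry:", reconnect(OldExecutor(), attempts=2))
```

In this model the first message only serves to flush out the stale socket; the retry is what actually gets the executor reconnected, which matches the "second reconnect message" wording in the description.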