Anand Mazumdar created MESOS-7057:
-------------------------------------

             Summary: Consider using the relink in the executor driver.
                 Key: MESOS-7057
                 URL: https://issues.apache.org/jira/browse/MESOS-7057
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.1.0, 1.0.2
            Reporter: Anand Mazumdar
            Assignee: Anand Mazumdar


As outlined in the root cause analysis for MESOS-5332, an iptables firewall can 
terminate an idle connection after a timeout (the default is 5 days). When this 
happens, the executor driver is not notified of the disconnection and continues 
to assume that it is still connected to the agent.

When the agent process is restarted, the executor still tries to reuse the old 
broken connection to send the re-register message to the agent. Only then does 
it discover that the connection is broken (due to the nature of TCP): it calls 
the {{exited}} callback and terminates itself 15 minutes later, when the 
recovery timeout expires.

To avoid this, an executor should always {{relink}} when it receives a 
reconnect request from the agent, so that the re-register message is sent over 
a fresh connection rather than the stale one.
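
The difference between reusing the cached connection and relinking can be 
sketched with a small toy model. This is only an illustration, not the driver 
code: {{Connection}}, {{LinkMode}}, {{ExecutorSketch}} and all of its methods 
are hypothetical names, and the synchronous {{send}} result stands in for a 
failure that real TCP would only report later.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for a TCP socket. `alive` models whether the path still works;
// a firewall can flip it to false without the holder being told.
struct Connection {
  bool alive = true;
};

// Hypothetical link policy: keep the cached socket, or force a new one.
enum class LinkMode { REUSE, RECONNECT };

// Hypothetical, heavily simplified executor; not the Mesos executor driver.
class ExecutorSketch {
public:
  // Establish (or re-establish) the link to the agent.
  void link(LinkMode mode) {
    if (mode == LinkMode::RECONNECT || !conn_) {
      conn_ = std::make_unique<Connection>();  // fresh socket
    }
    // With REUSE the cached socket is kept, even if it is silently dead.
  }

  // Returns false if the cached socket is dead. Real TCP would surface this
  // asynchronously; the sketch reports it immediately for clarity.
  bool send(const std::string& message) {
    assert(conn_ && "link() must be called before send()");
    (void)message;  // payload is irrelevant to the sketch
    return conn_->alive;
  }

  // Handler for the agent's reconnect request. With `relink` set, the
  // re-register message goes out over a new connection.
  bool onReconnect(bool relink) {
    link(relink ? LinkMode::RECONNECT : LinkMode::REUSE);
    return send("re-register");
  }

  // Simulate the firewall dropping the idle connection without notification.
  void dropSilently() {
    if (conn_) {
      conn_->alive = false;
    }
  }

private:
  std::unique_ptr<Connection> conn_;
};
```

In this model, an executor whose connection was dropped silently fails to 
re-register when {{onReconnect(false)}} reuses the cached socket, and succeeds 
when {{onReconnect(true)}} relinks first.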



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
