[ https://issues.apache.org/jira/browse/MESOS-7057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876592#comment-15876592 ]
Stephan Erb commented on MESOS-7057:
------------------------------------

Thanks for fixing this! :-)

> Consider using the relink functionality of libprocess in the executor driver.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7057
>                 URL: https://issues.apache.org/jira/browse/MESOS-7057
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Anand Mazumdar
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>             Fix For: 1.2.0
>
>
> As outlined in the root cause analysis for MESOS-5332, it is possible for an
> iptables firewall to terminate an idle connection after a timeout (the
> default is 5 days). Once this happens, the executor driver is not notified of
> the disconnection. It keeps on thinking that it is still connected to the
> agent.
> When the agent process is restarted, the executor still tries to re-use the
> old broken connection to send the re-register message to the agent. Only
> then does it realize that the connection is broken (due to the nature
> of TCP). It calls the {{exited}} callback and commits suicide in 15 minutes
> upon the recovery timeout.
> To offset this, an executor should always {{relink}} when it receives a
> reconnect request from the agent.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
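
The failure mode described in the issue above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Mesos or libprocess implementation: the classes `Agent`, `Connection` and `Executor`, the `broken` flag, and the `use_relink` parameter are all hypothetical names invented for the illustration. It models a connection that was dropped silently, and compares reusing it (`link`) with forcing a fresh one (`relink`) when a reconnect request arrives.

```python
class Agent:
    """Hypothetical stand-in for the agent; it only records delivered messages."""

    def __init__(self):
        self.received = []


class Connection:
    """Hypothetical stand-in for a persistent TCP connection."""

    def __init__(self, agent):
        self.agent = agent
        # Set to True to model a firewall dropping the idle connection
        # without notifying either endpoint.
        self.broken = False

    def send(self, message):
        # The breakage is only discovered when something is actually sent.
        if self.broken:
            raise ConnectionError("stale connection")
        self.agent.received.append(message)


class Executor:
    """Hypothetical stand-in for the executor driver."""

    def __init__(self, agent):
        self.agent = agent
        self.connection = Connection(agent)
        self.exited_called = False

    def link(self):
        # Reuse the existing connection if there is one; no health check.
        if self.connection is None:
            self.connection = Connection(self.agent)

    def relink(self):
        # Always discard the old connection and open a fresh one.
        self.connection = Connection(self.agent)

    def on_reconnect_request(self, use_relink):
        if use_relink:
            self.relink()
        else:
            self.link()
        try:
            self.connection.send("re-register")
        except ConnectionError:
            # Models the `exited` callback firing on the broken link.
            self.exited_called = True


def simulate(use_relink):
    agent = Agent()
    executor = Executor(agent)
    executor.connection.broken = True  # idle connection silently dropped
    executor.on_reconnect_request(use_relink)
    return agent.received, executor.exited_called
```

In this model, `simulate(use_relink=False)` loses the re-register message and triggers the `exited` path, while `simulate(use_relink=True)` delivers it over the new connection.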