[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701042#comment-16701042 ]
Joseph Wu commented on MESOS-7564:
----------------------------------

I guess I'll summarize a bit of the discussion that happened in the API WG.

The current plan is to add some regular traffic to any persistent connections between agent and executor, so that the connection does not get marked "stale". We want to make a minimal change first, to maintain backwards compatibility between new/old agents and new/old executors.

Since there are two persistent connections, we want to add Heartbeat Events from Agent to Executor, and Heartbeat Calls from Executor to Agent. Neither agent nor executor will expect heartbeats (i.e. they won't disconnect if heartbeats don't appear).

Unfortunately, in the case of old agents/executors, when they receive an unknown Call/Event, they will log a warning.

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication.
> This is especially problematic in scenarios when IPFilters are enabled since
> the default conntrack keep alive timeout is 5 days. When that timeout
> elapses, the executor doesn't get notified via a socket disconnection when
> the agent process restarts. The executor would then get killed if it doesn't
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible
> way for fixing this issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
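
For illustration only, here is a rough Python sketch of the compatibility behavior described in the comment: a receiver that knows about heartbeats silently accepts them, while an older receiver treats them as an unknown type and only logs a warning, without disconnecting. This is not Mesos code. The function `handle_event`, the dictionary-based message shape, the two type sets, and the `HEARTBEAT` label are all hypothetical names chosen for the example.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")

# Hypothetical type sets: an "old" receiver does not know about heartbeats,
# a "new" receiver does. The type names are illustrative labels only.
OLD_KNOWN_TYPES = {"SUBSCRIBED", "LAUNCH"}
NEW_KNOWN_TYPES = OLD_KNOWN_TYPES | {"HEARTBEAT"}


def handle_event(event, known_types):
    """Classify one incoming message as 'handled', 'ignored', or 'unknown'.

    No branch closes the connection: heartbeats are never required, and an
    unrecognized type results only in a logged warning.
    """
    event_type = event.get("type")
    if event_type not in known_types:
        # Old receiver seeing a new message type: warn and carry on.
        logger.warning("Received unknown message type: %s", event_type)
        return "unknown"
    if event_type == "HEARTBEAT":
        # The heartbeat exists only to generate traffic on the connection.
        return "ignored"
    return "handled"
```

With these assumptions, `handle_event({"type": "HEARTBEAT"}, NEW_KNOWN_TYPES)` returns `"ignored"`, and the same message passed with `OLD_KNOWN_TYPES` returns `"unknown"` after logging a warning.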