[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687389#comment-16687389 ]
Joseph Wu commented on MESOS-7564:
----------------------------------

Historically, we've considered the agent<->executor connection to be reliable. This is evident when you look at the agent's lack of handling for executor disconnections.

Currently, if an HTTP executor successfully registers, and then closes its connection, the agent will consider the executor "RUNNING". The agent will then merrily send all sorts of messages over the broken connection (and onto the floor), including LaunchTask messages. The agent might log warnings, but it does not attempt to reconnect (it can't).
(The PID executor does not have this problem, because libprocess will make transient connections to send messages if the persistent connection breaks.)

If we are considering the agent<->executor connection to be unreliable, we first need to add/test logic to handle executor disconnections. I believe it may be sufficient to detect (even belatedly) disconnections on the agent, and transition the agent's view of the executor from RUNNING to REGISTERING and start the registration timeout. This would only be necessary for HTTP executors.

-----

Next to handle cases where the connection is "connected" but dropping packets... We will probably want to add heartbeats in both directions.

Just on the HTTP executor library, we have two connections to consider:
1) The SUBSCRIBE Call is one persistent connection where the executor sends one Call, and receives a stream of Events. There is currently no Executor->Agent traffic except the first request. This connection could probably use heartbeating in both directions. Agent->Executor heartbeats may come in the form of Events. Executor->Agent heartbeats will need to be something else (like the heartbeating suggested here: https://reviews.apache.org/r/69183/ ).
2) Other calls go through a secondary connection. This persistent connection is used to send any number of Calls and their subsequent responses (202 Accepted) back. When the executor discovers a disconnection here, it remakes both connections. This connection does not need heartbeating or monitoring.

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication.
> This is especially problematic in scenarios when IPFilters are enabled since
> the default conntrack keep alive timeout is 5 days. When that timeout
> elapses, the executor doesn't get notified via a socket disconnection when
> the agent process restarts. The executor would then get killed if it doesn't
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible
> way for fixing this issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
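
For illustration only, the agent-side handling proposed in the comment (detect a disconnection, possibly belatedly via missed heartbeats, move the agent's view of the executor from RUNNING back to REGISTERING, and start the registration timeout) might be sketched roughly as below. This is not Mesos code: `ExecutorView`, its method names, the timeout parameters, and the `TERMINATED` outcome when the timeout expires are all hypothetical, and time is passed in explicitly so the logic can be exercised without a real clock.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExecutorState(Enum):
    REGISTERING = "REGISTERING"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"  # assumption: what the agent does once the timeout expires


@dataclass
class ExecutorView:
    """Hypothetical agent-side record of one HTTP executor (not a Mesos class)."""

    heartbeat_timeout: float      # seconds of silence before assuming a disconnect
    registration_timeout: float   # seconds the executor gets to (re-)subscribe
    state: ExecutorState = ExecutorState.REGISTERING
    last_heartbeat: float = 0.0
    # The initial registration deadline is omitted here; only the
    # re-registration deadline after a disconnect is modeled.
    registration_deadline: Optional[float] = None

    def on_subscribe(self, now: float) -> None:
        # A (re-)subscription makes the executor RUNNING and clears any deadline.
        self.state = ExecutorState.RUNNING
        self.last_heartbeat = now
        self.registration_deadline = None

    def on_heartbeat(self, now: float) -> None:
        # Executor->Agent heartbeat; only meaningful while RUNNING.
        if self.state is ExecutorState.RUNNING:
            self.last_heartbeat = now

    def on_disconnect(self, now: float) -> None:
        # RUNNING -> REGISTERING, and start the registration timeout.
        if self.state is ExecutorState.RUNNING:
            self.state = ExecutorState.REGISTERING
            self.registration_deadline = now + self.registration_timeout

    def tick(self, now: float) -> None:
        # Periodic check: treat prolonged silence as a belated disconnect,
        # and give up on an executor that never re-subscribes in time.
        if self.state is ExecutorState.RUNNING:
            if now - self.last_heartbeat > self.heartbeat_timeout:
                self.on_disconnect(now)
        elif (
            self.state is ExecutorState.REGISTERING
            and self.registration_deadline is not None
            and now >= self.registration_deadline
        ):
            self.state = ExecutorState.TERMINATED
```

With a view in this shape, the agent could check the state before sending something like a LaunchTask message and hold or reject it while the executor is REGISTERING, rather than writing to a broken connection.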