[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

Joseph Wu (JIRA) Tue, 27 Nov 2018 13:19:08 -0800


    [ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701042#comment-16701042 ]


Joseph Wu commented on MESOS-7564:
----------------------------------

I'll summarize a bit of the discussion that happened in the API WG.

The current plan is to add some regular traffic to each persistent connection 
between agent and executor, so that the connection does not get marked "stale". 
We want to make a minimal change first, to maintain backwards compatibility 
between new/old agents and new/old executors. Since there are two persistent 
connections, we want to add Heartbeat Events from Agent to Executor, and 
Heartbeat Calls from Executor to Agent. Neither agent nor executor will expect 
heartbeats (i.e. they won't disconnect if heartbeats don't appear). 
Unfortunately, old agents/executors will log a warning whenever they receive 
a Call/Event type they do not recognize, which includes these heartbeats.
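
To make the compatibility behavior above concrete, here is a rough sketch in 
Python. It is not the Mesos implementation. The function `handle_message`, the 
sets `OLD_TYPES` and `NEW_TYPES`, and the dict-shaped messages are all 
hypothetical, and the listed types are only an illustrative subset. It shows a 
new receiver treating a heartbeat as a no-op, and an old receiver logging a 
warning and carrying on.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")

# Hypothetical, illustrative subset of message types. The real v1
# executor API defines its own Call/Event enums.
OLD_TYPES = {"SUBSCRIBED", "LAUNCH", "KILL"}
NEW_TYPES = OLD_TYPES | {"HEARTBEAT"}


def handle_message(message, known_types):
    """Return 'processed', 'heartbeat', or 'ignored' for one message."""
    kind = message.get("type")
    if kind not in known_types:
        # Old agents/executors: an unknown Call/Event is logged as a
        # warning but does not break the connection.
        logger.warning("Ignoring unknown message type: %s", kind)
        return "ignored"
    if kind == "HEARTBEAT":
        # Heartbeats only generate traffic on the connection. No timer
        # or disconnect decision depends on receiving them.
        return "heartbeat"
    return "processed"
```

For example, `handle_message({"type": "HEARTBEAT"}, OLD_TYPES)` takes the 
warning path, while the same message with `NEW_TYPES` is accepted silently.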

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7564
>                 URL: https://issues.apache.org/jira/browse/MESOS-7564
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Anand Mazumdar
>            Assignee: Joseph Wu
>            Priority: Critical
>              Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, since 
> the default conntrack keepalive timeout is 5 days. Once that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is a possible 
> way to fix this issue.
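
For the TCP keepalive alternative mentioned in the description, a minimal 
sketch of enabling keepalive probes on a socket might look like the following. 
This is generic Python socket code, not anything taken from Mesos. The helper 
name `enable_tcp_keepalive` and the default values for `idle`, `interval`, and 
`count` are assumptions chosen for illustration.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an already created socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform specific, so each one
    # is only set where the socket module exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

The idle time would need to be well below the conntrack timeout for the 
probes to keep the tracked connection from expiring.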



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
