[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092373#comment-15092373 ]
Joseph Wu commented on MESOS-4306:
----------------------------------

In case of random failure, even the master does not know if the machine is gone temporarily (e.g. flaky network) or permanently (e.g. machine exploded).

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
> behind a network partition for some period of time. However, frameworks and
> indeed Mesos cannot differentiate between an Agent being temporarily or
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't
> be returning. This would require human intervention, so an endpoint should be
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return
> of persistent volumes. In the case where an Agent is hosting significant data
> (multiple terabytes), the framework may be willing to wait a significant
> amount of time before repairing its replication factor (for example).
> Explicit, human-provided information about the permanent state of Agents and
> therefore their resources would allow these kinds of frameworks to accelerate
> their recovery timelines.
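
The recovery trade-off described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not Mesos code: AgentState, Replica, should_repair, and the wait timeout are hypothetical names, and the DEAD state stands in for the proposed AGENT_DEAD message, which the issue requests rather than describes as existing.

```python
from dataclasses import dataclass
from enum import Enum


class AgentState(Enum):
    REACHABLE = "reachable"
    LOST = "lost"  # SLAVE_LOST received: could be temporary or permanent
    DEAD = "dead"  # hypothetical AGENT_DEAD: an operator says it won't return


@dataclass
class Replica:
    agent_id: str
    state: AgentState
    lost_for_secs: float = 0.0  # time since the agent was reported lost


def should_repair(replica: Replica, wait_timeout_secs: float) -> bool:
    """Decide whether a framework should re-replicate this replica's data now."""
    if replica.state is AgentState.DEAD:
        # Explicit operator signal: the persistent volume will not come back,
        # so there is nothing to gain from waiting.
        return True
    if replica.state is AgentState.LOST:
        # Ambiguous loss: keep waiting for the volume until the timeout expires.
        return replica.lost_for_secs >= wait_timeout_secs
    return False


# A framework willing to wait a day for a multi-terabyte volume (assumed value).
ONE_DAY_SECS = 24 * 60 * 60
print(should_repair(Replica("agent-1", AgentState.LOST, 60.0), ONE_DAY_SECS))
print(should_repair(Replica("agent-2", AgentState.DEAD), ONE_DAY_SECS))
```

With only SLAVE_LOST, the framework has to sit out the full timeout before repairing; an explicit permanent-loss signal would let it skip the wait.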