[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090847#comment-15090847 ]
Qian Zhang commented on MESOS-4306:
-----------------------------------

Right, so for case 1, the machine will not show up in /maintenance/status. How, then, can a framework know whether the agent is temporarily or permanently down?

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
> behind a network partition for some period of time. However, frameworks and
> indeed Mesos cannot differentiate between an Agent being temporarily lost
> and permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't
> be returning. This would require human intervention, so an endpoint should be
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return
> of persistent volumes. In the case where an Agent hosts significant data
> (multiple terabytes), the framework may be willing to wait a significant
> amount of time before repairing its replication factor (for example).
> Explicit, human-provided information about the permanent state of Agents, and
> therefore their resources, would allow these kinds of frameworks to
> accelerate their recovery timelines.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)