[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087978#comment-15087978 ]
Gabriel Hartmann commented on MESOS-4306:
-----------------------------------------

Yes, I thought it might overlap with the maintenance primitives as well. You're saying this as a proposed implementation method, correct? The AGENT_DEAD message / endpoint would use the same implementation as the maintenance primitives, right?

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
> behind a network partition for some period of time. However, neither
> frameworks nor Mesos itself can differentiate between an Agent being
> temporarily lost and one being permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't
> be returning. This would require human intervention, so an endpoint should be
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return
> of persistent volumes. When an Agent hosts a significant amount of data
> (multi-terabyte), the framework may be willing to wait a significant amount of
> time before repairing its replication factor (for example). Explicit,
> human-provided information about the permanent state of Agents, and therefore
> their resources, would allow these kinds of frameworks to accelerate their
> recovery timelines.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
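
For illustration only, here is a minimal sketch of how a framework might act on the distinction described in the issue. It is not Mesos code: AGENT_DEAD is only the proposed message, and `AgentEvent`, `RecoveryPolicy`, `should_repair`, and the grace period value are hypothetical names and numbers chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class AgentEvent(Enum):
    # Existing message: the Agent failed or is partitioned; it may come back.
    SLAVE_LOST = "SLAVE_LOST"
    # Proposed message (does not exist yet): an operator declared the Agent gone for good.
    AGENT_DEAD = "AGENT_DEAD"


@dataclass
class RecoveryPolicy:
    # Hypothetical time a framework waits for persistent volumes to return.
    grace_period_secs: float = 6 * 60 * 60

    def should_repair(self, event: AgentEvent, secs_since_lost: float) -> bool:
        """Decide whether to start repairing the replication factor now."""
        if event is AgentEvent.AGENT_DEAD:
            # Explicit operator signal: waiting any longer cannot help.
            return True
        # Only SLAVE_LOST so far: keep waiting until the grace period expires.
        return secs_since_lost >= self.grace_period_secs
```

With only SLAVE_LOST, such a framework would have to sit out the whole grace period; with an explicit AGENT_DEAD it could begin re-replication immediately.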