[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

Gabriel Hartmann (JIRA) Thu, 07 Jan 2016 13:47:02 -0800

    [ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088180#comment-15088180 ]


Gabriel Hartmann commented on MESOS-4306:
-----------------------------------------

Adding the information needed to handle the case where a framework hasn't yet
received (and may never receive) an inverse offer seems like the right thing
to do.

I'm not very familiar with the maintenance primitives.  However, the request
here is essentially for a way to blacklist a node, i.e. to indicate that it
has failed and is not coming back.  If this is possible with the current
maintenance primitives (or with the minor addition you indicated), then great.

However, it might still be nice to have an endpoint which is explicit about
what it's going to do.  Hitting an HTTP endpoint called /blacklist is pretty
clear.  Not a huge deal though.

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
> behind a network partition for some period of time.  However, frameworks,
> and indeed Mesos itself, cannot differentiate between an Agent being
> temporarily lost and being permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't
> be returning.  This would require human intervention, so an endpoint should
> be exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return
> of persistent volumes.  In the case where an Agent is hosting a significant
> amount of data (multiple terabytes), the framework may be willing to wait a
> significant amount of time before repairing its replication factor (for
> example).  Explicit, human-provided information about the permanent state of
> Agents, and therefore of their resources, would allow these kinds of
> frameworks to accelerate their recovery timelines.
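
To illustrate the recovery behaviour described in the issue, here is a minimal
Python sketch of how a framework might treat the two messages differently.
AGENT_DEAD is the message proposed in this issue and does not exist in Mesos;
the ReplicaRecoveryPolicy class, the should_repair method, and the six-hour
grace period are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AgentEvent(Enum):
    SLAVE_LOST = "SLAVE_LOST"  # existing message: agent failed or is partitioned
    AGENT_DEAD = "AGENT_DEAD"  # proposed message (hypothetical): agent will not return


@dataclass
class ReplicaRecoveryPolicy:
    # Grace period before re-replicating after SLAVE_LOST (assumed value).
    wait_seconds: float = 6 * 60 * 60

    def should_repair(self, event: AgentEvent, seconds_since_lost: float) -> bool:
        """Return True when the framework should rebuild replicas elsewhere."""
        if event is AgentEvent.AGENT_DEAD:
            # An operator has declared the agent permanently gone: repair now.
            return True
        # Only SLAVE_LOST is known: the agent may still come back with its
        # persistent volumes, so wait out the grace period first.
        return seconds_since_lost >= self.wait_seconds
```

With only SLAVE_LOST, the framework has to guess how long to wait; an explicit
signal from an operator lets it skip the grace period entirely.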



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
