[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-11 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092373#comment-15092373
 ] 

Joseph Wu commented on MESOS-4306:
--

In the case of a random failure, even the master does not know whether the machine is gone 
temporarily (e.g. a flaky network) or permanently (e.g. the machine exploded).  

> AGENT_DEAD Message
> --
>
> Key: MESOS-4306
> URL: https://issues.apache.org/jira/browse/MESOS-4306
> Project: Mesos
> Issue Type: Task
> Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  In the case where an Agent is hosting significant data 
> (multiple terabytes), the framework may be willing to wait a significant amount of 
> time before repairing its replication factor (for example).  Explicit human 
> provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Gabriel Hartmann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088414#comment-15088414
 ] 

Gabriel Hartmann commented on MESOS-4306:
-

I see, that's fine.  As a Framework author I could refer to those endpoints to 
determine whether or not I should expect an Agent to come back.  An endpoint 
for blacklisting would be sugar, but is by no means urgent.  Thanks Joseph.



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088385#comment-15088385
 ] 

Joseph Wu commented on MESOS-4306:
--

The {{/maintenance/status}} endpoint only returns the machine's status (i.e. "the machine is 
DOWN").  But you can {{GET /maintenance/schedule}} to check if the duration 
is infinite :)
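
A minimal sketch of that check, in Python. It assumes the schedule response is JSON with a {{windows}} list, each holding {{machine_ids}} and an {{unavailability}} with a {{start}} and an optional {{duration}}, and that an omitted {{duration}} means "infinite". The hostnames and the helper name {{infinite_windows}} are hypothetical.

```python
import json


def infinite_windows(schedule):
    """Return hostnames in windows whose unavailability has no duration.

    Assumption: a window without a "duration" field is treated as
    unavailable forever.
    """
    hosts = []
    for window in schedule.get("windows", []):
        unavailability = window.get("unavailability", {})
        if "duration" not in unavailability:
            for machine in window.get("machine_ids", []):
                hosts.append(machine.get("hostname"))
    return hosts


# Hypothetical body of a GET /maintenance/schedule response.
sample = json.loads("""
{
  "windows": [
    {
      "machine_ids": [{"hostname": "agent-1", "ip": "10.0.0.1"}],
      "unavailability": {
        "start": {"nanoseconds": 1452124800000000000},
        "duration": {"nanoseconds": 3600000000000}
      }
    },
    {
      "machine_ids": [{"hostname": "agent-2", "ip": "10.0.0.2"}],
      "unavailability": {
        "start": {"nanoseconds": 1452124800000000000}
      }
    }
  ]
}
""")

print(infinite_windows(sample))
```

A framework could run this after receiving a {{SLAVE_LOST}} to decide whether waiting for the agent's persistent volumes is worthwhile.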



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088547#comment-15088547
 ] 

Qian Zhang commented on MESOS-4306:
---

I think the precondition for a framework to refer to those endpoints is that the 
operator has scheduled the machine for maintenance; only then will the machine 
appear in the responses of those endpoints. But what about the case where the 
agent goes down unexpectedly (e.g., a hardware issue or a power loss)? In this case, 
the framework will receive an {{Event::FAILURE}} but cannot know whether the agent is 
temporarily or permanently down.



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088571#comment-15088571
 ] 

Joseph Wu commented on MESOS-4306:
--

For random outages, the {{/maintenance/status}} response won't change, since only the 
operator can trigger these changes.  

When the framework goes to check the machine's status, the machine will either:
# Not show up, if it hasn't been scheduled for maintenance
# Show up as {{DRAINING}}, if it has been scheduled for maintenance, but not 
taken down by the operator yet.
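
The lookup above could be sketched in Python as follows. The response layout ({{draining_machines}} entries carrying an {{id}}, and {{down_machines}} listing machine IDs directly) is an assumption about the status endpoint's JSON, and {{machine_mode}} is a hypothetical helper name.

```python
def machine_mode(status, hostname):
    """Classify a machine from a parsed /maintenance/status response.

    Returns "DOWN", "DRAINING", or None when the machine is not listed,
    which is what a framework would see after an unplanned outage of a
    machine that was never scheduled for maintenance.
    """
    for machine in status.get("down_machines", []):
        if machine.get("hostname") == hostname:
            return "DOWN"
    for entry in status.get("draining_machines", []):
        if entry.get("id", {}).get("hostname") == hostname:
            return "DRAINING"
    return None


# Hypothetical parsed response body.
status = {
    "draining_machines": [{"id": {"hostname": "agent-2", "ip": "10.0.0.2"}}],
    "down_machines": [{"hostname": "agent-1", "ip": "10.0.0.1"}],
}

for host in ("agent-1", "agent-2", "agent-3"):
    print(host, machine_mode(status, host))
```

A {{None}} result carries no information about whether the agent will return, which is the gap discussed in this thread.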



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088014#comment-15088014
 ] 

Joseph Wu commented on MESOS-4306:
--

I don't think you need another message in this case.

With maintenance (in 0.25), an operator can set an unavailability period of 
infinity to denote the same semantics as {{AGENT_DEAD}} (or rather, 
{{AGENT_TO_BE_KILLED}}?).  The framework would be notified of this in advance 
via inverse offers.
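
As a rough illustration, an operator-side script might build such a schedule like this (Python). The field names ({{windows}}, {{machine_ids}}, {{unavailability}}, {{start}}, {{nanoseconds}}) and the convention that leaving out {{duration}} means infinity are assumptions here, and {{infinite_unavailability_schedule}} is a hypothetical helper; the payload would then be posted to the maintenance schedule endpoint.

```python
import json
import time


def infinite_unavailability_schedule(hostnames, start_ns):
    """Build a schedule payload with one open-ended maintenance window.

    Assumption: omitting "duration" from the unavailability marks the
    machines as unavailable indefinitely.
    """
    return {
        "windows": [
            {
                "machine_ids": [{"hostname": h} for h in hostnames],
                "unavailability": {"start": {"nanoseconds": start_ns}},
            }
        ]
    }


# Start the window now; the hostname is a placeholder.
payload = infinite_unavailability_schedule(
    ["agent-1"], int(time.time() * 1_000_000_000)
)
print(json.dumps(payload, indent=2))
```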

When the agent actually gets terminated (by the operator), the framework will 
see a {{SLAVE_LOST}} (in HTTP API-land, {{Event::FAILURE}}).

Would it help to add maintenance info to {{Event::FAILURE}} too?  That would cover the 
case where a machine is taken down before any inverse offers get sent.



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Gabriel Hartmann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088180#comment-15088180
 ] 

Gabriel Hartmann commented on MESOS-4306:
-

Adding the information needed to handle the case where a Framework hasn't yet 
received (and may never receive) an inverse offer seems like the right thing to do.

I'm not super familiar with the maintenance primitives.  However, the request 
here is essentially a way to blacklist a node / indicate that it's failed and 
not coming back.  If this is possible through use of the current maintenance 
primitives (or possible with the minor addition you indicated), then great.

However, it might still be nice to have an endpoint which is explicit about 
what it's going to do.  Hitting an HTTP endpoint called {{/blacklist}} is pretty 
clear.  Not a huge deal though.



[jira] [Commented] (MESOS-4306) AGENT_DEAD Message

2016-01-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088268#comment-15088268
 ] 

Joseph Wu commented on MESOS-4306:
--

Yes, this is possible.  All agents which are taken down for maintenance are 
effectively blacklisted.  If they attempt to register, they will be told to 
shut down. 

As long as the framework has access to the maintenance endpoints, it can call
{code}
GET /master/maintenance/status
{code}

This will contain a list of machines that are {{DOWN}} (temporarily or 
permanently).
