[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092373#comment-15092373 ]

Joseph Wu commented on MESOS-4306:
----------------------------------

In case of random failure, even the master does not know if the machine is gone temporarily (i.e. flaky network) or permanently (i.e. machine exploded).

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
> behind a network partition for some period of time. However frameworks and
> indeed Mesos cannot differentiate between an Agent being temporarily or
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't
> be returning. This would require human intervention so an endpoint should be
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return
> of persistent volumes. In the case where an Agent hosting significant data
> (multi terabyte) the framework may be willing to wait a significant amount of
> time before repairing its replication factor (for example). Explicit human
> provided information about the permanent state of Agents and therefore their
> resources would allow these kinds of frameworks to accelerate their recovery
> timelines.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088414#comment-15088414 ]

Gabriel Hartmann commented on MESOS-4306:
-----------------------------------------

I see, that's fine. As a Framework author I could refer to those endpoints to determine whether or not I should expect an Agent to come back. An endpoint for blacklisting would be sugar, but is by no means urgent. Thanks Joseph.
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088385#comment-15088385 ]

Joseph Wu commented on MESOS-4306:
----------------------------------

The {{/maintenance/status}} endpoint only returns the machine's state (i.e. "the machine is DOWN"). But you can {{GET /maintenance/schedule}} to check if the duration is infinite :)
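
The schedule check suggested above could look roughly like the following on the framework side. This is only a sketch, not code from Mesos: it assumes the schedule response is JSON with a top-level "windows" list, that each window carries "machine_ids" and an "unavailability" object, and that an omitted "duration" means the unavailability never ends. The function name and the example hostnames are made up for illustration.

```python
import json


def machines_down_indefinitely(schedule):
    """Return hostnames from windows whose unavailability has no duration.

    Assumption: a window without a "duration" field is open-ended, which is
    how an operator would express "this machine is not coming back".
    """
    hostnames = []
    for window in schedule.get("windows", []):
        unavailability = window.get("unavailability", {})
        if "duration" in unavailability:
            # Bounded maintenance window: the machine is expected to return.
            continue
        for machine in window.get("machine_ids", []):
            hostnames.append(machine.get("hostname"))
    return hostnames


# Hypothetical response body; field names and values are assumptions.
example = json.loads("""
{
  "windows": [
    {
      "machine_ids": [{"hostname": "agent-1.example.com", "ip": "10.0.0.1"}],
      "unavailability": {
        "start": {"nanoseconds": 1452000000000000000},
        "duration": {"nanoseconds": 3600000000000}
      }
    },
    {
      "machine_ids": [{"hostname": "agent-2.example.com", "ip": "10.0.0.2"}],
      "unavailability": {"start": {"nanoseconds": 1452000000000000000}}
    }
  ]
}
""")

print(machines_down_indefinitely(example))
```

A framework waiting on persistent volumes could use a result like this to decide that a lost agent in the returned list is not worth waiting for.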
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088547#comment-15088547 ]

Qian Zhang commented on MESOS-4306:
-----------------------------------

I think the precondition for a framework to refer to those endpoints is that the machine has been scheduled for maintenance by an operator; only then will you see the machine in the responses of those endpoints. But what about the case where the agent goes down by accident (e.g., a hardware issue or power loss)? In this case, the framework will receive an {{Event::FAILURE}} but cannot know whether the agent is temporarily or permanently down.
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088571#comment-15088571 ]

Joseph Wu commented on MESOS-4306:
----------------------------------

For random outages, the {{/maintenance/status}} won't change, since only the operator can trigger these changes.

When the framework goes to check the machine's status, the machine will either:
# Not show up, if it hasn't been scheduled for maintenance
# Show up as {{DRAINING}}, if it has been scheduled for maintenance, but not taken down by the operator yet.
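
The cases listed above can be sketched as a small lookup over a parsed status response. This is an illustration under assumptions, not the Mesos API: it assumes the response is JSON with "draining_machines" and "down_machines" lists keyed by hostname, and it tolerates draining entries that either are a bare machine ID or wrap it under "id". The function name and hostnames are hypothetical.

```python
def maintenance_mode(status, hostname):
    """Return "DOWN", "DRAINING", or None for the given hostname.

    None means the machine does not appear in the response at all, i.e. it
    has not been scheduled for maintenance, so a failure there is a random
    outage about which the maintenance endpoints say nothing.
    """
    for machine in status.get("down_machines", []):
        if machine.get("hostname") == hostname:
            return "DOWN"
    for entry in status.get("draining_machines", []):
        # Assumption: an entry is either the machine ID itself or {"id": ...}.
        machine = entry.get("id", entry)
        if machine.get("hostname") == hostname:
            return "DRAINING"
    return None


# Hypothetical parsed response; the shape is an assumption.
example_status = {
    "draining_machines": [{"id": {"hostname": "agent-1.example.com"}}],
    "down_machines": [{"hostname": "agent-2.example.com"}],
}

for host in ("agent-1.example.com", "agent-2.example.com", "agent-3.example.com"):
    print(host, maintenance_mode(example_status, host))
```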
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088014#comment-15088014 ]

Joseph Wu commented on MESOS-4306:
----------------------------------

I don't think you need another message in this case.

With maintenance (in 0.25), an operator can set an unavailability period of infinity to denote the same semantics as {{AGENT_DEAD}} (or rather, {{AGENT_TO_BE_KILLED}}?). The framework would be notified of this in advance via inverse offers. When the agent actually gets terminated (by the operator), the framework will see a {{SLAVE_LOST}} (in HTTP API-land, {{Event::FAILURE}}).

Would it help to add maintenance info to {{Event::FAILURE}} too? i.e. in case a machine is taken down before any inverse offers get sent.
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088180#comment-15088180 ]

Gabriel Hartmann commented on MESOS-4306:
-----------------------------------------

Adding the information needed to handle the case where a Framework hasn't yet received (and may never receive) an inverse Offer seems like the right thing to do.

I'm not super familiar with the maintenance primitives. However, the request here is essentially a way to blacklist a node / indicate that it's failed and not coming back. If this is possible through use of the current maintenance primitives (or possible with the minor addition you indicated), then great.

However, it might still be nice to have an endpoint which is explicit about what it's going to do. Hitting an HTTP endpoint called /blacklist is pretty clear. Not a huge deal though.
[jira] [Commented] (MESOS-4306) AGENT_DEAD Message
[ https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088268#comment-15088268 ]

Joseph Wu commented on MESOS-4306:
----------------------------------

Yes, this is possible. All agents which are taken down for maintenance are effectively blacklisted. If they attempt to register, they will be told to shut down.

As long as the framework has access to the maintenance endpoints, it can call
{code}
GET /master/maintenance/status
{code}
This will contain a list of machines that are {{DOWN}} (temporarily or permanently).
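
For a framework author, issuing that request might look something like the sketch below. Only the path {{/master/maintenance/status}} comes from the comment above; the plain-HTTP scheme, the JSON response body, the helper names, and the `host:port` argument are assumptions for illustration. The `opener` parameter exists only so the call can be exercised without a running master.

```python
import json
from urllib.request import urlopen


def maintenance_status_url(master):
    """Build the status URL for a master given as "host:port" (assumed HTTP)."""
    return "http://" + master + "/master/maintenance/status"


def fetch_maintenance_status(master, opener=urlopen):
    """GET the maintenance status and return the parsed JSON body.

    Assumption: the endpoint answers with a JSON document. `opener` defaults
    to urllib's urlopen and can be swapped for a stub in tests.
    """
    with opener(maintenance_status_url(master)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would pass its master address, for example `fetch_maintenance_status("master.example.com:5050")`, where the hostname and port are placeholders.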