Greg Mann created MESOS-9818:
--------------------------------
Summary: Implement agent-side handling of automatic draining
Key: MESOS-9818
URL: https://issues.apache.org/jira/browse/MESOS-9818
Project: Mesos
Issue Type: Task
Components: agent
Reporter: Greg Mann
The agent needs to be updated to handle automatic draining. This includes the
following:
The agent will have a new handler for the ‘DrainSlaveMessage’:
* ‘Slave::drain()’: checkpoint the drain info
* ‘Slave::_drain()’: send KILL events for all tasks, with a kill policy
whose grace period is the minimum of the task’s kill grace period and
max_grace_period
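The grace-period rule can be sketched as follows. This is a minimal,
self-contained illustration and not the agent’s actual code: the ‘Task’ and
‘KillEvent’ structs and the ‘buildDrainKills()’ helper are hypothetical names
introduced here, and every task is assumed to carry an explicit kill grace
period.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

using Seconds = std::chrono::seconds;

// Hypothetical stand-ins for the agent's task and KILL event types.
struct Task {
  std::string id;
  Seconds killGracePeriod;  // assumed to always be set in this sketch
};

struct KillEvent {
  std::string taskId;
  Seconds gracePeriod;
};

// Builds one KILL event per task, capping each task's own grace period
// at the drain's max_grace_period.
std::vector<KillEvent> buildDrainKills(
    const std::vector<Task>& tasks,
    Seconds maxGracePeriod)
{
  // A negative cap is treated as a caller error in this sketch.
  assert(maxGracePeriod >= Seconds::zero());

  std::vector<KillEvent> kills;
  kills.reserve(tasks.size());

  for (const Task& task : tasks) {
    kills.push_back(
        KillEvent{task.id, std::min(task.killGracePeriod, maxGracePeriod)});
  }

  return kills;
}
```

With a max_grace_period of 10 seconds, a task asking for 30 seconds would be
killed with a 10 second grace period, while a task asking for 5 seconds would
keep its 5 seconds.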
The agent’s ‘statusUpdate()’ handler will be updated:
* TASK_KILLED states will be rewritten as TASK_GONE_BY_OPERATOR when the
agent is draining and is being decommissioned
* The AGENT_DRAINING reason will be inserted into all TASK_KILLING,
TASK_KILLED, and TASK_GONE_BY_OPERATOR updates when the agent is draining
* The modified status updates will be checkpointed (instead of the original
ones)
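The rewriting rules above could look roughly like the sketch below. The enums
are reduced to the states needed here, and ‘StatusUpdate’, ‘DrainInfo’, its
‘markGone’ flag (standing in for "the agent is being decommissioned"), and
‘rewriteForDrain()’ are hypothetical names chosen for illustration. The
returned update is the one that would be checkpointed.

```cpp
#include <cassert>

// Reduced, illustrative subsets of the task states and status reasons.
enum class TaskState {
  TASK_RUNNING,
  TASK_KILLING,
  TASK_KILLED,
  TASK_GONE_BY_OPERATOR,
  TASK_FINISHED
};

enum class Reason { NONE, AGENT_DRAINING };

struct StatusUpdate {
  TaskState state;
  Reason reason;
};

// Hypothetical drain info; `markGone` means the agent is being decommissioned.
struct DrainInfo {
  bool markGone;
};

// Returns the update that should be checkpointed and forwarded.
// `drainInfo` is null when the agent is not draining.
StatusUpdate rewriteForDrain(StatusUpdate update, const DrainInfo* drainInfo)
{
  if (drainInfo == nullptr) {
    return update;  // Not draining: leave the update untouched.
  }

  // Rule 1: TASK_KILLED becomes TASK_GONE_BY_OPERATOR when decommissioning.
  if (drainInfo->markGone && update.state == TaskState::TASK_KILLED) {
    update.state = TaskState::TASK_GONE_BY_OPERATOR;
  }

  // Rule 2: tag kill-related updates with the AGENT_DRAINING reason.
  if (update.state == TaskState::TASK_KILLING ||
      update.state == TaskState::TASK_KILLED ||
      update.state == TaskState::TASK_GONE_BY_OPERATOR) {
    update.reason = Reason::AGENT_DRAINING;
  }

  // No TASK_KILLED update should leave a decommissioning agent.
  assert(!(drainInfo->markGone && update.state == TaskState::TASK_KILLED));

  return update;
}
```

Applying the state rewrite before the reason is set means a single check covers
both the original and the rewritten state.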
The agent’s recovery code will be updated to ensure that draining continues
correctly after a failover:
* If the agent is currently draining, it will loop through all tasks and send
KILL events for any task whose latest state is neither terminal nor
TASK_KILLING.
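A sketch of the recovery-time selection follows. The ‘TaskState’ enum is a
reduced subset, and ‘Task’, ‘isTerminal()’, and ‘tasksToKillOnRecovery()’ are
hypothetical names used only for illustration.

```cpp
#include <string>
#include <vector>

// Reduced, illustrative subset of task states.
enum class TaskState {
  TASK_STAGING,
  TASK_RUNNING,
  TASK_KILLING,
  TASK_KILLED,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_GONE_BY_OPERATOR
};

struct Task {
  std::string id;
  TaskState latestState;
};

// Terminal states within the reduced subset above.
bool isTerminal(TaskState state)
{
  switch (state) {
    case TaskState::TASK_KILLED:
    case TaskState::TASK_FINISHED:
    case TaskState::TASK_FAILED:
    case TaskState::TASK_GONE_BY_OPERATOR:
      return true;
    default:
      return false;
  }
}

// Returns the IDs of tasks that still need a KILL event after recovery.
// Tasks that are already terminal or already being killed are skipped.
std::vector<std::string> tasksToKillOnRecovery(
    const std::vector<Task>& tasks,
    bool draining)
{
  std::vector<std::string> toKill;

  if (!draining) {
    return toKill;  // Nothing to do unless a drain is in progress.
  }

  for (const Task& task : tasks) {
    if (!isTerminal(task.latestState) &&
        task.latestState != TaskState::TASK_KILLING) {
      toKill.push_back(task.id);
    }
  }

  return toKill;
}
```

Skipping TASK_KILLING tasks avoids sending a second KILL to a task whose kill
was already initiated before the failover.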
The agent’s reregistration code will be updated to include the drain info in
the ‘ReregisterSlaveMessage’.
The agent’s v0 ‘/state’ endpoint handler will be updated to include the drain
info.
The agent’s ‘_statusUpdateAcknowledgement()’ and
‘operationStatusAcknowledgement()’ handlers will be updated to check whether
any active tasks or operations remain on the agent. If none remain and the
agent is currently draining, the agent will remove the drain info from disk
and return to the normal, non-draining state.
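The completion check can be illustrated with the sketch below. ‘AgentState’ and
‘maybeFinishDraining()’ are hypothetical names, and the ‘drainInfoOnDisk’ flag
merely stands in for the checkpointed drain info. It is a simplification, not
the agent’s real bookkeeping.

```cpp
#include <cstddef>

// Hypothetical, simplified view of the agent state relevant to draining.
struct AgentState {
  bool draining;
  bool drainInfoOnDisk;  // stands in for the checkpointed drain info
  std::size_t activeTasks;
  std::size_t activeOperations;
};

// Intended to be called from both acknowledgement handlers. Returns true
// if this call moved the agent back to the non-draining state.
bool maybeFinishDraining(AgentState& agent)
{
  if (!agent.draining) {
    return false;  // Not draining: nothing to finish.
  }

  if (agent.activeTasks > 0 || agent.activeOperations > 0) {
    return false;  // Work is still outstanding; keep draining.
  }

  agent.drainInfoOnDisk = false;  // "Wipe" the checkpointed drain info.
  agent.draining = false;         // Back to the normal state.
  return true;
}
```

Sharing one helper between the two handlers keeps the task and operation paths
consistent: whichever acknowledgement arrives last triggers the transition.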