Greg Mann created MESOS-9818:
--------------------------------

             Summary: Implement agent-side handling of automatic draining
                 Key: MESOS-9818
                 URL: https://issues.apache.org/jira/browse/MESOS-9818
             Project: Mesos
          Issue Type: Task
          Components: agent
            Reporter: Greg Mann


The agent needs to be updated to handle automatic draining. This includes the 
following:

The agent will have a new handler for the ‘DrainSlaveMessage’:
* ‘Slave::drain()’: checkpoint the drain info
* ‘Slave::_drain()’: send KILL events for all tasks, with a kill policy
whose grace period is the minimum of the task’s kill grace period and
max_grace_period
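
As a rough illustration of the grace period rule above, here is a minimal
C++ sketch. The helper name ‘drainGracePeriod’ and the use of
‘std::optional’ are assumptions made for this example, not Mesos code; in
particular, falling back to the task’s own grace period when no maximum is
set is an assumed behaviour that the description above does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>

using Seconds = std::chrono::seconds;

// Hypothetical helper (not part of Mesos): picks the grace period to put
// into the kill policy of a KILL event sent while draining.
Seconds drainGracePeriod(
    Seconds taskKillGracePeriod,
    const std::optional<Seconds>& maxGracePeriod)
{
  // Assumption: with no maximum configured, the task's value is used as-is.
  if (!maxGracePeriod.has_value()) {
    return taskKillGracePeriod;
  }

  // The rule from the description: min(task kill grace period,
  // max_grace_period).
  return std::min(taskKillGracePeriod, *maxGracePeriod);
}
```

With this sketch, a task asking for 30 seconds under a 10 second maximum
would be given 10 seconds, while a task asking for 5 seconds keeps its 5.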

The agent’s ‘statusUpdate()’ handler will be updated:
* TASK_KILLED states will be replaced with TASK_GONE_BY_OPERATOR when the
agent is draining and is being decommissioned
* The AGENT_DRAINING reason will be inserted into all TASK_KILLING, 
TASK_KILLED, and TASK_GONE_BY_OPERATOR updates when the agent is draining
* The modified status updates will be checkpointed (instead of the original 
ones)
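
The status update rules above could be sketched roughly as follows. The
types here are simplified stand-ins rather than the Mesos protobuf
messages, and ‘rewriteForDraining’ and the ‘markGone’ flag (modelling "is
being decommissioned") are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <optional>

// Simplified stand-ins for the real task states and reasons.
enum class TaskState {
  TASK_RUNNING,
  TASK_KILLING,
  TASK_KILLED,
  TASK_FINISHED,
  TASK_GONE_BY_OPERATOR
};

enum class Reason { AGENT_DRAINING, OTHER };

struct StatusUpdate {
  TaskState state;
  std::optional<Reason> reason;
};

// Hypothetical drain info; `markGone` models "is being decommissioned".
struct DrainInfo {
  bool markGone;
};

// Returns the update that would be checkpointed and forwarded.
StatusUpdate rewriteForDraining(
    StatusUpdate update,
    const std::optional<DrainInfo>& drainInfo)
{
  // Not draining: the update passes through unchanged.
  if (!drainInfo.has_value()) {
    return update;
  }

  // Draining and being decommissioned: TASK_KILLED is replaced.
  if (update.state == TaskState::TASK_KILLED && drainInfo->markGone) {
    update.state = TaskState::TASK_GONE_BY_OPERATOR;
  }

  // Tag kill-related updates with the draining reason.
  switch (update.state) {
    case TaskState::TASK_KILLING:
    case TaskState::TASK_KILLED:
    case TaskState::TASK_GONE_BY_OPERATOR:
      update.reason = Reason::AGENT_DRAINING;
      break;
    default:
      break;
  }

  return update;
}
```

Because the function returns the modified update, a caller would
checkpoint that return value rather than the original it passed in.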

The agent’s recovery code will be updated to ensure that draining is being 
performed correctly after failover:
* If the agent is currently draining, it will loop through all tasks and send
KILL events for any task whose latest state is neither terminal nor
TASK_KILLING.
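
The recovery check above amounts to a filter over the recovered tasks. The
sketch below uses a reduced set of task states and a hypothetical
‘tasksToKillAfterRecovery’ function; it only selects task IDs and leaves
the actual sending of KILL events out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Reduced set of task states, for illustration only.
enum class TaskState {
  TASK_STAGING,
  TASK_RUNNING,
  TASK_KILLING,
  TASK_FINISHED,
  TASK_FAILED,
  TASK_KILLED,
  TASK_GONE_BY_OPERATOR
};

struct Task {
  std::string id;
  TaskState latestState;
};

// Terminal states within the reduced enum above.
bool isTerminal(TaskState state)
{
  switch (state) {
    case TaskState::TASK_FINISHED:
    case TaskState::TASK_FAILED:
    case TaskState::TASK_KILLED:
    case TaskState::TASK_GONE_BY_OPERATOR:
      return true;
    default:
      return false;
  }
}

// Hypothetical helper: IDs of tasks that still need a KILL event after the
// agent has recovered.
std::vector<std::string> tasksToKillAfterRecovery(
    bool draining,
    const std::vector<Task>& tasks)
{
  std::vector<std::string> toKill;

  if (!draining) {
    return toKill;
  }

  for (const Task& task : tasks) {
    // Skip tasks that are already done or already being killed.
    if (!isTerminal(task.latestState) &&
        task.latestState != TaskState::TASK_KILLING) {
      toKill.push_back(task.id);
    }
  }

  return toKill;
}
```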

The agent’s reregistration code will be updated to include the drain info in 
the ‘ReregisterSlaveMessage’.

The agent’s v0 ‘/state’ endpoint handler will be updated to include the drain 
info.

The agent’s ‘_statusUpdateAcknowledgement()’ and
‘operationStatusAcknowledgement()’ handlers will be updated to check whether
any active tasks or operations remain on the agent. If none remain and the
agent is currently draining, the agent will remove the checkpointed drain
info from disk and transition back into the normal, non-draining state.
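
A minimal sketch of this completion check is shown below. ‘AgentState’ and
‘afterAcknowledgement’ are hypothetical names for this example, and the
on-disk drain info is modelled as a plain flag rather than a real
checkpoint file.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, heavily reduced view of the agent's state.
struct AgentState {
  bool draining;
  bool drainInfoOnDisk;
  std::size_t activeTasks;
  std::size_t activeOperations;
};

// Models the check that would run at the end of either acknowledgement
// handler; returns the state the agent would be in afterwards.
AgentState afterAcknowledgement(AgentState state)
{
  const bool idle =
    state.activeTasks == 0 && state.activeOperations == 0;

  if (state.draining && idle) {
    // Draining is complete: drop the checkpoint and resume normal
    // operation.
    state.drainInfoOnDisk = false;
    state.draining = false;
  }

  return state;
}
```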




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
