[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan reassigned MESOS-1474:
----------------------------------------

    Assignee: Artem Harutyunyan

> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
>                 Key: MESOS-1474
>                 URL: https://issues.apache.org/jira/browse/MESOS-1474
>             Project: Mesos
>          Issue Type: Epic
>          Components: framework, master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Artem Harutyunyan
>              Labels: mesosphere, twitter
>
> Sometimes operators need to perform maintenance on a mesos cluster; we define 
> maintenance here as anything that requires the tasks to be drained on the 
> slave(s). Most mesos upgrades can be done without affecting running tasks, 
> but there are situations where maintenance is task-affecting:
> * Host maintenance (e.g. hardware repair, kernel upgrades).
> * Non-recoverable slave upgrades (e.g. adjusting slave attributes).
> * etc
> In order to ensure operators don’t violate frameworks’ SLAs, schedulers need 
> to be aware of planned unavailability events.
> Maintenance awareness allows schedulers to avoid churn for long running tasks 
> by placing them on machines not undergoing maintenance. If all resources are 
> planned for maintenance, then the scheduler will prefer machines scheduled 
> for maintenance least imminently.
> Maintenance awareness is also crucial when a scheduler uses [persistent 
> disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure 
> that the scheduler is aware of the expected duration of unavailability for a 
> persistent disk resource (e.g. using 3 1TB replicas, don’t need to replicate 
> 1TB over the network when only 1 of the 3 replicas is going to be unavailable 
> for a reboot (< 1 hour)).
> There are a few primitives of interest here:
> * Provide a way for operators to [fully shutdown a 
> slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks 
> underneath it). Colloquially known as a "hard drain".
> * Provide a way for operators to mark specific slaves as scheduled for 
> maintenance. This will inform the scheduler about the scheduled 
> unavailability of the resources.
> * Provide a way for frameworks to be notified when resources are requested to 
> be relinquished. This gives the framework to proactively move a task before 
> it may be forcibly killed by an operator. It also allows the automation of 
> operations like: "please drain these slaves within 1 hour."
> See the [design 
> doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
>  for the latest details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to