[ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Artem Harutyunyan reassigned MESOS-1474:
----------------------------------------

    Assignee: Artem Harutyunyan

> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
>                 Key: MESOS-1474
>                 URL: https://issues.apache.org/jira/browse/MESOS-1474
>             Project: Mesos
>          Issue Type: Epic
>          Components: framework, master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Artem Harutyunyan
>              Labels: mesosphere, twitter
>
> Sometimes operators need to perform maintenance on a Mesos cluster; we define
> maintenance here as anything that requires the tasks to be drained on the
> slave(s). Most Mesos upgrades can be done without affecting running tasks,
> but there are situations where maintenance is task-affecting:
> * Host maintenance (e.g. hardware repair, kernel upgrades).
> * Non-recoverable slave upgrades (e.g. adjusting slave attributes).
> * etc.
> In order to ensure operators don’t violate frameworks’ SLAs, schedulers need
> to be aware of planned unavailability events.
> Maintenance awareness allows schedulers to avoid churn for long-running tasks
> by placing them on machines not undergoing maintenance. If all resources are
> planned for maintenance, then the scheduler will prefer machines scheduled
> for maintenance least imminently.
> Maintenance awareness is also crucial when a scheduler uses
> [persistent disk|https://issues.apache.org/jira/browse/MESOS-1554] resources,
> to ensure that the scheduler is aware of the expected duration of
> unavailability for a persistent disk resource (e.g. with three 1TB replicas,
> there is no need to replicate 1TB over the network when only one of the three
> replicas is going to be unavailable for a reboot of less than 1 hour).
> There are a few primitives of interest here:
> * Provide a way for operators to
> [fully shut down a slave|https://issues.apache.org/jira/browse/MESOS-1475]
> (killing all tasks underneath it). Colloquially known as a "hard drain".
> * Provide a way for operators to mark specific slaves as scheduled for
> maintenance. This will inform the scheduler about the scheduled
> unavailability of the resources.
> * Provide a way for frameworks to be notified when resources are requested to
> be relinquished. This gives the framework the opportunity to proactively move
> a task before it may be forcibly killed by an operator. It also allows the
> automation of operations like: "please drain these slaves within 1 hour."
> See the
> [design doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
> for the latest details.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)