You're right; we don't account for that in the current design because such
a framework would be relying on disk resources outside of the sandbox.
Currently, we don't have a model for these "persistent" resources (e.g. a
disk volume used for HDFS DataNode data). Unlike the existing resources,
persistent resources will not be tied to the lifecycle of the executor/task.

When we have a model for persistent resources, I can see this fitting into
the primitives we are proposing here. Since inverse offers work at the
resource level, we can give operators control over whether persistent
resources should be reclaimed from the framework as part of the
maintenance:

For example, if decommissioning a machine, the operator can ensure that
all persistent resources are reclaimed. If rebooting a machine, the
operator can leave these resources allocated to the framework for when the
machine is back in the cluster.
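
To make that concrete, here's a rough sketch of the operator-side decision
I have in mind. None of this is the actual Mesos API; the Resource and
Maintenance types and the toReclaim() helper below are made up purely for
illustration:

  // Rough sketch only -- hypothetical types, not the Mesos API.
  #include <string>
  #include <vector>

  struct Resource {
    std::string name;
    double amount;
    bool persistent;  // e.g. a disk volume holding HDFS DataNode data.
  };

  enum class Maintenance { REBOOT, DECOMMISSION };

  // Resources the operator asks back from the framework via inverse offers.
  std::vector<Resource> toReclaim(const std::vector<Resource>& allocated,
                                  Maintenance kind) {
    std::vector<Resource> reclaim;
    for (const Resource& r : allocated) {
      if (!r.persistent) {
        // Regular resources are tied to the executor/task lifecycle and
        // are always reclaimed for maintenance.
        reclaim.push_back(r);
      } else if (kind == Maintenance::DECOMMISSION) {
        // Persistent resources are only reclaimed when the machine is
        // leaving the cluster; for a reboot they stay allocated to the
        // framework.
        reclaim.push_back(r);
      }
    }
    return reclaim;
  }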

Now, since we have soft deadlines on inverse offers, a framework like HDFS
can determine when it can comply with inverse offers based on the global
data replication state (e.g. always ensure that 2 of 3 replicas of a block are
available). If relinquishing a particular data volume would mean that only
1 copy of a block is available, the framework can wait to comply with the
inverse offer, or can take steps to create more replicas.
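
On the framework side, here's a sketch of what that compliance check could
look like for HDFS (again, made-up types and a made-up replica threshold,
not real HDFS or Mesos code):

  // Rough sketch only -- hypothetical, not real HDFS/Mesos code.
  #include <vector>

  // Hypothetical policy: never let a block fall below this many live replicas.
  constexpr int MIN_LIVE_REPLICAS = 2;

  struct Block { int liveReplicas; };

  // 'blocksOnVolume' are the blocks with a copy on the volume being reclaimed.
  bool canComplyWithInverseOffer(const std::vector<Block>& blocksOnVolume) {
    for (const Block& block : blocksOnVolume) {
      // Giving up this volume removes one replica of the block; if that
      // would leave too few copies, wait (or first create more replicas
      // elsewhere).
      if (block.liveReplicas - 1 < MIN_LIVE_REPLICAS) {
        return false;
      }
    }
    return true;  // Safe to relinquish the volume now.
  }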

One interesting question is how the resource expiry time will interact
with persistent resources; we may want to expose the expiry time at the
resource level rather than the offer level. I'll think about this.

> However could you specify that when you drain a slave with hard:false you
> don't enter the drained state even when the deadline has passed if tasks
> are still running? This is not explicit in the document and we want to make
> sure operators have the information about this and could avoid unfortunate
> rolling restarts.


This is explicit in the document under the soft deadline section: the
inverse offer will remain outstanding after the soft deadline elapses; we
won't forcibly drain the tasks. Anything that's still not clear here?
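
To spell out the intended behavior (just a sketch of the semantics
described in the document, not the implementation):

  // Rough sketch of the drain semantics only -- not actual Mesos code.
  struct Drain {
    bool hard;            // hard:true vs. hard:false maintenance.
    bool deadlinePassed;  // Has the drain deadline elapsed?
  };

  // With hard:false, a slave with running tasks never enters the drained
  // state, even after the deadline passes; the inverse offer simply remains
  // outstanding. Only a hard deadline forcibly drains remaining tasks.
  bool isDrained(const Drain& drain, int runningTasks) {
    if (runningTasks == 0) {
      return true;  // Nothing left running; the slave is drained.
    }
    return drain.hard && drain.deadlinePassed;
  }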




On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou <maxime.brugi...@gmail.com>
wrote:

> Nice work!
>
> First question: don't you think that operations should differentiate short
> and long maintenance?
> I am thinking about frameworks that use persistent storage on disk for
> example. A short maintenance such as a slave reboot or upgrade could be
> done without moving the data to another slave. However, decommissioning
> requires draining the storage too.
>
> If you have an HDFS datanode with 50TB of (replicated) data, you might not
> want to drain it for a reboot (assuming your replication factor is high
> enough) since it takes ages. However, for a decommission it might make sense
> to drain it.
>
> Not sure if this is a good example, but I feel the need to know whether
> the maintenance is planned to be short or permanent. I know this does not
> fit the nice modeling you describe :-/
>
> Actually, for HDFS we could define a threshold where replication without
> the slave would still be considered "good enough" and thus we could
> deactivate the slave. This would prevent a rolling restart from going too
> fast.
> However could you specify that when you drain a slave with hard:false you
> don't enter the drained state even when the deadline has passed if tasks
> are still running? This is not explicit in the document and we want to make
> sure operators have the information about this and could avoid unfortunate
> rolling restarts.
>  On Aug 25, 2014 9:25 PM, "Benjamin Mahler" <benjamin.mah...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I wanted to take a moment to thank Alexandra Sava, who completed her OPW
>> internship this past week. We worked together in the second half of her
>> internship to create a design document for maintenance primitives in Mesos
>> (the original ticket is MESOS-1474
>> <https://issues.apache.org/jira/browse/MESOS-1474>, but the design
>> document is the most up-to-date plan).
>>
>> Maintenance in this context consists of anything that requires the tasks
>> running on the slave to be killed (e.g. kernel upgrades, machine
>> decommissioning, non-recoverable Mesos upgrades / configuration changes,
>> etc).
>>
>> The desire is to expose maintenance events to frameworks in a generic
>> manner, so as to allow frameworks to respect their SLAs, perform better task
>> placement, and migrate tasks if necessary.
>>
>> The design document is here:
>>
>> https://docs.google.com/document/d/1NjK7MQeJzTRdfZTQ9q1Q5p4dY985bZ7cFqDpX4_fgjM/edit?usp=sharing
>>
>> Please take a moment before the end of next week to go over this design.
>> *Higher level feedback and questions can be discussed most effectively in
>> this thread.*
>>
>> Let's thank Alexandra for her work!
>>
>> Ben
>>
>
