[ 
https://issues.apache.org/jira/browse/MESOS-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639645#comment-16639645
 ] 

Matthew Mead-Briggs commented on MESOS-8933:
--------------------------------------------

Any updated thoughts on this? Since [~sagar8192] first posted this, we've 
decided to move forward with this option, and it has been running safely on 
several clusters at Yelp for some time now. In practice it looks like it will 
be a long time before all frameworks support inverse offers, so in the short 
term at least, the best strategy for us is to have the master not send offers 
for hosts that are draining. It is then up to us to know how this affects the 
tasks currently running for each framework.

It turns out that we do need to recover the resources from the allocator, so 
the patch ended up looking more like this:

[https://gist.github.com/mattmb/88859c4a40b655d8be8bbd2d59204cf5]

I agree that the best thing would be for frameworks to update and support 
maintenance natively, but I think it would be worth having this option 
upstream behind a config flag, as suggested.
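
To make the intended behaviour concrete, here is a rough toy model in Python. 
It is not the patch from the gist and not Mesos code: the class names, the 
flag name and the CPU-only resource model are all made up for illustration. 
It only shows the idea that, with the flag on, allocations for draining agents 
are not turned into offers and their resources are handed back.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: str
    draining: bool = False        # True once the agent is in draining mode
    available_cpus: float = 0.0   # CPUs not currently allocated to anyone


@dataclass
class ToyMaster:
    # Hypothetical flag standing in for the config option discussed here.
    suppress_offers_from_draining_agents: bool = False
    agents: dict = field(default_factory=dict)

    def add_agent(self, agent):
        self.agents[agent.agent_id] = agent

    def build_offers(self, allocation):
        """Turn an allocation {agent_id: cpus} into offers.

        Returns (offers, recovered). With the flag on, allocations for
        draining agents are not offered; their CPUs are recovered, i.e.
        put back into the agent's available pool.
        """
        offers, recovered = {}, {}
        for agent_id, cpus in allocation.items():
            agent = self.agents[agent_id]
            agent.available_cpus -= cpus          # allocator earmarked these
            if self.suppress_offers_from_draining_agents and agent.draining:
                agent.available_cpus += cpus      # recover instead of offering
                recovered[agent_id] = cpus
                continue
            offers[agent_id] = cpus
        return offers, recovered


master = ToyMaster(suppress_offers_from_draining_agents=True)
master.add_agent(Agent("a1", draining=False, available_cpus=4.0))
master.add_agent(Agent("a2", draining=True, available_cpus=8.0))
offers, recovered = master.build_offers({"a1": 4.0, "a2": 8.0})
print(offers, recovered)
```

With the flag off, the same call would offer both agents and recover nothing, 
which matches the current behaviour described in the ticket.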

 

> Stop sending offers from agents in draining mode
> ------------------------------------------------
>
>                 Key: MESOS-8933
>                 URL: https://issues.apache.org/jira/browse/MESOS-8933
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Sagar Sadashiv Patwardhan
>            Priority: Minor
>
> *Background:*
> At Yelp, we use Mesos to run microservices (Marathon), batch jobs (Chronos 
> and custom frameworks), Spark (the Spark Mesos framework), etc. We also 
> autoscale the number of agents in our cluster based on current demand and 
> some other metrics. We use the Mesos maintenance primitives to gracefully 
> shut down Mesos agents.
> *Problem:*
> When we want to shut down an agent for some reason, we first move the agent 
> into draining mode. This allows us to gracefully terminate the microservices 
> and other tasks. However, Mesos continues to send offers from that agent with 
> unavailability set. Frameworks such as Marathon, Chronos, and Spark ignore 
> the unavailability and schedule tasks on the agent. To prevent this, we 
> allocate all the available resources on that agent to a role that is not 
> used by any framework. This approach is not foolproof, though: there is 
> still a race condition between when we move the agent into draining mode and 
> when we allocate all the available resources on the agent to the maintenance 
> role.
> *Proposal:*
> It would be nice if Mesos stopped sending offers from agents in draining 
> mode. Something like this: 
> [https://gist.github.com/sagar8192/0b9dbccc908818f8f9f5a18d1f634513] I don't 
> know whether this affects the allocator. We could put this behind a 
> flag (something like --do-not-send-offers-from-agents-in-draining-mode) and 
> make it optional.


