[ 
https://issues.apache.org/jira/browse/MESOS-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8209:
---------------------------------

    Assignee: Vinod Kone

> mesos master should revoke offers when executor state changes
> -------------------------------------------------------------
>
>                 Key: MESOS-8209
>                 URL: https://issues.apache.org/jira/browse/MESOS-8209
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jack Crawford
>            Assignee: Vinod Kone
>
> Currently, the Mesos master does not revoke offers when the number of
> executors on an agent decreases. This is a problem under certain conditions,
> such as when running a workflow that starts lots of small tasks on agents, with
> a one-executor-per-task model, a master that does not revoke resources after
> a set amount of time, and a scheduler that does not reject resources.
> The problem arises when running a mono-scheduler framework (which you might
> want to do to easily enforce authentication requirements, have a full view of
> all scheduled tasks, etc.). In order to respond instantly when new tasks come
> in, I have the scheduler simply hang on to all resource offers it receives,
> and the master is set to never revoke offers. This way the scheduler always
> has a pool of resources to quickly service new requests as they come in.
> However, if you start tasks fast enough, the agents can fill up with
> executors, making it appear as if there are no resources available for the
> scheduler to use. I've seen this on r4.4xlarge machines on AWS with executors
> that consume 0.1 cpus, 32 MB mem, where the entire machine will appear to be
> filled with executors according to the master resource offers. The executors
> are exiting (just after the task finishes), but the resources are not
> reclaimed because the master does not revoke the outstanding resource offers
> to reflect the change.
> You can replicate this pretty easily if you schedule tasks that finish
> instantly with a 1:1 executor-to-task ratio. I find that if I schedule ~1000
> tasks this way on a single r4.4xlarge machine, usually 600-700 will finish
> before all the resource offers to the scheduler fill up and the agent appears
> to be "full" of executors.
> Changing the scheduler/master to periodically reject/revoke resources fixes 
> the problem.
> My suggestion is for the master to revoke and reissue resource offers when 
> the executor count changes on an agent.
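
The behavior described above can be illustrated with a toy model. This is only a sketch of the report's description, not Mesos's actual allocator or scheduler API: the function `run`, its parameters, the millicpu unit, and the 16-CPU agent size are all assumptions made for illustration. The scheduler holds one offer and never returns it; executors exit immediately and free their resources on the agent, but the held offer is never refreshed unless `refresh_every` is set.

```python
def run(num_tasks, total_mcpus=16000, task_mcpus=100, refresh_every=None):
    """Return how many tasks the toy scheduler manages to launch.

    refresh_every=None models a scheduler that never declines its offer and
    a master that never revokes it. An integer N models returning the offer
    and receiving a fresh one on every Nth launch attempt.
    All names and numbers here are hypothetical.
    """
    in_use = 0            # millicpus actually held by live executors
    offer = total_mcpus   # scheduler's view: what is left in its held offer
    launched = 0
    for attempt in range(1, num_tasks + 1):
        if refresh_every and attempt % refresh_every == 0:
            # A fresh offer reflects the agent's real free resources.
            offer = total_mcpus - in_use
        if offer < task_mcpus:
            break  # the agent looks "full" from the scheduler's side
        offer -= task_mcpus
        in_use += task_mcpus
        launched += 1
        in_use -= task_mcpus  # task finishes instantly, executor exits


    return launched


if __name__ == "__main__":
    print(run(1000))                     # stale offer: stops early
    print(run(1000, refresh_every=100))  # periodic refresh: keeps going
```

In this model the stale-offer run stops once the held offer is exhausted, even though the agent is idle, while the run with a periodic refresh launches every task. The exact counts depend on the assumed sizes and are not meant to match the 600-700 figure reported above.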



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
