[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757
 ] 

Benjamin Mahler commented on MESOS-1466:
----------------------------------------

We're going to proceed with a mitigation of this by rejecting tasks once the 
slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. 
One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: 
MESOS-1720), the slave sends an exited executor message to the master.
(2) The master acknowledges this message.
(3) The slave will not accept tasks for unacknowledged terminal executors (this 
must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master 
is aware of the old executor exiting.
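
As a rough illustration of steps (1)-(3), here is a minimal Python sketch of 
the slave-side bookkeeping. It is not Mesos code: the class name, method names 
and message tuple are all hypothetical, and it only models which executors 
have an exit that the master has not yet acknowledged.

```python
class SlaveSketch:
    """Toy model of a slave (hypothetical, not the Mesos implementation)."""

    def __init__(self):
        self.running = set()         # executor ids currently running
        self.unacknowledged = set()  # terminal executors awaiting a master ack

    def executor_terminated(self, executor_id):
        # Step (1): covers a normal exit as well as a failed launch.
        self.running.discard(executor_id)
        self.unacknowledged.add(executor_id)
        # Hypothetical message that would be sent to the master.
        return ("ExitedExecutor", executor_id)

    def acknowledge(self, executor_id):
        # Step (2): the master's acknowledgment has arrived.
        self.unacknowledged.discard(executor_id)

    def launch_task(self, task_id, executor_id):
        # Step (3): reject tasks for executors whose exit is unacknowledged.
        if executor_id in self.unacknowledged:
            return False
        self.running.add(executor_id)
        return True


slave = SlaveSketch()
first = slave.launch_task("t1", "e1")   # accepted, executor starts
slave.executor_terminated("e1")         # exit reported, not yet acknowledged
second = slave.launch_task("t2", "e1")  # rejected, master is not aware yet
slave.acknowledge("e1")                 # master acknowledges the exit
third = slave.launch_task("t3", "e1")   # accepted, a new executor may start
```

In this sketch the second launch is rejected because it arrives between the 
exit and the acknowledgment, which is the window in which the master could 
otherwise re-offer the old executor's resources.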

> Race between executor exited event and launch task can cause overcommit of 
> resources
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-1466
>                 URL: https://issues.apache.org/jira/browse/MESOS-1466
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>              Labels: reliability
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources 
> for the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
