Hi all,

TL;DR: In Mesos 1.5, the agent may explicitly drop tasks if all three of the following conditions are met:
(1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls use the same executor.
(2) That executor does not currently exist on the agent.
(3) Due to a race condition, the agent processes these tasks in a different order from their original launch order.
In this case, any task that the agent tries to launch before the first task in the original order will be explicitly dropped by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent). This bug will be fixed in 1.5.1. It is tracked in https://issues.apache.org/jira/browse/MESOS-8624

----

In https://issues.apache.org/jira/browse/MESOS-1720, we introduced an ordering dependency between two `LAUNCH`/`LAUNCH_GROUP` calls to a new executor. The master specifies that the first call is the one to launch the new executor through the `launch_executor` field in `RunTaskMessage`/`RunTaskGroupMessage`, and that the second one should use the existing executor launched by the first.

On the agent side, running a task/task group goes through a series of continuations. One is a `collect()` on the future that unschedules frameworks from being GC'ed:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
Another is a `collect()` on task authorization:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333

Since each of these `collect()` calls runs on its own actor, the futures returned by the `collect()` calls for two `LAUNCH`/`LAUNCH_GROUP` calls may complete out of order, even if the futures these two `collect()` calls wait on are satisfied in order (which is true in these two cases).

As a result, under some race conditions (probably under heavy load), a task that relies on a previous task to launch the executor may get processed before the task that is supposed to launch the executor, and is then explicitly dropped by the agent.

-Meng
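
P.S. To make the failure mode concrete, here is a minimal toy model in Python. It is not Mesos code: `RunTask` and `process` are hypothetical names, and the drop rule is a simplified reading of the behavior described above (a task whose `launch_executor` flag is false is dropped if its executor does not exist yet). The list passed to `process` stands for the order in which the continuations happen to finish on the agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTask:
    """Toy stand-in for a RunTaskMessage (hypothetical, not the real protobuf)."""
    task_id: str
    executor_id: str
    launch_executor: bool  # True only for the call the master chose to create the executor


def process(arrivals):
    """Handle messages in the order their continuations finish on the toy agent."""
    executors = set()  # executors that exist on the agent so far
    outcome = {}
    for msg in arrivals:
        if msg.launch_executor:
            # This call is the one allowed to create the executor.
            executors.add(msg.executor_id)
            outcome[msg.task_id] = "launched"
        elif msg.executor_id in executors:
            # The executor was already created by an earlier call.
            outcome[msg.task_id] = "launched"
        else:
            # Assumed rule: the executor was expected to exist but does not.
            outcome[msg.task_id] = "dropped"
    return outcome


first = RunTask("t1", "e1", launch_executor=True)
second = RunTask("t2", "e1", launch_executor=False)

in_order = process([first, second])   # original launch order
reordered = process([second, first])  # continuations finished out of order

print(in_order)
print(reordered)
```

With the original order both tasks launch; with the reordered arrival, `t2` is dropped even though it was a valid task, which is the symptom tracked in MESOS-8624.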