I chatted offline with Chun and Meng and suggested that we explicitly use `process::Sequence` to ensure ordered task delivery (this would need to be done in both the master and the agent).
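
To make the intended guarantee concrete, here is a minimal, self-contained sketch of ordered delivery. It is a plain C++ stand-in and does not use libprocess or show how `process::Sequence` is implemented. The names `OrderedDelivery`, `enqueue`, `ready`, and `simulate` are hypothetical and exist only for illustration. The idea is that each launch call takes a ticket on arrival, and a call is only delivered once every earlier call has been delivered, even if its own asynchronous steps finish first.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ordering guarantee; NOT the libprocess API.
class OrderedDelivery {
 public:
  // Reserves the next slot in arrival order and returns its ticket.
  std::size_t enqueue(std::function<void()> deliver) {
    pending_.emplace(next_ticket_, Entry{std::move(deliver), false});
    return next_ticket_++;
  }

  // Marks a ticket's asynchronous steps as finished, then delivers every
  // entry at the head of the queue that is ready, in ticket order.
  void ready(std::size_t ticket) {
    auto it = pending_.find(ticket);
    if (it == pending_.end()) return;
    it->second.ready = true;

    while (!pending_.empty() && pending_.begin()->second.ready) {
      std::function<void()> deliver = std::move(pending_.begin()->second.deliver);
      pending_.erase(pending_.begin());
      deliver();
    }
  }

 private:
  struct Entry {
    std::function<void()> deliver;
    bool ready;
  };

  std::map<std::size_t, Entry> pending_;  // Ordered by ticket.
  std::size_t next_ticket_ = 0;
};

// Simulates two launch calls whose asynchronous steps finish out of order.
std::vector<std::string> simulate() {
  std::vector<std::string> delivered;
  OrderedDelivery queue;

  std::size_t first = queue.enqueue(
      [&delivered] { delivered.push_back("task 1: launch executor"); });
  std::size_t second = queue.enqueue(
      [&delivered] { delivered.push_back("task 2: use existing executor"); });

  queue.ready(second);  // The second call finishes its async steps first...
  queue.ready(first);   // ...but is held back until the first is delivered.

  return delivered;
}
```

Calling `simulate()` yields the two entries in arrival order, so the task that launches the executor always runs before the task that expects the executor to exist.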
On Thu, Mar 1, 2018 at 1:17 PM, Chun-Hung Hsiao <chhs...@mesosphere.io> wrote:
> Some background for the bug AlexR and Meng found:
>
> In https://issues.apache.org/jira/browse/MESOS-1720,
> we introduced an ordering dependency between two `LAUNCH`/`LAUNCH_GROUP`
> calls to a new executor.
> The master would specify that the first call is the one to launch a new
> executor through the `launch_executor` field in
> `RunTaskMessage`/`RunTaskGroupMessage`,
> and the second one should use the existing executor launched by the first
> call.
> On the agent side, it will drop any task that wants to launch an executor
> that already exists,
> or any task that wants to run on a non-existent executor.
>
> Running a task/task group goes through a series of continuations.
> One is `collect()` on the futures that unschedule frameworks from being
> GC'ed:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2158
> Another is `collect()` on task authorization:
> https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2333
> Since these `collect()` calls run on individual actors, the futures of the
> `collect()` calls for
> two `LAUNCH`/`LAUNCH_GROUP` calls may return out of order,
> even if the futures these two `collect()` calls wait for are satisfied in
> order (which is true).
>
> The result is that, if this race condition is triggered,
> the agent will try to run the second task/task group before the first one,
> and since the executor is supposed to be launched by the first one,
> the agent will end up sending `TASK_DROPPED` for the second call.
>
> If we can have an interface to make sure that `collect()` calls return in
> the same order as their dependent futures, this can be avoided.
>
> On Mar 1, 2018 12:50 PM, "Benjamin Mahler" <bmah...@apache.org> wrote:
>
> > Could you explain the problem in more detail?
> >
> > On Thu, Mar 1, 2018 at 12:15 PM Chun-Hung Hsiao <chhs...@mesosphere.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > Meng found a bug in `slave.cpp`, where the proper fix requires collecting
> > > futures in order. Currently every `collect` call spawns its own actor, so
> > > for two `collect` calls, even though their futures are satisfied in order,
> > > they may finish out of order. So we need some libprocess changes to have
> > > the ability to collect futures in the same actor. Here I have two
> > > proposals:
> > >
> > > 1. Add a new `collect` interface that takes an actor as a parameter.
> > >
> > > 2. Introduce `process::Executor::collect()` for this.
> > >
> > > Any opinion on these two options?