Benno, you may be seeing MESOS-4111 <https://issues.apache.org/jira/browse/MESOS-4111>. Also, have a look at this comment: https://github.com/apache/mesos/blob/9f472b1eff904d0d96063d3bed535a8e81263d69/src/launcher/executor.cpp#L611-L617
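
If the race is what I think it is (the process exits before the final update has actually left it), one possible stopgap is to pause briefly between sendStatusUpdate() and stop(). Below is a rough sketch of that idea for your Python executor, not a definitive fix. The helper name finish_task, the flush_delay_secs parameter, and the 1 second default are all hypothetical; only sendStatusUpdate() and stop() come from your snippet.

```python
import time


def finish_task(driver, status, flush_delay_secs=1.0, sleep=time.sleep):
    """Send the final status update, pause, then stop the driver.

    `driver` is any object exposing sendStatusUpdate() and stop(), as in
    the snippet from the original mail. The function name, the delay
    parameter, and its 1 second default are assumptions for illustration.
    """
    driver.sendStatusUpdate(status)
    # Give the update a chance to leave the process before shutting down.
    # This only narrows the race window; it does not wait for an
    # acknowledgement, so it cannot rule the race out entirely.
    sleep(flush_delay_secs)
    driver.stop()
```

In your executor this would replace the first two lines of the shutdown sequence, e.g. finish_task(driver, _create_task_status(TASK_FINISHED)). The sleep argument is injectable only so the helper can be exercised without real waiting.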

On Tue, May 3, 2016 at 2:49 PM, Evers Benno <ben...@yandex-team.ru> wrote:

> Hi,
>
> I was wondering about the semantics of the Executor::sendStatusUpdate()
> method. It is described as
>
>     // Sends a status update to the framework scheduler, retrying as
>     // necessary until an acknowledgement has been received or the
>     // executor is terminated (in which case, a TASK_LOST status update
>     // will be sent). See Scheduler::statusUpdate for more information
>     // about status update acknowledgements.
>
> I was understanding this to say that the function blocks until an
> acknowledgement is received, but looking at the implementation of
> MesosExecutor it seems that "retrying as necessary" only means
> re-sending of unacknowledged updates when the slave reconnects.
> (exec/exec.cpp:274)
>
> I'm wondering because we're currently running a python executor which
> ends its life like this:
>
>     driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>     driver.stop()
>     # in a different thread:
>     sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>
> and we're seeing situations (roughly once per 10,000 tasks) where the
> executor process is reaped before the acknowledgement for TASK_FINISHED
> is sent from slave to executor. This results in mesos generating a
> TASK_FAILED status update, probably from
> Slave::sendExecutorTerminatedStatusUpdate().
>
> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
> and we have to change the executor shutdown?
>
> Best regards,
> Benno