Benno,

You may be seeing MESOS-4111
<https://issues.apache.org/jira/browse/MESOS-4111>. Also, have a look at
this comment in src/launcher/executor.cpp:
https://github.com/apache/mesos/blob/9f472b1eff904d0d96063d3bed535a8e81263d69/src/launcher/executor.cpp#L611-L617
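
If it does turn out to be the same race, one possible workaround on the executor side is to pause briefly between sending the final update and stopping the driver. The following is only a minimal sketch: `FakeDriver` and `finish_task` are hypothetical names used here for illustration (not part of the Mesos API), and the one-second delay is an assumed value. A sleep narrows the race window; it does not guarantee that the acknowledgement arrives.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for the real executor driver; records calls."""

    def __init__(self):
        self.calls = []

    def sendStatusUpdate(self, status):
        self.calls.append(("sendStatusUpdate", status))

    def stop(self):
        self.calls.append(("stop",))


def finish_task(driver, status, flush_delay=1.0, sleep=time.sleep):
    """Send the final status update, then pause before stopping the driver.

    The pause is a workaround, not a guarantee: it only gives the update
    time to leave the process (and possibly be acknowledged) before the
    executor exits.
    """
    driver.sendStatusUpdate(status)
    sleep(flush_delay)  # assumed delay; tune for your environment
    driver.stop()
```

With the real driver, the call in your shutdown path would look like `finish_task(driver, _create_task_status(TASK_FINISHED))`. The `sleep` parameter exists only so the delay can be replaced in tests.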

On Tue, May 3, 2016 at 2:49 PM, Evers Benno <ben...@yandex-team.ru> wrote:

> Hi,
>
> I was wondering about the semantics of the Executor::sendStatusUpdate()
> method. It is described as
>
>     // Sends a status update to the framework scheduler, retrying as
>     // necessary until an acknowledgement has been received or the
>     // executor is terminated (in which case, a TASK_LOST status update
>     // will be sent). See Scheduler::statusUpdate for more information
>     // about status update acknowledgements.
>
> I understood this to mean that the function blocks until an
> acknowledgement is received, but looking at the implementation of
> MesosExecutor, it seems that "retrying as necessary" only means
> re-sending unacknowledged updates when the slave reconnects
> (exec/exec.cpp:274).
>
> I'm asking because we're currently running a Python executor which
> ends its life like this:
>
>     driver.sendStatusUpdate(_create_task_status(TASK_FINISHED))
>     driver.stop()
>     # in a different thread:
>     sys.exit(0 if driver.run() == mesos_pb2.DRIVER_STOPPED else 1)
>
> and we're seeing situations (roughly once per 10,000 tasks) where the
> executor process is reaped before the acknowledgement for TASK_FINISHED
> is sent from slave to executor. This results in Mesos generating a
> TASK_FAILED status update, probably from
> Slave::sendExecutorTerminatedStatusUpdate().
>
> So, did I misunderstand how MesosExecutor works? Or is it indeed a race,
> and we have to change the executor shutdown?
>
> Best regards,
> Benno
>
