I deal with Java programs running in my executor that spawn various
"service/daemon threads". So, I tend to explicitly call TASK_FINISHED and
call System.exit() (with a sleep to allow Mesos to communicate the task
update) when I know the task is complete instead of waiting for natural
exit of all t
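The explicit-finish pattern described above might look roughly like this (a minimal sketch; `StatusSender` is a simplified stand-in for the real Mesos `ExecutorDriver`, and the names are illustrative):

```java
// Sketch of the "explicit finish" pattern: send TASK_FINISHED, give the
// slave a moment to forward the update, then force the JVM down so
// lingering service/daemon threads cannot keep the executor alive.
// StatusSender is a stand-in for the real ExecutorDriver.
interface StatusSender {
    void sendStatusUpdate(String taskId, String state);
}

class ExplicitFinish {
    static void finishAndExit(StatusSender driver, String taskId,
                              long graceMillis, boolean reallyExit) {
        driver.sendStatusUpdate(taskId, "TASK_FINISHED");
        try {
            // Sleep briefly so the update reaches the slave before exit.
            Thread.sleep(graceMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (reallyExit) {
            System.exit(0); // never reached when reallyExit is false
        }
    }
}
```

The `reallyExit` flag only exists so the sketch can be exercised without killing the JVM; in a real executor you would call `System.exit(0)` unconditionally after the grace period.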
Thanks Alex.
I agree that it looks like it's not Mesos-related. It's probably some
deadlock.
On Mon, Jan 26, 2015 at 1:31 PM, Alex Rukletsov wrote:
Itamar,
you are right, Mesos executor and containerizer cannot distinguish
between "busy" and "stuck" processes. However, since you use your own
custom executor, you may want to implement some sort of health check. It
depends on what your task processes are doing.
There are hundreds of reasons why
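One possible shape for such an executor-side health check is a watchdog that the task loop pings as it makes progress (a sketch under assumptions: the timeout value and the `onStuck` action, e.g. sending TASK_FAILED and killing the task process, are placeholders, not Mesos API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a watchdog for a custom executor: the task loop calls
// heartbeat() as it makes progress; if no heartbeat arrives within the
// timeout, check() fires onStuck instead of letting the task sit in
// RUNNING forever. onStuck is a placeholder for the real failure action.
class TaskWatchdog {
    private final AtomicLong lastBeat = new AtomicLong(System.currentTimeMillis());
    private final long timeoutMillis;
    private final Runnable onStuck;

    TaskWatchdog(long timeoutMillis, Runnable onStuck) {
        this.timeoutMillis = timeoutMillis;
        this.onStuck = onStuck;
    }

    // Called by the task as it makes progress.
    void heartbeat() {
        lastBeat.set(System.currentTimeMillis());
    }

    // Called periodically; returns true if the task is considered stuck.
    boolean check() {
        if (System.currentTimeMillis() - lastBeat.get() > timeoutMillis) {
            onStuck.run();
            return true;
        }
        return false;
    }
}
```

In practice `check()` would run on a `ScheduledExecutorService` in the executor, with a timeout chosen from the task's expected runtime.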
Alex, Sharma, thanks for your input!
Trying to recreate the issue on a small cluster over the last few days, I
was not able to observe a scenario where I can be sure that my executor sent
the TASK_FINISHED update but the scheduler did not receive it.
I did observe, multiple times, a scenario where a
Itamar,
beyond checking the master and slave logs, could you please verify your
executor does send the TASK_FINISHED update? You may want to add some
logging and then check the executor log. Mesos guarantees the delivery of
status updates, so I suspect the problem is on the executor's side.
On Wed, Jan 2
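A cheap way to get that visibility is to log every status update just before it is handed to the driver (a sketch; `LoggingSender` and the `UpdateSender` interface are illustrative stand-ins, not the real ExecutorDriver API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: wrap whatever sends status updates so every TASK_* transition
// is recorded in the executor log before it goes out. If TASK_FINISHED
// never appears in this log, the executor never sent it.
interface UpdateSender {
    void send(String taskId, String state);
}

class LoggingSender implements UpdateSender {
    private final UpdateSender delegate;
    final List<String> log = new ArrayList<>(); // stand-in for a real logger

    LoggingSender(UpdateSender delegate) {
        this.delegate = delegate;
    }

    @Override
    public void send(String taskId, String state) {
        log.add("sending " + state + " for " + taskId);
        delegate.send(taskId, state);
    }
}
```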
Have you checked the mesos-slave and mesos-master logs for that task id?
There should be logs in there for task state updates, including FINISHED.
There can be specific cases where the task status is not reliably
sent to your scheduler (due to mesos-master restarts, leader election
change
I'm using a custom internal framework, loosely based on MesosSubmit.
The phenomenon I'm seeing is something like this:
1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X is in the "Running" state for
several *hours*.
4. I S