What you observed is expected because of the way the slave (specifically,
the status update manager) operates.

The status update manager sends the next update for a task only after the
previous update (if any) has been acked.

In your case, since TASK_RUNNING was not acked by the framework, the master
doesn't know about the TASK_FINISHED update that is queued up by the status
update manager.

If the framework never comes back, i.e., the failover timeout elapses, the
master shuts down the framework, which releases those resources.
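
To make the ordering concrete, here is a toy model of the ack gating
described above. It is only a sketch, not the actual status update manager
code: the class and function names (TaskUpdateStream, master_view) are made
up for illustration, and it ignores retries, checkpointing and the failover
timeout.

```python
from collections import deque


class TaskUpdateStream:
    """Hypothetical model of an ack-gated update queue for a single task."""

    def __init__(self):
        self.pending = deque()  # updates not yet acked, oldest first
        self.forwarded = []     # updates that were actually sent out

    def enqueue(self, state):
        was_empty = not self.pending
        self.pending.append(state)
        # Forward right away only if no older update is awaiting an ack.
        if was_empty:
            self.forwarded.append(state)

    def acknowledge(self):
        # An ack retires the oldest update and releases the next one, if any.
        if not self.pending:
            return
        self.pending.popleft()
        if self.pending:
            self.forwarded.append(self.pending[0])


def master_view(stream):
    """Latest state the master has been told about, or None."""
    return stream.forwarded[-1] if stream.forwarded else None


stream = TaskUpdateStream()
stream.enqueue("TASK_RUNNING")   # forwarded, but the framework never acks it
stream.enqueue("TASK_FINISHED")  # stays queued behind the unacked update
print(master_view(stream))       # TASK_RUNNING

stream.acknowledge()             # framework reconnects and acks TASK_RUNNING
print(master_view(stream))       # TASK_FINISHED
```

Until the ack arrives, the only state that has left the slave is
TASK_RUNNING, which matches what you see in the master UI.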

On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <nik...@mesosphere.io>
wrote:

> Here is the log of a mesos-local instance where I reproduced it:
> https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are
> stuck in the running state).
> There is a lot of output, so here is a filtered log for task 10:
> https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
>
> At first glance, it looks like the task can't be found when trying to
> forward the finish update because the running update never got acknowledged
> before the framework disconnected. I may be missing something here.
>
> Niklas
>
>
> On 10 September 2014 16:09, Niklas Nielsen <nik...@mesosphere.io> wrote:
>
> > Hi guys,
> >
> > We have run into a problem that causes tasks which complete while a
> > framework is disconnected (and has a failover timeout) to remain in a
> > running state even though the tasks have actually finished.
> >
> > Here is a test framework we have been able to reproduce the issue with:
> > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > It launches many short-lived tasks (1 second sleep) and when killing the
> > framework instance, the master reports the tasks as running even after
> > several minutes:
> >
> > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> >
> > When clicking on one of the slaves where, for example, task 49 ran, the
> > slave knows that it completed:
> >
> > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> >
> > The tasks only finish when the framework connects again (which it may
> > never do). This is on Mesos 0.20.0, but also applies to HEAD (as of
> > today).
> > Do you guys have any insights into what may be going on here? Is this
> > by-design or a bug?
> >
> > Thanks,
> > Niklas
> >
>
