The main reason is to keep the status update manager simple. It also makes it very easy to enforce the order of updates to the master/framework in this model. If we allow multiple updates for a task to be in flight, it's really hard (impossible?) to ensure that we are not delivering out-of-order updates, even in edge cases (failover, network partitions, etc.).
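To make that concrete, here is a rough sketch of the acknowledgement-gated, per-task stream behavior described above. This is Python and purely illustrative -- it is not the actual C++ status update manager and ignores checkpointing, retries, and slave failover; all names are made up:

# Simplified sketch of the acknowledgement-gated stream described above.
# NOT the real Mesos status update manager; names are illustrative only.

from collections import deque


class TaskStatusStream:
    """Per-task FIFO of status updates; only the front may be in flight."""

    def __init__(self, forward):
        self.pending = deque()   # updates not yet acknowledged
        self.forward = forward   # callable that sends an update onward
        self.in_flight = False   # at most one unacknowledged update

    def update(self, status):
        """Queue a new update (e.g. TASK_RUNNING, TASK_FINISHED)."""
        self.pending.append(status)
        self._maybe_forward()

    def acknowledge(self):
        """The front update was acked; advance the stream."""
        self.pending.popleft()
        self.in_flight = False
        self._maybe_forward()

    def _maybe_forward(self):
        # Only the front of the stream is ever sent; later updates wait,
        # which is what guarantees in-order delivery.
        if self.pending and not self.in_flight:
            self.in_flight = True
            self.forward(self.pending[0])


# With a disconnected framework, TASK_RUNNING is forwarded but never acked,
# so TASK_FINISHED stays queued and the master keeps showing the task running.
stream = TaskStatusStream(forward=lambda s: print("forwarded:", s))
stream.update("TASK_RUNNING")    # forwarded: TASK_RUNNING
stream.update("TASK_FINISHED")   # queued, waiting for an ack that never comes

In this model the TASK_FINISHED update only moves to the front (and gets forwarded) once acknowledge() runs for TASK_RUNNING, which matches what you saw in the logs.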
On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:

> Hey Vinod - thanks for chiming in!
>
> Is there a particular reason for only having one status in flight? Or to
> put it another way, isn't that too strict a behavior, given that the
> master state could present the most recent known state if the status
> update manager tried to send more than the front of the stream?
> With very long timeouts, just waiting for those to disappear seems a bit
> tedious and hogs the cluster.
>
> Niklas
>
> On 10 September 2014 17:18, Vinod Kone <vinodk...@gmail.com> wrote:
>
> > What you observed is expected because of the way the slave (specifically,
> > the status update manager) operates.
> >
> > The status update manager only sends the next update for a task if a
> > previous update (if it exists) has been acked.
> >
> > In your case, since TASK_RUNNING was not acked by the framework, the
> > master doesn't know about the TASK_FINISHED update that is queued up by
> > the status update manager.
> >
> > If the framework never comes back, i.e., the failover timeout elapses,
> > the master shuts down the framework, which releases those resources.
> >
> > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <nik...@mesosphere.io>
> > wrote:
> >
> > > Here is the log of a mesos-local instance where I reproduced it:
> > > https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19
> > > are stuck in the running state).
> > > There is a lot of output, so here is a filtered log for task 10:
> > > https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > >
> > > At first glance, it looks like the task can't be found when trying to
> > > forward the finished update, because the running update never got
> > > acknowledged before the framework disconnected. I may be missing
> > > something here.
> > >
> > > Niklas
> > >
> > > On 10 September 2014 16:09, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > >
> > > > Hi guys,
> > > >
> > > > We have run into a problem that causes tasks which complete while a
> > > > framework is disconnected (and has a failover time) to remain in a
> > > > running state even though the tasks actually finish.
> > > >
> > > > Here is a test framework we have been able to reproduce the issue with:
> > > > https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > It launches many short-lived tasks (1 second sleep), and when the
> > > > framework instance is killed, the master reports the tasks as running
> > > > even after several minutes:
> > > >
> > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > >
> > > > When clicking on one of the slaves where, for example, task 49 runs,
> > > > the slave knows that it completed:
> > > >
> > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > >
> > > > The tasks only finish when the framework connects again (which it may
> > > > never do). This is on Mesos 0.20.0, but it also applies to HEAD (as of
> > > > today).
> > > > Do you guys have any insights into what may be going on here? Is this
> > > > by design or a bug?
> > > >
> > > > Thanks,
> > > > Niklas
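Regarding the long timeouts mentioned above: that timeout is whatever the framework registers with in FrameworkInfo.failover_timeout (seconds). A rough sketch -- the failover_timeout field is from the FrameworkInfo protobuf, but treat the import and the rest of the snippet as illustrative, since the module path for the Python bindings varies across Mesos versions:

# Illustrative only: the protobuf module path differs across Mesos versions
# (e.g. mesos_pb2 vs. mesos.interface.mesos_pb2); scheduler code is omitted.
import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                 # let Mesos fill in the current user
framework.name = "short-lived-task-framework"
framework.failover_timeout = 60.0   # seconds the master waits for the
                                    # scheduler to re-register before it
                                    # shuts the framework down and releases
                                    # its tasks and resources

A framework that does not need to survive restarts can keep this small, so the master tears it down (and frees the stuck tasks' resources) soon after it disconnects, rather than holding them for a long failover window.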