Definitely relevant. If the master could be trusted to persist all the task
status updates, then they could be queued at the master instead of the
slave once the master has acknowledged their receipt. The master would then
have the most up-to-date task state and could recover the resources as soon
as it receives a terminal update, even if the framework is disconnected and
unable to receive/ack the status updates. Once the framework reconnects,
the master would be responsible for sending it the queued status updates.
We would still need a queue on the slave side, but only for updates that
the master has not yet persisted and ack'ed, primarily for the scenario
where the slave is disconnected from the master.
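
To make this concrete, here is a rough sketch of the proposed flow (plain
Python, not actual Mesos code; the Master class and the method names below
are made up for illustration):

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

class Master:
    def __init__(self):
        self.persisted = []              # updates persisted and acked back to the slave
        self.pending = []                # updates not yet acknowledged by the framework
        self.framework_connected = False

    def handle_status_update(self, update):
        # Persist first, then ack the slave so it can drop the update from
        # its own queue.
        self.persisted.append(update)
        self.ack_slave(update)

        # Resources can be recovered as soon as a terminal update arrives,
        # whether or not the framework is around to see it.
        if update["state"] in TERMINAL_STATES:
            self.recover_resources(update["task_id"])

        if self.framework_connected:
            self.forward_to_framework(update)
        else:
            self.pending.append(update)

    def on_framework_reconnected(self):
        # The master, not the slave, replays the queued updates in order.
        # (Updates would stay queued until the framework acks them; ack
        # handling is omitted here.)
        self.framework_connected = True
        for update in self.pending:
            self.forward_to_framework(update)

    # Stand-ins for the real master internals.
    def ack_slave(self, update):
        print("ack to slave:", update)

    def recover_resources(self, task_id):
        print("recovering resources for task", task_id)

    def forward_to_framework(self, update):
        print("forwarding to framework:", update)

master = Master()
master.handle_status_update({"task_id": "10", "state": "TASK_RUNNING"})
master.handle_status_update({"task_id": "10", "state": "TASK_FINISHED"})
master.on_framework_reconnected()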

On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <vinodk...@gmail.com> wrote:

> The semantics of these changes would have an impact on the upcoming task
> reconciliation.
>
> @BenM: Can you chime in here on how this fits into the task reconciliation
> work that you've been leading?
>
> On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:
>
> > I agree with Niklas that if the executor has sent a terminal status
> > update to the slave, then the task is done and the master should be able
> > to recover those resources. Only sending the oldest status update to the
> > master, especially in the case of framework failover, prevents these
> > resources from being recovered in a timely manner. I see a couple of
> > options for getting around this, each with its own disadvantages.
> > 1) Send the entire status update stream to the master. Once the master
> > sees the terminal status update, it will removeTask and recover the
> > resources. Future resends of the update will be forwarded to the
> > scheduler, but the master will ignore the subsequent updates (with a
> > warning and invalid_update++ metrics) as far as its own state for the
> > removed task is concerned. Disadvantage 1: potentially sends a lot of
> > status update messages until the scheduler reregisters and acknowledges
> > the updates. Disadvantage 2: updates could be delivered to the scheduler
> > out of order if some updates are dropped between the slave and master.
> > 2) Send only the oldest status update to the master, but with an
> > annotation of the final/terminal state of the task, if any. That way the
> > master can call removeTask to update its internal state for the task
> > (and update the UI) and recover the resources for the task. While the
> > scheduler is still down, the oldest update will continue to be resent
> > and forwarded, but the master will ignore it (with a warning as above)
> > as far as its own internal state is concerned. When the scheduler
> > reregisters, the update stream will be forwarded and acknowledged
> > one-at-a-time as before, guaranteeing status update ordering to the
> > scheduler. Disadvantage 1: it seems a bit hacky to tack a terminal state
> > onto a running update. Disadvantage 2: the state endpoint won't show all
> > the status updates until the entire stream actually gets
> > forwarded+acknowledged.
> > Thoughts?
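
For option 2, a rough sketch of what the annotation could look like (again
plain Python rather than Mesos code; the "latest_state" field and the helper
names are invented for the example):

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def update_to_forward(stream):
    """The slave still sends only the oldest unacked update, but annotates
    it with the task's latest known state."""
    oldest = dict(stream[0])
    oldest["latest_state"] = stream[-1]["state"]   # e.g. TASK_FINISHED
    return oldest

def master_handle(update, active_tasks):
    """The master recovers resources based on the annotation, while the
    scheduler still sees the oldest update first, preserving ordering."""
    if update["latest_state"] in TERMINAL_STATES and update["task_id"] in active_tasks:
        active_tasks.remove(update["task_id"])     # removeTask + recover resources
        print("recovered resources for task", update["task_id"])
    print("forwarding", update["state"], "for task", update["task_id"])

# TASK_RUNNING was never acked, TASK_FINISHED is queued behind it.
stream = [
    {"task_id": "10", "state": "TASK_RUNNING"},
    {"task_id": "10", "state": "TASK_FINISHED"},
]
master_handle(update_to_forward(stream), {"10"})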
> >
> >
> > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <vinodk...@gmail.com> wrote:
> >
> > > The main reason is to keep the status update manager simple. Also, it
> > > is very easy to enforce the order of updates to the master/framework
> > > in this model. If we allow multiple updates for a task to be in
> > > flight, it's really hard (impossible?) to ensure that we are not
> > > delivering out-of-order updates even in edge cases (failover, network
> > > partitions, etc.).
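
A toy illustration of that ordering concern (plain Python, not Mesos code):

import random

def send_all_in_flight(updates, drop_prob=0.5):
    # Fire-and-forget with retries: if an earlier update is dropped and
    # retried later, the receiver can see updates out of order.
    delivered, pending = [], list(updates)
    while pending:
        for u in list(pending):
            if random.random() > drop_prob:      # message survived the network
                delivered.append(u)
                pending.remove(u)
    return delivered

def send_one_at_a_time(updates, drop_prob=0.5):
    # Ack-gated: the next update goes out only after the previous one is
    # acked, so the receiver always sees the original order.
    delivered = []
    for u in updates:
        while random.random() <= drop_prob:      # retry until acked
            pass
        delivered.append(u)
    return delivered

updates = ["TASK_STARTING", "TASK_RUNNING", "TASK_FINISHED"]
print(send_all_in_flight(updates))   # may come out reordered
print(send_one_at_a_time(updates))   # always the original order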
> > >
> > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <nik...@mesosphere.io>
> > > wrote:
> > >
> > > > Hey Vinod - thanks for chiming in!
> > > >
> > > > Is there a particular reason for only having one status update in
> > > > flight? Or, to put it another way, isn't that behavior too strict,
> > > > given that the master state could present the most recent known
> > > > state if the status update manager sent more than just the front of
> > > > the stream?
> > > > With very long failover timeouts, just waiting for those to elapse
> > > > seems a bit tedious and hogs the cluster.
> > > >
> > > > Niklas
> > > >
> > > > On 10 September 2014 17:18, Vinod Kone <vinodk...@gmail.com> wrote:
> > > >
> > > > > What you observed is expected because of the way the slave
> > > > > (specifically, the status update manager) operates.
> > > > >
> > > > > The status update manager only sends the next update for a task if
> > > > > a previous update (if it exists) has been acked.
> > > > >
> > > > > In your case, since TASK_RUNNING was not acked by the framework,
> > > > > the master doesn't know about the TASK_FINISHED update that is
> > > > > queued up by the status update manager.
> > > > >
> > > > > If the framework never comes back, i.e., the failover timeout
> > > > > elapses, the master shuts down the framework, which releases those
> > > > > resources.
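
That ack-gated behavior, as a simplified model (not the actual status update
manager code, just an illustration of the per-task stream):

from collections import deque

class StatusUpdateStream:
    """Per-task stream: only the oldest unacked update can be in flight."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, state):
        self.pending.append(state)

    def next_to_send(self):
        # Only the front of the stream is (re)sent; everything behind it,
        # including a terminal TASK_FINISHED, stays queued.
        return self.pending[0] if self.pending else None

    def acknowledge(self, state):
        if self.pending and self.pending[0] == state:
            self.pending.popleft()

stream = StatusUpdateStream()
stream.enqueue("TASK_RUNNING")
stream.enqueue("TASK_FINISHED")

# The framework disconnected before acking TASK_RUNNING, so the slave keeps
# resending it and the master never learns about TASK_FINISHED.
print(stream.next_to_send())        # TASK_RUNNING

# Only once the framework comes back and acks does the terminal update reach
# the front of the stream.
stream.acknowledge("TASK_RUNNING")
print(stream.next_to_send())        # TASK_FINISHED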
> > > > >
> > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen
> > > > > <nik...@mesosphere.io> wrote:
> > > > >
> > > > > > Here is the log of a mesos-local instance where I reproduced it:
> > > > > > https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10
> > > > > > to 19 are stuck in the running state).
> > > > > > There is a lot of output, so here is a filtered log for task 10:
> > > > > > https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > >
> > > > > > At first glance, it looks like the task can't be found when
> > > > > > trying to forward the finished update because the running update
> > > > > > never got acknowledged before the framework disconnected. I may
> > > > > > be missing something here.
> > > > > >
> > > > > > Niklas
> > > > > >
> > > > > >
> > > > > > On 10 September 2014 16:09, Niklas Nielsen <nik...@mesosphere.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > We have run into a problem that causes tasks which complete
> > > > > > > while a framework is disconnected (and has a failover timeout)
> > > > > > > to remain in a running state, even though the tasks actually
> > > > > > > finish.
> > > > > > >
> > > > > > > Here is a test framework we have been able to reproduce the
> > > > > > > issue with: https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > It launches many short-lived tasks (1 second sleep), and when
> > > > > > > killing the framework instance, the master reports the tasks
> > > > > > > as running even after several minutes:
> > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > >
> > > > > > > When clicking on one of the slaves where, for example, task 49
> > > > > > > runs, the slave knows that it completed:
> > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > >
> > > > > > > The tasks only finish when the framework connects again (which
> > > > > > > it may never do). This is on Mesos 0.20.0, but it also applies
> > > > > > > to HEAD (as of today).
> > > > > > > Do you guys have any insights into what may be going on here?
> > > > > > > Is this by design or a bug?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Niklas
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
