Okay - that only solves half of the problem for us: users will still see their tasks as running even though they have completed, but it is a first step.
Let's continue the discussion in a JIRA ticket; I'll create one shortly. Thanks for helping out!

Niklas

On 15 September 2014 18:17, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > Thanks for your input Ben! (Comments inlined)
> >
> > On 15 September 2014 12:35, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> > > To ensure that the architecture of Mesos remains a scalable one, we want to persist state in the leaves of the system as much as possible. This is why the master has never persisted tasks, task states, or status updates. Note that status updates can currently contain arbitrarily large amounts of data (an example of the impact: https://issues.apache.org/jira/browse/MESOS-1746).
> > >
> > > Adam, I think solution (2) you listed above, of including a terminal indicator in the _private_ message between slave and master, would easily allow us to release the resources at the correct time. We would still hold the correct task state in the master, and would maintain the status update stream invariants for frameworks (guaranteed ordered delivery). This would be simpler to implement with my recent change here, because you no longer have to remove the task to release the resources: https://reviews.apache.org/r/25568/
> >
> > "We would still hold the correct task state"; meaning the actual state of the task on the slave?
>
> Correct, as in, just maintaining the existing task state semantics in the master (for correctness of reconciliation / status update ordering).
>
> > Then the auxiliary field should represent the "last known" status, which may not necessarily be terminal. For example, a staging update followed by a running update while the framework is disconnected will still show as staging - or am I missing something?
> It would still show as staging; the running update would never be sent to the master, since the slave is not receiving acknowledgements. But if there were a terminal update, then the slave would set the auxiliary field in the StatusUpdateMessage and the master would know to release the resources.
>
> + backwards compatibility
>
> > > Longer term, adding pipelining of status updates would be a nice improvement (similar to what you listed in (1) above). But as Vinod said, it will require care to ensure we maintain the stream invariants for frameworks (i.e. we would probably need to send multiple updates in one message).
> > >
> > > How does this sound?
> >
> > Sounds great - we would love to help out with (2). Would you be up for shepherding such a change?
>
> Yep!
>
> The changes needed should be based off of https://reviews.apache.org/r/25568/ since it changes the resource releasing in the master.
>
> > > On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <a...@mesosphere.io> wrote:
> > > > Definitely relevant. If the master could be trusted to persist all the task status updates, then they could be queued up at the master instead of the slave once the master has acknowledged their receipt. Then the master could have the most up-to-date task state and can recover the resources as soon as it receives a terminal update, even if the framework is disconnected and unable to receive/ack the status updates. Then, once the framework reconnects, the master will be responsible for sending its queued status updates. We will still need a queue on the slave side, but only for updates that the master has not persisted and ack'ed, primarily during the scenario when the slave is disconnected from the master.
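For illustration, here is a minimal Python sketch of the approach Ben endorses above: the slave keeps forwarding only the front of the (unacknowledged) update stream, but annotates the private slave-to-master message with the latest state it knows, so the master can release resources on a terminal state without removing the task. This is not Mesos code; the `latest_state` field name, class names, and dictionary bookkeeping are all invented for this sketch.

```python
# Hypothetical model of solution (2); not actual Mesos code.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

class StatusUpdateMessage:
    """Private slave->master message (field names are assumptions)."""
    def __init__(self, update, latest_state):
        self.update = update              # front of the stream (oldest unacked)
        self.latest_state = latest_state  # last state known to the slave

class Master:
    def __init__(self):
        self.task_states = {}     # task_id -> state visible to frameworks/UI
        self.used_resources = {}  # task_id -> resources currently held

    def launch(self, task_id, resources):
        self.task_states[task_id] = "TASK_STAGING"
        self.used_resources[task_id] = resources

    def handle_status_update(self, task_id, message):
        # Existing semantics: the visible task state follows the ordered
        # stream, keeping reconciliation and update ordering correct.
        self.task_states[task_id] = message.update
        # New behavior: release resources as soon as the slave reports a
        # terminal latest state, without removing the task.
        if message.latest_state in TERMINAL_STATES:
            self.used_resources.pop(task_id, None)

master = Master()
master.launch("task-10", {"cpus": 1})
# TASK_RUNNING was never acked by the framework, so the slave resends it,
# but annotated with the terminal state the slave already knows about.
master.handle_status_update(
    "task-10", StatusUpdateMessage("TASK_RUNNING", "TASK_FINISHED"))
assert master.task_states["task-10"] == "TASK_RUNNING"  # stream invariant kept
assert "task-10" not in master.used_resources           # resources released
```

The key property of the sketch is that the framework-facing stream is untouched: only the private bookkeeping about resources changes.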
> > > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > The semantics of these changes would have an impact on the upcoming task reconciliation.
> > > >
> > > > @BenM: Can you chime in here on how this fits into the task reconciliation work that you've been leading?
> > > >
> > > > On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:
> > > > > I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from being recovered in a timely manner. I see a couple of options for getting around this, each with their own disadvantages.
> > > > >
> > > > > 1) Send the entire status update stream to the master. Once the master sees the terminal status update, it will removeTask and recover the resources. Future resends of the update will be forwarded to the scheduler, but the master will ignore the subsequent updates (with a warning and invalid_update++ metrics) as far as its own state for the removed task is concerned. Disadvantage: potentially sends a lot of status update messages until the scheduler reregisters and acknowledges the updates. Disadvantage 2: updates could be sent to the scheduler out of order if some updates are dropped between the slave and master.
> > > > >
> > > > > 2) Send only the oldest status update to the master, but with an annotation of the final/terminal state of the task, if any.
> > > > > That way the master can call removeTask to update its internal state for the task (and update the UI) and recover the resources for the task. While the scheduler is still down, the oldest update will continue to be resent and forwarded, but the master will ignore the update (with a warning as above) as far as its own internal state is concerned. When the scheduler reregisters, the update stream will be forwarded and acknowledged one at a time as before, guaranteeing status update ordering to the scheduler. Disadvantage: seems a bit hacky to tack a terminal state onto a running update. Disadvantage 2: the state endpoint won't show all the status updates until the entire stream actually gets forwarded and acknowledged.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > > > The main reason is to keep the status update manager simple. Also, it is very easy to enforce the order of updates to the master/framework in this model. If we allow multiple updates for a task to be in flight, it's really hard (impossible?) to ensure that we are not delivering out-of-order updates, even in edge cases (failover, network partitions, etc.).
> > > > > >
> > > > > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > Hey Vinod - thanks for chiming in!
> > > > > > >
> > > > > > > Is there a particular reason for only having one status in flight?
> > > > > > > Or, to put it another way, isn't that behavior too strict, given that the master state could present the most recent known state if the status update manager sent more than the front of the stream? With very long failover timeouts, just waiting for those to elapse seems a bit tedious and hogs the cluster.
> > > > > > >
> > > > > > > Niklas
> > > > > > >
> > > > > > > On 10 September 2014 17:18, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > > > > > What you observed is expected because of the way the slave (specifically, the status update manager) operates.
> > > > > > > >
> > > > > > > > The status update manager only sends the next update for a task if the previous update (if it exists) has been acked.
> > > > > > > >
> > > > > > > > In your case, since TASK_RUNNING was not acked by the framework, the master doesn't know about the TASK_FINISHED update that is queued up by the status update manager.
> > > > > > > >
> > > > > > > > If the framework never comes back, i.e., the failover timeout elapses, the master shuts down the framework, which releases those resources.
> > > > > > > >
> > > > > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > > > Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are stuck in the running state).
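To make Vinod's explanation concrete, here is a small Python model of an acknowledgement-gated status update manager. It is a sketch of the described behavior, not the actual slave code (the real implementation is C++ inside the Mesos slave); it shows how an unacked TASK_RUNNING keeps TASK_FINISHED queued and invisible to the master.

```python
from collections import deque

class StatusUpdateManager:
    """Toy per-task model: at most one update in flight at a time."""
    def __init__(self):
        self.stream = deque()  # pending updates, oldest first
        self.forwarded = []    # what the master has actually seen

    def update(self, state):
        self.stream.append(state)
        if len(self.stream) == 1:
            # Nothing in flight: forward immediately.
            self.forwarded.append(state)

    def acknowledge(self):
        # Ack arrives for the front of the stream; forward the next update.
        self.stream.popleft()
        if self.stream:
            self.forwarded.append(self.stream[0])

mgr = StatusUpdateManager()
mgr.update("TASK_RUNNING")   # forwarded to the master
mgr.update("TASK_FINISHED")  # queued: TASK_RUNNING is still unacked

# While the framework is disconnected and never acks, the master only
# ever sees TASK_RUNNING, so the task looks stuck even though it finished.
assert mgr.forwarded == ["TASK_RUNNING"]

mgr.acknowledge()            # framework reconnects and acks TASK_RUNNING
assert mgr.forwarded == ["TASK_RUNNING", "TASK_FINISHED"]
```

This is exactly the trade-off discussed earlier in the thread: the one-in-flight rule makes ordered delivery trivial, at the cost of hiding terminal states from the master while acks are missing.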
> > > > > > > > > There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > > > > >
> > > > > > > > > At first glance, it looks like the task can't be found when trying to forward the finished update, because the running update never got acknowledged before the framework disconnected. I may be missing something here.
> > > > > > > > >
> > > > > > > > > Niklas
> > > > > > > > >
> > > > > > > > > On 10 September 2014 16:09, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > > > > Hi guys,
> > > > > > > > > >
> > > > > > > > > > We have run into a problem that causes tasks which complete while a framework is disconnected and has a failover time to remain in a running state, even though the tasks actually finish.
> > > > > > > > > > Here is a test framework with which we have been able to reproduce the issue: https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > > > > It launches many short-lived tasks (1 second sleep), and when the framework instance is killed, the master reports the tasks as running even after several minutes:
> > > > > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > > > > >
> > > > > > > > > > When clicking on one of the slaves where, for example, task 49 ran, the slave knows that it completed:
> > > > > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > > > > >
> > > > > > > > > > The tasks only finish when the framework connects again (which it may never do). This is on Mesos 0.20.0, but it also applies to HEAD (as of today).
> > > > > > > > > > Do you guys have any insights into what may be going on here? Is this by design or a bug?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Niklas
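As Vinod notes earlier in the thread, without any fix the resources of such "stuck" tasks are only recovered once the disconnected framework's failover timeout elapses and the master shuts the framework down. A rough Python sketch of that timeline follows; the class, method names, and timing model are invented for illustration and do not correspond to actual Mesos internals.

```python
class Master:
    """Toy model: failover timeout as the only resource-recovery path."""
    def __init__(self, failover_timeout):
        self.failover_timeout = failover_timeout
        self.tasks = {}            # task_id -> state visible in the UI
        self.disconnected_at = None

    def launch(self, task_id):
        self.tasks[task_id] = "TASK_RUNNING"

    def framework_disconnected(self, now):
        self.disconnected_at = now

    def tick(self, now):
        # Once the failover timeout elapses, the master shuts down the
        # framework, which removes its tasks and releases their resources.
        if (self.disconnected_at is not None
                and now - self.disconnected_at >= self.failover_timeout):
            self.tasks.clear()

master = Master(failover_timeout=3600)  # e.g. a one-hour failover timeout
master.launch("task-49")
master.framework_disconnected(now=0)
# The task actually finished on the slave, but the TASK_FINISHED update is
# stuck behind an unacked TASK_RUNNING, so the master still shows it running.
master.tick(now=600)
assert master.tasks == {"task-49": "TASK_RUNNING"}

master.tick(now=3600)  # failover timeout elapses
assert master.tasks == {}  # resources finally recovered
```

This illustrates Niklas's complaint: with a very long failover timeout, the cluster's resources stay hogged for the whole window, which is what solution (2) above is meant to avoid.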