Re: Trying to get task reconciliation to work

Benjamin Mahler Fri, 18 Apr 2014 12:05:44 -0700

>
> So task reconciliation will always tell me if a task is finished when the
> slave is still running



No, as that would imply we kept infinite task history in the Master. As
soon as you get your final update, assume you will get TASK_LOST for
subsequent reconciliations.
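
One way a framework could act on this is to treat a stored terminal state as final and ignore any later TASK_LOST that comes back from reconciliation. The following is only a minimal sketch under stated assumptions: `TaskStore`, `apply_update`, and `state_of` are hypothetical names, not part of the Mesos API, and the in-memory dict stands in for whatever durable storage the framework really uses.

```python
# Terminal task states; once one is stored, later updates are ignored.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class TaskStore:
    """Hypothetical framework-side record of the last stored state per task.

    Assumption: a real framework would write to durable storage here;
    a dict is used only to keep the sketch self-contained.
    """

    def __init__(self):
        self._states = {}

    def apply_update(self, task_id, state):
        """Store an update unless a terminal state is already recorded.

        Returns True if the update was stored, False if it was ignored.
        """
        current = self._states.get(task_id)
        if current in TERMINAL_STATES:
            # The final update already arrived. A later TASK_LOST from
            # reconciliation only means the Master no longer knows the task.
            return False
        self._states[task_id] = state
        return True

    def state_of(self, task_id):
        """Return the stored state, or None if the task was never seen."""
        return self._states.get(task_id)


store = TaskStore()
store.apply_update("task-1", "TASK_RUNNING")
store.apply_update("task-1", "TASK_FINISHED")
# Reply to a later reconciliation request for the same task:
applied = store.apply_update("task-1", "TASK_LOST")
print(applied, store.state_of("task-1"))
```

With this shape, the stored TASK_FINISHED survives even if the framework keeps reconciling the task afterwards.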

> If so, these semantics are very convenient for
> frameworks that fail to failover in a timely manner, and then ask for tasks
> that belonged to their previous FrameworkID.


What problem are you solving here? That sounds a bit bizarre, because we try
to provide isolation between frameworks and so avoid leaking information
across them. I realize this is what Vinod mentioned, although the API
doesn't allow you to ask about a different framework's tasks.

But taking a step back, why don't you set an infinite failover timeout for
your framework if you want to make sure your tasks can be recovered?



On Fri, Apr 18, 2014 at 11:20 AM, David Greenberg <dsg123456...@gmail.com> wrote:

> So task reconciliation will always tell me if a task is finished when the
> slave is still running, and it will give me TASK_LOST if the slave or task
> is unknown to the master? If so, these semantics are very convenient for
> frameworks that fail to failover in a timely manner, and then ask for tasks
> that belonged to their previous FrameworkID.
>
>
> On Fri, Apr 18, 2014 at 1:55 PM, Benjamin Mahler
> <benjamin.mah...@gmail.com> wrote:
>
> > Vinod, David is asking about tasks that "belong" to the framework in that
> > they were "launched" by it, in which case your answer is not correct. We
> > don't keep track of tasks so we don't know whether the task "belongs" to
> > the framework in this sense.
> >
> > David, you will either receive TASK_LOST or nothing (if the slave for
> > the task is in a transient state).
> >
> > This is determined more so by the SlaveID than the TaskID as the Master
> > does not persistently track tasks.
> >
> > (a) If you're asking about an unknown slave, you will get TASK_LOST.
> > (b) If you're asking about a known slave and an unknown task, you will
> > get TASK_LOST.
> > (c) If you're asking about a known slave and a known task with a
> > different state, you will be sent the latest state.
> >
> > If you consider these semantics, you'll realize that you may receive
> > TASK_LOST if you try to reconcile your task that finished correctly. This
> > is why I mentioned the need to persist updates in (1) above. Let's say you
> > receive a terminal update of TASK_FINISHED and then you still try to
> > reconcile against a failed-over Master. This new Master will reply with
> > TASK_LOST because it is unaware of the task/slave. So, you will always
> > receive your valid terminal update before getting a TASK_LOST from
> > reconciliation.
> >
> >
> > On Fri, Apr 18, 2014 at 10:46 AM, Vinod Kone <vinodk...@gmail.com> wrote:
> >
> >> If a framework asks to reconcile a task that doesn't belong to it there
> >> would be no response from the master. This is nice because it avoids
> >> information leak between frameworks.
> >>
> >>
> >> On Fri, Apr 18, 2014 at 5:04 AM, David Greenberg <dsg123456...@gmail.com> wrote:
> >>
> >> > Piggybacking onto this thread with a follow up question: what happens if
> >> > you ask the master to reconcile some tasks that weren't launched by your
> >> > framework? Will you get messages that express those tasks were unknown,
> >> > lost, or will nothing respond?
> >> >
> >> >
> >> > On Thursday, April 17, 2014, Sharma Podila <spod...@netflix.com> wrote:
> >> >
> >> >> No problem, I have a better understanding now.
> >> >> And it was useful to see the three items you listed explicitly.
> >> >>
> >> >>
> >> >> On Thu, Apr 17, 2014 at 2:39 PM, Benjamin Mahler <
> >> >> benjamin.mah...@gmail.com> wrote:
> >> >>
> >> >> Good to see you were playing around with reconciliation, we should have
> >> >> made the current semantics more clear. Especially in light of the fact that
> >> >> it's not implemented fully until one uses a strict registrar (likely
> >> >> 0.20.0).
> >> >>
> >> >> Think of reconciliation as the fallback mechanism to ensure that state is
> >> >> consistent, it's not designed to be something to inform you of things you
> >> >> were already told (in this case, that the tasks were running). Although we
> >> >> could consider sending updates even when task state remains the same.
> >> >>
> >> >>
> >> >> For the purpose of this conversation, let's say we're in the 0.20.0
> >> >> world, operating with the registrar. And let's assume your goal is to build
> >> >> a highly available framework (I will be documenting how to do this for
> >> >> 0.20.0):
> >> >>
> >> >> (1) *When you receive a status update, you must persist this information
> >> >> before returning from the statusUpdate() callback*. Once you return from
> >> >> the callback, the driver will acknowledge the slave directly. Slaves will
> >> >> retry status update delivery *until* the acknowledgement is received from
> >> >> the scheduler driver in order to ensure that the framework processed the
> >> >> update.
> >> >>
> >> >> (2) *When you receive a "slave lost" signal, it means that your tasks
> >> >> that were running on that slave are in state TASK_LOST*, and any
> >> >> reconciliation you perform for these tasks will result in a reply of
> >> >> TASK_LOST. Most of the time we'll deliver these TASK_LOST automatically,
> >> >> but with a confluence of Master *and* Slave failovers, we are unaware of
> >> >> which tasks were running on the slave as we do not persist this information
> >> >> in the Master.
> >> >>
> >> >> (3) To guarantee that you have a consistent view of task states, *you
> >> >> must also periodically reconcile task state against the Master*. This is
> >> >> only because the delivery of the "slave lost" signal in (2) is not reliable
> >> >> (the Master could fail over after removing a slave but before telling
> >> >> frameworks that the slave was lost).
> >> >>
> >> >> You'll notice that this model forces one to serially persist all status
> >> >> update changes. We are planning to expose mechanisms to allow "batch"
> >> >> acknowledgement of status updates in the lower-level API that benh has
> >> >> given talks about. With a lower-level API, it is possible to build more
> >> >> powerful libraries that hide much of these details!
> >> >>
> >> >> You'll also perhaps notice that only (1) and (3) are strictly required
> >> >> for consistency, but (2) is highly recommended as the vast majority of the
> >> >> time the "slave lost" signal will be delivered and you can take action
> >> >> quickly, without having to rely on periodic reconciliation.
> >> >>
> >> >> Please let me know if anything here was not clear!
> >> >>
> >> >>
> >> >> On Thu, Apr 17, 2014 at 1:47 PM, Sharma Podila <spod...@netflix.com> wrote:
> >> >>
> >> >> Should've looked at the code before sending the previous email...
> >> >> master/main.cpp confirmed what I needed to know. It doesn't look like I
> >> >> will be able to use reconcileTasks the way I thought I could. Effectively,
> >> >> a lack of callback could either mean that the master agrees with the
> >> >> requested reconcile task state, or that the task and/or slave is currently
> >> >> unknown. Which makes it an unreliable source of data. I understand this is
> >> >> expected to improve later by leveraging the registrar, but, I suspect
> >> >> there's more to it.
> >> >>
> >> >> I take it then that individual frameworks need to have their own
> >> >> mechanisms to ascertain the state of their tasks.
> >> >>
> >> >>
> >> >> On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <spod...@netflix.com> wrote:
> >> >>
> >> >> Hello
> >> >>
> >> >>
> >>
> >
> >
>
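
The reply rules (a) through (c) quoted above, together with the "nothing" case for a slave in a transient state, can be modeled roughly as follows. This is a toy sketch and not the Master's actual code: `reconcile_reply`, `master_view`, and `transient_slaves` are hypothetical names, and the final branch (no reply when the states already agree) is inferred from the remark that updates are not currently sent when task state remains the same.

```python
def reconcile_reply(master_view, transient_slaves, slave_id, task_id, claimed_state):
    """Return the state the Master would send back, or None for no reply.

    master_view:      dict of slave_id -> {task_id: latest_state}; this is
                      everything the (hypothetical) Master knows in memory.
    transient_slaves: set of slave ids currently in a transient state.
    claimed_state:    the state the framework believes the task is in.
    """
    if slave_id in transient_slaves:
        return None  # slave in a transient state: nothing is sent
    if slave_id not in master_view:
        return "TASK_LOST"  # (a) unknown slave
    tasks = master_view[slave_id]
    if task_id not in tasks:
        return "TASK_LOST"  # (b) known slave, unknown task
    latest = tasks[task_id]
    if latest != claimed_state:
        return latest  # (c) known task in a different state
    return None  # states agree: no update is sent


view = {"slave-1": {"task-1": "TASK_RUNNING"}}
print(reconcile_reply(view, set(), "slave-2", "task-9", "TASK_RUNNING"))
print(reconcile_reply(view, set(), "slave-1", "task-9", "TASK_RUNNING"))
print(reconcile_reply(view, set(), "slave-1", "task-1", "TASK_STAGING"))
# A failed-over Master with an empty view, asked about a finished task:
print(reconcile_reply({}, set(), "slave-1", "task-1", "TASK_FINISHED"))
```

The last call shows why a task that finished correctly can still come back as TASK_LOST: the new Master has no record of the slave or the task.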
