Re: Reconciliation Document

Benjamin Mahler Mon, 03 Nov 2014 10:48:06 -0800

Thanks! Do you have the master logs?

On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <
sschlans...@opentable.com> wrote:


> Hi,
> I'm the poor end user in question :)
>
> I have the Singularity logs from task reconciliation saved here:
>
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> The last line in the log file sums it up pretty well -
> INFO  [2014-10-30 19:24:21,948]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task
> reconciliation ended after 50 checks and 25:00.188
>
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <benjamin.mah...@gmail.com>
> wrote:
>
> > I don't think this is related to your retry timeout, but it's very
> difficult to diagnose this without logs or a more thorough description of
> what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer that this was related to
> reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <wsoren...@hubspot.com>
> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the
> # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds and one
> user saw it take 30 minutes to eventually reconcile 25 task statuses (after
> seeing all slaves crash and a master failover -- although that's another
> issue.)
> >
> >
> >
> >
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <
> benjamin.mah...@gmail.com> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <spod...@netflix.com>
> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
> benjamin.mah...@gmail.com> wrote:
> > Thanks for the thoughtful questions, I will take these into account in
> the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling
> task T on slave S, and slave S has not yet re-registered, we cannot reply
> until the slave is re-registered or removed. Also, if a slave is
> transitioning (being removed), we want to make sure that operation finishes
> before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we
> can answer. However, we chose to drop and remain silent for these tasks.
> This is both for implementation simplicity and as a defense against OOMing
> from too many pending reconciliation requests.
> >
> > I was thinking that the state machine that maintains the state of tasks
> always has answers for the current state. Therefore, I don't expect any
> blocking. For example, if S hasn't yet re-registered, the state machine
> must think that the state of T is still 'running' until either the slave
> re-registers and informs of the task being lost, or a timeout occurs after
> which master decides the slave is gone. At which point a new status update
> can be sent. I don't see a reason why reconcile needs to wait until slave
> re-registers here. Maybe I am missing something else? Same with
> transitioning... the state information is always available, say, as
> running, until transition happens. This results in two status updates, but
> always correct.
> >
> > Task state in Mesos is persisted in the leaves of the system (the
> slaves) for scalability reasons. So when a new master starts up, it doesn't
> know anything about tasks; this state is bootstrapped from the slaves as
> they re-register. This interim period of state recovery is when frameworks
> may not receive answers to reconciliation requests, depending on whether
> the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST
> update for all non-terminal tasks on the slave. There's little benefit in
> replying to a reconciliation request while it's being removed, because LOST
> updates are coming shortly thereafter. You can think of these LOST updates
> as the reply to the reconciliation request, as far as the scheduler is
> concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately, while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about
> these kinds of issues; we want them to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kinds
> of framework side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
> >
> >
> >
> >
> >
> > (2) Any time bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually
> receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in
> timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently
> the driver does not explicitly expose the event queue to the scheduler, and
> so when you call reconcile, you may have an event queue in the driver full
> of status updates. Because of this lack of visibility, picking an initial
> timeout will depend on your scheduler's update processing speed and scale
> (# expected status updates). Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API.
> It makes the queue explicit, which lets you avoid reconciling while you
> have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> >
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java
> callback statusUpdate() receives a queue instead of the current version
> with just one TaskStatus argument? I suppose this could be useful, yes. In
> that case, the acknowledgement of receiving the task statuses is sent to the
> master once for the entire queue of task statuses, which may be OK.
> >
> > You would always receive a queue of events, which you can store and
> process asynchronously (the key to enabling this was making
> acknowledgements explicit). Sorry for the tangent, keep an eye out for
> discussions related to the new API / HTTP API changes.
> >
> >
> >
> >
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task
> because you have not received an answer soon enough, you may make the wrong
> decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback
> because it can lead to a snowball effect if tasks are treated
> independently. In the event of a backlog, what's to stop you from killing
> all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last
> resort, when not receiving a response after a large amount of time and a
> large number of reconciliation requests.
> >
> > Yes, that is the timeout value I was after. However, based on my
> response to #1, this could be short, couldn't it?
> >
> > Yes it could be on the order of seconds to start with.
> >
> >
> >
> >
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of
> incoming messages from a particular framework, so the number of updates
> sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a
> problematic framework that restarts frequently (due to a bug, for
> example)? This would keep Mesos master busy sending reconcile task updates
> to it constantly.
> >
> > You're right, it's an orthogonal problem to address since it applies
> broadly to other messages (e.g. framework sending 100MB tasks).
> >
> >
> > Thanks.
> >
> > Sharma
> >
> >
> >
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <spod...@netflix.com>
> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks
> multiple times after waiting some time in step 6? Shouldn't it be just once
> per (re)registration?
> >
> > Are there time bound guarantees within which a task update will be sent
> out after a reconcile request is sent? In the algorithm for task
> reconciliation, what would be a good timeout after which we conclude that
> we got no task update from the master? Upon such a timeout, I would be
> tempted to conclude that the task has disappeared. In which case, I would
> call driver.killTask() (to be sure it's marked as gone), mark my task as
> terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task
> reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
> benjamin.mah...@gmail.com> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can
> see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high level feedback on it from framework
> developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
> >
> >
> >
> >
> >
> >
>
>
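
The retry-with-backoff advice in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not Mesos or Singularity code: the names backoff_delays, reconcile, request_reconcile and wait_for_updates are hypothetical, and the 5 second initial delay and 300 second cap are assumed values picked to match the "order of seconds to start with" guidance (the thread does not give a cap). For comparison, a fixed 30 second interval over 50 checks adds up to 1500 seconds, which is the 25 minutes reported in the Singularity log line quoted above.

```python
import itertools
from typing import Callable, Iterator, Set


def backoff_delays(initial: float, cap: float, factor: float = 2.0) -> Iterator[float]:
    """Yield an endless sequence of retry delays that grows by `factor`
    from `initial` and never exceeds `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def reconcile(
    pending: Set[str],
    request_reconcile: Callable[[Set[str]], None],
    wait_for_updates: Callable[[float], Set[str]],
    initial: float = 5.0,   # assumed value, not from the thread
    cap: float = 300.0,     # assumed value, not from the thread
) -> int:
    """Re-send reconciliation requests until every pending task ID has
    received a status update. Returns the number of requests sent.

    request_reconcile: hypothetical stand-in for the scheduler's
        reconcile call; receives the task IDs still unanswered.
    wait_for_updates: hypothetical helper that waits up to the given
        number of seconds and returns the task IDs whose status arrived.
    """
    remaining = set(pending)
    delays = backoff_delays(initial, cap)
    requests = 0
    while remaining:
        # Ask only about tasks that are still unanswered.
        request_reconcile(set(remaining))
        requests += 1
        # Wait longer after each unanswered round instead of a fixed interval.
        remaining -= wait_for_updates(next(delays))
    return requests


if __name__ == "__main__":
    # First eight delays with the assumed 5 s initial delay and 300 s cap.
    print(list(itertools.islice(backoff_delays(5.0, 300.0), 8)))
```

With these assumed values the delays run 5, 10, 20, 40, 80, 160, 300, 300 seconds and then stay at the cap. The loop deliberately never kills a task on a missed round; it keeps asking, in line with the caution above against acting without feedback.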
