Thanks! Do you have the master logs?

On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <sschlans...@opentable.com> wrote:
> Hi,
> I'm the poor end user in question :)
>
> I have the Singularity logs from task reconciliation saved here:
>
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> The last line in the log file sums it up pretty well -
> INFO [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188
>
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
>
> > I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer that this was related to reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <wsoren...@hubspot.com> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue.)
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <spod...@netflix.com> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> > Thanks for the thoughtful questions, I will take these into account in the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> >
> > I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and informs of the task being lost, or a timeout occurs after which master decides the slave is gone. At which point a new status update can be sent. I don't see a reason why reconcile needs to wait until slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until transition happens. This results in two status updates, but always correct.
> >
> > Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave.
> > There's little benefit of replying to a reconciliation request while it's being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about these kinds of issues, we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully as the project moves forward, we can have these kinds of framework side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
> >
> > (2) Any time bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# expected status updates).
> > Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> > https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgement of receiving the task status is sent to master once per the entire queue of task statuses. Which may be OK.
> >
> > You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent, keep an eye out for discussions related to the new API / HTTP API changes.
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> >
> > Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
> >
> > Yes it could be on the order of seconds to start with.
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep Mesos master busy sending reconcile task updates to it constantly.
> >
> > You're right, it's an orthogonal problem to address since it applies broadly to other messages (e.g. framework sending 100MB tasks).
> >
> > Thanks.
> >
> > Sharma
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <spod...@netflix.com> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
> >
> > Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high level feedback on it from framework developers.
> > Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
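
For anyone following along, the retry advice in this thread (start on the order of seconds, back off exponentially, cap the delay, and don't kill tasks just because an answer is late) can be sketched roughly as below. This is only a sketch under assumptions: `send_reconcile` and `wait_for_updates` are hypothetical callbacks standing in for whatever the framework uses to call the driver and collect status updates, and the default initial delay, cap, and attempt limit are illustrative numbers, not values recommended anywhere in the thread.

```python
from typing import Callable, Iterable, Iterator, Set


def backoff_delays(initial: float, cap: float, factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays that grow geometrically from `initial` and never exceed `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def reconcile_until_answered(
    task_ids: Iterable[str],
    send_reconcile: Callable[[Set[str]], None],     # hypothetical: asks the master about these tasks
    wait_for_updates: Callable[[float], Set[str]],  # hypothetical: waits up to N seconds, returns answered task ids
    initial: float = 1.0,      # illustrative default only
    cap: float = 60.0,         # illustrative default only
    max_attempts: int = 20,    # illustrative default only
) -> Set[str]:
    """Re-request reconciliation for unanswered tasks, backing off between attempts.

    Returns the task ids that were still unanswered when the attempts ran out.
    """
    remaining = set(task_ids)
    delays = backoff_delays(initial, cap)
    for _ in range(max_attempts):
        if not remaining:
            break
        # Only ask again about tasks that have not been answered yet.
        send_reconcile(set(remaining))
        remaining -= wait_for_updates(next(delays))
    return remaining
```

Whatever is left in the returned set has simply not been answered yet. Per the discussion above, treating that as grounds to kill tasks should be a last resort after a long time and many requests.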