Okay - that only solves half of the problem for us: users will still see their tasks as running even though they have completed, but it is a first step.
Let's continue the discussion in a JIRA ticket; I'll create one shortly. Thanks for helping out!

Niklas

On 15 September 2014 18:17, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > Thanks for your input Ben! (Comments inlined)
> >
> > On 15 September 2014 12:35, Benjamin Mahler <benjamin.mah...@gmail.com> wrote:
> > > To ensure that the architecture of Mesos remains a scalable one, we want to persist state in the leaves of the system as much as possible. This is why the master has never persisted tasks, task states, or status updates. Note that status updates can currently contain arbitrarily large amounts of data (an example of the impact: https://issues.apache.org/jira/browse/MESOS-1746).
> > >
> > > Adam, I think solution (2) you listed above, of including a terminal indicator in the _private_ message between slave and master, would easily allow us to release the resources at the correct time. We would still hold the correct task state in the master, and would maintain the status update stream invariants for frameworks (guaranteed ordered delivery). This would be simpler to implement with my recent change here, because you no longer have to remove the task to release the resources: https://reviews.apache.org/r/25568/
> >
> > "We would still hold the correct task state"; meaning the actual state of the task on the slave?
>
> Correct, as in, just maintaining the existing task state semantics in the master (for correctness of reconciliation / status update ordering).
>
> > Then the auxiliary field should represent the "last known" status, which may not necessarily be terminal. For example, a staging update followed by a running update while the framework is disconnected will still show as staging - or am I missing something?
> It would still show as staging; the running update would never be sent to the master, since the slave is not receiving acknowledgements. But if there were a terminal update, then the slave would set the auxiliary field in the StatusUpdateMessage and the master would know to release the resources.
>
> + backwards compatibility
>
> > > Longer term, adding pipelining of status updates would be a nice improvement (similar to what you listed in (1) above). But as Vinod said, it will require care to ensure we maintain the stream invariants for frameworks (i.e. we would probably need to send multiple updates in one message).
> > >
> > > How does this sound?
> >
> > Sounds great - we would love to help out with (2). Would you be up for shepherding such a change?
>
> Yep!
>
> The changes needed should be based off of https://reviews.apache.org/r/25568/ since it changes the resource releasing in the master.
>
> > > On Thu, Sep 11, 2014 at 12:02 PM, Adam Bordelon <a...@mesosphere.io> wrote:
> > > > Definitely relevant. If the master could be trusted to persist all the task status updates, then they could be queued up at the master instead of the slave once the master has acknowledged their receipt. Then the master could have the most up-to-date task state and can recover the resources as soon as it receives a terminal update, even if the framework is disconnected and unable to receive/ack the status updates. Then, once the framework reconnects, the master will be responsible for sending its queued status updates. We will still need a queue on the slave side, but only for updates that the master has not persisted and ack'ed, primarily during the scenario when the slave is disconnected from the master.
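For illustration, here is a minimal Python sketch of the approach Ben endorses above: the slave keeps forwarding only the front of the (unacknowledged) update stream, but annotates the private slave-to-master message with the latest state it knows, so the master can release resources on a terminal state without removing the task. This is not Mesos code; the `latest_state` field name, class names, and dictionary bookkeeping are all invented for this sketch.

```python
# Hypothetical model of solution (2); not actual Mesos code.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

class StatusUpdateMessage:
    """Private slave->master message (field names are assumptions)."""
    def __init__(self, update, latest_state):
        self.update = update              # front of the stream (oldest unacked)
        self.latest_state = latest_state  # last state known to the slave

class Master:
    def __init__(self):
        self.task_states = {}     # task_id -> state visible to frameworks/UI
        self.used_resources = {}  # task_id -> resources currently held

    def launch(self, task_id, resources):
        self.task_states[task_id] = "TASK_STAGING"
        self.used_resources[task_id] = resources

    def handle_status_update(self, task_id, message):
        # Existing semantics: the visible task state follows the ordered
        # stream, keeping reconciliation and update ordering correct.
        self.task_states[task_id] = message.update
        # New behavior: release resources as soon as the slave reports a
        # terminal latest state, without removing the task.
        if message.latest_state in TERMINAL_STATES:
            self.used_resources.pop(task_id, None)

master = Master()
master.launch("task-10", {"cpus": 1})
# TASK_RUNNING was never acked by the framework, so the slave resends it,
# but annotated with the terminal state the slave already knows about.
master.handle_status_update(
    "task-10", StatusUpdateMessage("TASK_RUNNING", "TASK_FINISHED"))
assert master.task_states["task-10"] == "TASK_RUNNING"  # stream invariant kept
assert "task-10" not in master.used_resources           # resources released
```

The key property of the sketch is that the framework-facing stream is untouched: only the private bookkeeping about resources changes.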
> > > On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > The semantics of these changes would have an impact on the upcoming task reconciliation.
> > > >
> > > > @BenM: Can you chime in here on how this fits into the task reconciliation work that you've been leading?
> > > >
> > > > On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:
> > > > > I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from being recovered in a timely manner. I see a couple of options for getting around this, each with their own disadvantages.
> > > > >
> > > > > 1) Send the entire status update stream to the master. Once the master sees the terminal status update, it will removeTask and recover the resources. Future resends of the update will be forwarded to the scheduler, but the master will ignore the subsequent updates (with a warning and invalid_update++ metrics) as far as its own state for the removed task is concerned. Disadvantage: potentially sends a lot of status update messages until the scheduler reregisters and acknowledges the updates. Disadvantage 2: updates could be sent to the scheduler out of order if some updates are dropped between the slave and master.
> > > > >
> > > > > 2) Send only the oldest status update to the master, but with an annotation of the final/terminal state of the task, if any.
> > > > > That way the master can call removeTask to update its internal state for the task (and update the UI) and recover the resources for the task. While the scheduler is still down, the oldest update will continue to be resent and forwarded, but the master will ignore the update (with a warning as above) as far as its own internal state is concerned. When the scheduler reregisters, the update stream will be forwarded and acknowledged one at a time as before, guaranteeing status update ordering to the scheduler. Disadvantage: seems a bit hacky to tack a terminal state onto a running update. Disadvantage 2: the state endpoint won't show all the status updates until the entire stream actually gets forwarded and acknowledged.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > > > The main reason is to keep the status update manager simple. Also, it is very easy to enforce the order of updates to the master/framework in this model. If we allow multiple updates for a task to be in flight, it's really hard (impossible?) to ensure that we are not delivering out-of-order updates, even in edge cases (failover, network partitions, etc.).
> > > > > >
> > > > > > On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > Hey Vinod - thanks for chiming in!
> > > > > > >
> > > > > > > Is there a particular reason for only having one status in flight?
> > > > > > > Or, to put it another way, isn't that behavior too strict, given that the master state could present the most recent known state if the status update manager sent more than the front of the stream? With very long failover timeouts, just waiting for those to elapse seems a bit tedious and hogs the cluster.
> > > > > > >
> > > > > > > Niklas
> > > > > > >
> > > > > > > On 10 September 2014 17:18, Vinod Kone <vinodk...@gmail.com> wrote:
> > > > > > > > What you observed is expected because of the way the slave (specifically, the status update manager) operates.
> > > > > > > >
> > > > > > > > The status update manager only sends the next update for a task if the previous update (if it exists) has been acked.
> > > > > > > >
> > > > > > > > In your case, since TASK_RUNNING was not acked by the framework, the master doesn't know about the TASK_FINISHED update that is queued up by the status update manager.
> > > > > > > >
> > > > > > > > If the framework never comes back, i.e., the failover timeout elapses, the master shuts down the framework, which releases those resources.
> > > > > > > >
> > > > > > > > On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > > > Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are stuck in the running state).
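To make Vinod's explanation concrete, here is a small Python model of an acknowledgement-gated status update manager. It is a sketch of the described behavior, not the actual slave code (the real implementation is C++ inside the Mesos slave); it shows how an unacked TASK_RUNNING keeps TASK_FINISHED queued and invisible to the master.

```python
from collections import deque

class StatusUpdateManager:
    """Toy per-task model: at most one update in flight at a time."""
    def __init__(self):
        self.stream = deque()  # pending updates, oldest first
        self.forwarded = []    # what the master has actually seen

    def update(self, state):
        self.stream.append(state)
        if len(self.stream) == 1:
            # Nothing in flight: forward immediately.
            self.forwarded.append(state)

    def acknowledge(self):
        # Ack arrives for the front of the stream; forward the next update.
        self.stream.popleft()
        if self.stream:
            self.forwarded.append(self.stream[0])

mgr = StatusUpdateManager()
mgr.update("TASK_RUNNING")   # forwarded to the master
mgr.update("TASK_FINISHED")  # queued: TASK_RUNNING is still unacked

# While the framework is disconnected and never acks, the master only
# ever sees TASK_RUNNING, so the task looks stuck even though it finished.
assert mgr.forwarded == ["TASK_RUNNING"]

mgr.acknowledge()            # framework reconnects and acks TASK_RUNNING
assert mgr.forwarded == ["TASK_RUNNING", "TASK_FINISHED"]
```

This is exactly the trade-off discussed earlier in the thread: the one-in-flight rule makes ordered delivery trivial, at the cost of hiding terminal states from the master while acks are missing.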
> > > > > > > > > There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> > > > > > > > >
> > > > > > > > > At first glance, it looks like the task can't be found when trying to forward the finished update, because the running update never got acknowledged before the framework disconnected. I may be missing something here.
> > > > > > > > >
> > > > > > > > > Niklas
> > > > > > > > >
> > > > > > > > > On 10 September 2014 16:09, Niklas Nielsen <nik...@mesosphere.io> wrote:
> > > > > > > > > > Hi guys,
> > > > > > > > > >
> > > > > > > > > > We have run into a problem that causes tasks which complete while a framework is disconnected and has a failover time to remain in a running state, even though the tasks actually finish.
> > > > > > > > > > Here is a test framework with which we have been able to reproduce the issue: https://gist.github.com/nqn/9b9b1de9123a6e836f54
> > > > > > > > > > It launches many short-lived tasks (1 second sleep), and when the framework instance is killed, the master reports the tasks as running even after several minutes:
> > > > > > > > > > http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> > > > > > > > > >
> > > > > > > > > > When clicking on one of the slaves where, for example, task 49 ran, the slave knows that it completed:
> > > > > > > > > > http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> > > > > > > > > >
> > > > > > > > > > The tasks only finish when the framework connects again (which it may never do). This is on Mesos 0.20.0, but it also applies to HEAD (as of today).
> > > > > > > > > > Do you guys have any insights into what may be going on here? Is this by design or a bug?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Niklas
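As Vinod notes earlier in the thread, without any fix the resources of such "stuck" tasks are only recovered once the disconnected framework's failover timeout elapses and the master shuts the framework down. A rough Python sketch of that timeline follows; the class, method names, and timing model are invented for illustration and do not correspond to actual Mesos internals.

```python
class Master:
    """Toy model: failover timeout as the only resource-recovery path."""
    def __init__(self, failover_timeout):
        self.failover_timeout = failover_timeout
        self.tasks = {}            # task_id -> state visible in the UI
        self.disconnected_at = None

    def launch(self, task_id):
        self.tasks[task_id] = "TASK_RUNNING"

    def framework_disconnected(self, now):
        self.disconnected_at = now

    def tick(self, now):
        # Once the failover timeout elapses, the master shuts down the
        # framework, which removes its tasks and releases their resources.
        if (self.disconnected_at is not None
                and now - self.disconnected_at >= self.failover_timeout):
            self.tasks.clear()

master = Master(failover_timeout=3600)  # e.g. a one-hour failover timeout
master.launch("task-49")
master.framework_disconnected(now=0)
# The task actually finished on the slave, but the TASK_FINISHED update is
# stuck behind an unacked TASK_RUNNING, so the master still shows it running.
master.tick(now=600)
assert master.tasks == {"task-49": "TASK_RUNNING"}

master.tick(now=3600)  # failover timeout elapses
assert master.tasks == {}  # resources finally recovered
```

This illustrates Niklas's complaint: with a very long failover timeout, the cluster's resources stay hogged for the whole window, which is what solution (2) above is meant to avoid.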