> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test
> > which exercises the new scenarios?
>
> Brenden Matthews wrote:
>     If I get some time, I'll write a test. I've been testing it in
>     production for a few days though.
>
>     Not sure about consensus. Would like to hear from the others.
>
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from
>     the slave? That seems more severe. When we added reconcileTasks we
>     specifically decided that we would not send status updates for all possible
>     tasks precisely because we could get into some incorrect situations.
>
>     Regarding Case 2, why is a framework losing track of running tasks?
>     That's either a bug in the framework or it isn't keeping track of tasks in
>     the first place. Maybe we need a different API call that returns the list of
>     tasks and statuses that the master knows about?
>
> Brenden Matthews wrote:
>     The original problem I tried to solve with this actually turned out to be
>     caused by a bug in marathon (
>     https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4
>     ).
>
>     That said, it seems confusing to me that the reconciliation wouldn't
>     include updates for tasks which either the master or the framework don't know
>     about.
>
>     I'm fine with also having a separate API call. What about using the
>     status timestamps to avoid some of the incorrect situations?
>
> Niklas Nielsen wrote:
>     Is this patch still relevant? It seems that improving reconciliation
>     guarantees is already a part of the post-registrar tasks. If not, can we drop
>     it? :)

We can drop this for now, though I'll probably keep it in my branch.
Hopefully we can get all the reconciliation reconciled.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2013, 12:30 a.m.)
>
>
> Review request for mesos and Niklas Nielsen.
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> Fixed some task reconciliation cases.
>
> Case 1:
>
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost. It's possible that the following events
> occurred:
>
> 1) Framework disconnected from master
> 2) Master terminated framework's tasks
> 3) Framework reconnects to master, and (incorrectly) assumes tasks are
>    still running
>
> Case 2:
>
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
>
> Review: https://reviews.apache.org/r/15745
>
>
> Diffs
> -----
>
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171
>
> Diff: https://reviews.apache.org/r/15745/diff/
>
>
> Testing
> -------
>
> `make check` & tested in staging cluster.
>
>
> Thanks,
>
> Brenden Matthews
>
>
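
For anyone reading the archive later, here is a rough sketch of the two cases from the description above. It is an illustrative Python model only, not the code in src/master/master.cpp. The names `TaskStatus` and `reconcile` and the dict/set inputs are hypothetical, `master_tasks` is assumed to hold only the tasks of the framework being reconciled, and staying silent about unknown slaves is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStatus:
    # Hypothetical stand-in for a task status message.
    task_id: str
    slave_id: str
    state: str


def reconcile(master_tasks, known_slaves, framework_statuses):
    """Return the status updates the master would send to the framework.

    master_tasks: dict mapping (slave_id, task_id) -> state as the master
        sees it (assumed to contain only this framework's tasks).
    known_slaves: set of slave IDs currently known to the master.
    framework_statuses: list of TaskStatus as the framework believes them.
    """
    updates = []
    mentioned = set()

    for status in framework_statuses:
        key = (status.slave_id, status.task_id)
        mentioned.add(key)

        if status.slave_id not in known_slaves:
            # Assumption: the master says nothing about unknown slaves.
            continue

        if key not in master_tasks:
            # Case 1: the slave is known but the task is not, so assume
            # the task has been lost.
            updates.append(
                TaskStatus(status.task_id, status.slave_id, "TASK_LOST"))
        elif master_tasks[key] != status.state:
            # State change: report the state the master knows about.
            updates.append(
                TaskStatus(status.task_id, status.slave_id, master_tasks[key]))

    # Case 2: also report running tasks the framework did not mention.
    # Sorted only to make the output order deterministic.
    for (slave_id, task_id), state in sorted(master_tasks.items()):
        if (slave_id, task_id) not in mentioned and state == "TASK_RUNNING":
            updates.append(TaskStatus(task_id, slave_id, state))

    return updates
```

Note that Case 1 is exactly the branch Ben was worried about: if the master's view is stale, emitting TASK_LOST for a task it cannot find may be wrong, which is why status timestamps or a separate API call came up in the thread.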
