> On Nov. 22, 2013, 12:04 p.m., Niklas Nielsen wrote:
> > Did we reach a conclusion regarding Case 1? And could we write a test 
> > that exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in 
> production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.
> 
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from 
> the slave? That seems more severe. When we added reconcileTasks we 
> specifically decided that we would not send status updates for all possible 
> tasks precisely because we could get into some incorrect situations.
>     
>     Regarding Case 2, why is a framework losing track of running tasks? 
> That's either a bug in the framework or it isn't keeping track of tasks in 
> the first place. Maybe we need a different API call that returns the list of 
> tasks and statuses that the master knows about?
> 
> Brenden Matthews wrote:
>     The original problem I tried to solve with this actually turned out to be 
> caused by a bug in Marathon 
> (https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4).
>     
>     That said, it seems confusing to me that the reconciliation wouldn't 
> include updates for tasks that either the master or the framework doesn't 
> know about.
>     
>     I'm fine with also having a separate API call.  What about using the 
> status timestamps to avoid some of the incorrect situations?

Is this patch still relevant? It seems that improving reconciliation guarantees 
is already part of the post-registrar tasks. If the patch is no longer needed, 
can we drop it? :)


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 21, 2013, 4:30 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 4:30 p.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>
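
For readers following the thread, the two cases in the review description
above can be sketched roughly as below. This is only an illustration under
stated assumptions, not the code from the patch: the types TaskState,
TaskStatus and MasterView and the function reconcile() are hypothetical
stand-ins, not the Mesos protobufs or the actual master.cpp logic. The
sketch also assumes that nothing is sent for a slave the master does not
know, which the thread does not specify.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; these are NOT the Mesos protobufs.
enum class TaskState { STAGING, RUNNING, FINISHED, LOST };

struct TaskStatus {
  std::string taskId;
  std::string slaveId;
  TaskState state;
};

// Hypothetical master view: slave ID -> (task ID -> state known to master).
using MasterView = std::map<std::string, std::map<std::string, TaskState>>;

// Returns the status updates the master would send back to the framework.
std::vector<TaskStatus> reconcile(
    const MasterView& master,
    const std::vector<TaskStatus>& reported)
{
  std::vector<TaskStatus> updates;
  std::set<std::pair<std::string, std::string>> seen;

  for (const TaskStatus& status : reported) {
    seen.emplace(status.slaveId, status.taskId);

    auto slave = master.find(status.slaveId);
    if (slave == master.end()) {
      // Unknown slave: send nothing (an assumption of this sketch).
      continue;
    }

    auto task = slave->second.find(status.taskId);
    if (task == slave->second.end()) {
      // Case 1: the slave is known but the task is not, so report it lost.
      updates.push_back({status.taskId, status.slaveId, TaskState::LOST});
    } else if (task->second != status.state) {
      // The framework's state is stale: report the master's state.
      updates.push_back({status.taskId, status.slaveId, task->second});
    }
  }

  // Case 2: report running tasks that the framework did not mention.
  for (const auto& slave : master) {
    for (const auto& task : slave.second) {
      if (task.second == TaskState::RUNNING &&
          seen.count(std::make_pair(slave.first, task.first)) == 0) {
        updates.push_back({task.first, slave.first, TaskState::RUNNING});
      }
    }
  }

  return updates;
}
```

Note that the Case 1 branch is the one questioned in the thread: a task
missing from the master's view is not necessarily lost, which is why
status timestamps or a separate API call were suggested as alternatives.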
