> On Jan. 27, 2014, 10:44 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, lines 725-728
> > <https://reviews.apache.org/r/16724/diff/3/?file=425219#file425219line725>
> >
> > I think you've brought this up before but did you figure out why a
> > completed executor has terminated tasks?
>
> Adam B wrote:
>     Not exactly, not yet. I'll look into this as I'm writing and running
>     tests.

Reproduced the flakiness in a unit test; smells like a race condition. I'll
dig into it further as a part of MESOS-906.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16724/#review32961
-----------------------------------------------------------


On Feb. 18, 2014, 6:07 p.m., Adam B wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16724/
> -----------------------------------------------------------
>
> (Updated Feb. 18, 2014, 6:07 p.m.)
>
>
> Review request for mesos, Benjamin Hindman, Ben Mahler, Niklas Nielsen, and
> Vinod Kone.
>
>
> Bugs: MESOS-767
>     https://issues.apache.org/jira/browse/MESOS-767
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> Added completed frameworks/tasks to slave re-registration.
> Fixes MESOS-767.
>
> Additional issues discovered during investigation:
> - MESOS-905: Remove Framework.id in favor of FrameworkInfo.id
> - MESOS-906: Last task in Completed Framework never graduates from
>   terminatedTasks to completedTasks.
> - Completed frameworks/executors/tasks are stored in circular buffers,
>   and these may overflow in different orders on different slaves.
>   BenH proposes an archive to replace these circular buffers.
>
>
> Diffs
> -----
>
>   include/mesos/scheduler.hpp 2e4707e
>   src/master/master.hpp 7649737
>   src/master/master.cpp 77872ec
>   src/messages/messages.proto 922a8c4
>   src/slave/slave.cpp 2d21e16
>   src/tests/fault_tolerance_tests.cpp 60e06cc
>   src/tests/mesos.hpp d7bdaee
>
> Diff: https://reviews.apache.org/r/16724/diff/
>
>
> Testing
> -------
>
> make check; manually failed-over a master, watched the slave reregister its
> completed frameworks, web UI shows completed tasks and stdout/stderr.
> Added a new unit/integration test to verify the expected behavior.
>
>
> Thanks,
>
> Adam B
>