> On Oct. 10, 2014, 11:24 a.m., Vinod Kone wrote: > > src/slave/slave.cpp, line 1413 > > <https://reviews.apache.org/r/23912/diff/8/?file=716689#file716689line1413> > > > > It is weird to me that you remove the task here but (potentially) > > remove the executor up in _runTask(). It's not obvious to me why you made > > that choice. If there is a good reason, please add a comment here. > > Bernd Mathiske wrote: > The task is removed in killTask(), because later on the information that > it should be removed is no longer easily at hand. > > I looked at lines 1194, 1195 in _runTask(): > Framework* framework = getFramework(frameworkId); > CHECK_NOTNULL(framework); > > If I removed the framework earlier than this, this check would fire. If I > want to avoid that, I need to write extra code (e.g. prevent _runTask() from > happening). Assembling all framework removals in the second half of > _runTask() looks like good defensive practice to me. > > There is no harm in removing the framework in _runTask() vs. killTask() > since _runTask() must eventually happen. But by waiting with removing the > framework until _runTask(), we do not need to think about what implications > concurrent starting and killing of multiple tasks with the same executor > might have. > > Vinod Kone wrote: > I don't think I follow what you mean by > > "The task is removed in killTask(), because later on the information that > it should be removed is no longer easily at hand" > > and > > "...we do not need to think about what implications concurrent starting > and killing of multiple tasks with the same executor might have" > > > Anyway, here's why I think we should remove the framework in killTask(). > > It doesn't require non-local reasoning. Right now, someone reading the > killTask() method should know that a pending task means a "_runTask()" > dispatch is in the queue and that the executor and framework will be removed > in '_runTask()'. Instead if killTask() can remove the framework right away, > it's easy to understand. > > It is much simple for methods that take "FrameworkID" (instead of > Framework*) to check if the framework exists. If the framework doesn't exist, > for whatever reason, it could just bail. > > I think these semantics would make future contributors' life easy when > making changes to these methods. > > Changing the CHECK in '_runTask()' to an if condition doesn't seem too > bad to me. > > ``` > _runTask() > { > if (framework == NULL) { > LOG(WARNING) << "Ignoring run task " << task........ > << " because the framework does not exist"; > > return; > } > } > ``` > > Makes sense? > > Bernd Mathiske wrote: > This is more confusing for the user. He will then ask himself "why does > the framework not exist"? And then have to track back in the log to find that > the task has been killed. Making it simpler for users trumps making it > simpler for us Mesos developers IMHO.
I tried your approach and it worked after moving the statusUpdate in killTask ahead of removing the framework. - Bernd ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23912/#review56016 ----------------------------------------------------------- On Oct. 15, 2014, 1:06 a.m., Bernd Mathiske wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/23912/ > ----------------------------------------------------------- > > (Updated Oct. 15, 2014, 1:06 a.m.) > > > Review request for mesos and Vinod Kone. > > > Bugs: MESOS-947 > https://issues.apache.org/jira/browse/MESOS-947 > > > Repository: mesos-git > > > Description > ------- > > Fixes MESOS-947 "Slave should properly handle a killTask() that arrives > between runTask() and _runTask()". > > Slave::killTask() did not check for task in question combination to be > "pending" (i.e. Slave::runTask had happened, but Slave::_runTask had not yet) > and then erroneously assumed that Slave::runTask() had not been executed. The > task was then marked "LOST" instead of "KILLED". But Slave::runTask had > already scheduled Slave::_runTask to follow. Now the entry for being > "pending" is removed, and the task is marked "KILLED", and _runTask gets > informed about this. It checks whether the task in question is currently > "pending" and if it is not, then it infers that the task has been killed and > does not erroneously try to complete launching it. > > > Diffs > ----- > > src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c > src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 > src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 > src/tests/mesos.cpp 3dcb2acd5ad4ab5e3a7b4fe524ee077558112773 > src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f > > Diff: https://reviews.apache.org/r/23912/diff/ > > > Testing > ------- > > Wrote a unit test that reliably created the situation described in the > ticket. Observed that TASK_LOST and the listed log output occurred. This > pointed directly to the lines in killTask() where the problem is rooted. Ran > the test after fixing, it succeeded. Checked the log. It looks like a "clean > kill" now :-) > > > Thanks, > > Bernd Mathiske > >
